## Speech Model Tests

Sample test data is provided for the speech-related models in TensorFlow Lite
so that users working with speech models can verify and test them.

For the hotword, speaker-id, and automatic speech recognition sample models, the
architecture assumes that the models receive their input from a speech
pre-processing module. The pre-processing module receives the audio signal,
applies typical signal processing algorithms such as the FFT and spectral
subtraction, and produces features for the encoder neural network: a log-mel
filterbank (the log of the triangular mel filters applied to the power
spectra). The text-to-speech model assumes that its inputs are linguistic
features describing characteristics of phonemes, syllables, words, phrases, and
sentences. Its outputs are acoustic features, including mel-cepstral
coefficients, log fundamental frequency, and band aperiodicity.
The pre-processing modules for these models are not provided in the open source
version of TensorFlow Lite.
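
The log-mel filterbank computation described above can be sketched in a few
lines of numpy. The frame size, FFT size, and the 40-bin output (matching the
hotword and endpointer input sizes) are illustrative assumptions, not the exact
parameters of the internal frontend:

```python
import numpy as np

def log_mel_filterbank(frame, sample_rate=16000, num_bins=40, fft_size=512):
    """Compute log-mel features for one audio frame (illustrative sketch)."""
    # Power spectrum of the windowed frame.
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed, n=fft_size)) ** 2

    # Triangular mel filters spaced evenly on the mel scale.
    def hz_to_mel(hz):
        return 2595.0 * np.log10(1.0 + hz / 700.0)

    def mel_to_hz(mel):
        return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2),
                             num_bins + 2)
    bin_edges = np.floor(
        (fft_size + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)

    filters = np.zeros((num_bins, fft_size // 2 + 1))
    for i in range(num_bins):
        left, center, right = bin_edges[i], bin_edges[i + 1], bin_edges[i + 2]
        for j in range(left, center):
            filters[i, j] = (j - left) / max(center - left, 1)
        for j in range(center, right):
            filters[i, j] = (right - j) / max(right - center, 1)

    # Log of the mel filterbank energies (a small floor avoids log(0)).
    return np.log(filters @ spectrum + 1e-6)
```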

The following sections describe the architecture of the sample models at a high
level:

### Hotword Model

The hotword model is the neural network model we use for keyphrase/hotword
spotting (i.e. "okgoogle" detection). It is the entry point for voice
interaction (e.g. the Google search app on Android devices, Google Home, etc.).
The block diagram of the speech hotword model is shown in the figure below. It
has an input size of 40 (float), an output size of 7 (float), one SVDF layer,
and four fully connected layers, with the corresponding parameters shown in the
figure.

![hotword_model](hotword.svg "Hotword model")
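
The SVDF layer mentioned above factors a filter over a sliding window of input
frames into a per-unit feature-dimension vector and a time-dimension vector. A
simplified rank-1 numpy sketch follows; the weight layout and the ReLU
activation are illustrative assumptions, not the exact TFLite kernel:

```python
import numpy as np

def svdf_rank1(frames, feature_weights, time_weights, bias):
    """Rank-1 SVDF over a window of input frames (illustrative sketch).

    frames:          [num_frames, input_size]
    feature_weights: [num_units, input_size]
    time_weights:    [num_units, memory_size]
    bias:            [num_units]
    """
    memory_size = time_weights.shape[1]
    # Stage 1: project every frame through the per-unit feature filters.
    activations = frames @ feature_weights.T        # [num_frames, num_units]
    # Stage 2: filter the last `memory_size` activations along time.
    window = activations[-memory_size:]             # [memory_size, num_units]
    output = np.sum(window * time_weights.T, axis=0) + bias
    return np.maximum(output, 0.0)                  # ReLU (assumed activation)
```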

### Speaker-id Model

The speaker-id model is the neural network model we use for speaker
verification. It runs after the hotword triggers. The block diagram of the
speaker-id model is shown in the figure below. It has an input size of 80
(float), an output size of 64 (float), three LSTM layers, and one fully
connected layer, with the corresponding parameters shown in the figure.

![speakerid_model](speakerid.svg "Speaker-id model")

### Text-to-speech (TTS) Model

The text-to-speech model is the neural network model used to generate speech
from text. Its block diagram is shown in the figure below. It has an input size
of 334 (float), an output size of 196 (float), two fully connected layers,
three LSTM layers, and one recurrent layer, with the corresponding parameters
shown in the figure.

![tts_model](tts.svg "TTS model")

### Automatic Speech Recognizer (ASR) Acoustic Model (AM)

The acoustic model for automatic speech recognition is the neural network model
that matches phonemes to the input audio features. It generates posterior
probabilities of phonemes from speech frontend features (log-mel filterbanks).
It has an input size of 320 (float), an output size of 42 (float), five LSTM
layers, and one fully connected layer with a softmax activation function, with
the corresponding parameters shown in the figure.

![asr_am_model](asr_am.svg "ASR AM model")

### Automatic Speech Recognizer (ASR) Language Model (LM)

The language model for automatic speech recognition is the neural network model
for predicting the probability of a word given the previous words in a
sentence. It generates posterior probabilities of the next word from a sequence
of words. The words are encoded as indices in a fixed-size dictionary.
The model has two inputs, both of size one (integer): the current word index
and the next word index. It has an output size of one (float): the log
probability. It consists of three embedding layers and three LSTM layers,
followed by a multiplication, a fully connected layer, and an addition.
The corresponding parameters are shown in the figure.

![asr_lm_model](asr_lm.svg "ASR LM model")
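
Given the two-integer-input, one-log-probability-output interface described
above, a sentence can be scored by summing the model's output over consecutive
word pairs. The sketch below uses a toy bigram table as a stand-in for the
actual model; `sentence_log_prob` and the table values are hypothetical:

```python
import math

def sentence_log_prob(word_ids, pair_log_prob):
    """Score a sentence by summing log P(next | current) over consecutive
    word-index pairs, mirroring the model's two-integer-input interface."""
    total = 0.0
    for current, nxt in zip(word_ids, word_ids[1:]):
        total += pair_log_prob(current, nxt)
    return total

# Toy stand-in for the TFLite model: a fixed bigram table (hypothetical values).
table = {(0, 1): math.log(0.5), (1, 2): math.log(0.25)}
score = sentence_log_prob([0, 1, 2], lambda c, n: table[(c, n)])
```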

### Endpointer Model

The endpointer model is the neural network model for predicting the end of
speech in an utterance. More precisely, it generates posterior probabilities of
various events that allow detection of speech start and end events.
It has an input size of 40 (float), which are speech frontend features
(log-mel filterbanks), and an output size of four, corresponding to:
speech, intermediate non-speech, initial non-speech, and final non-speech.
The model consists of a convolutional layer, followed by a fully connected
layer, two LSTM layers, and two additional fully connected layers.
The corresponding parameters are shown in the figure.

![endpointer_model](endpointer.svg "Endpointer model")
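
One common way to turn such per-frame posteriors into an end-of-speech decision
is to wait until the final-non-speech probability stays above a threshold for a
few consecutive frames. The sketch below is illustrative only; the threshold,
class index, and frame count are assumptions, not the actual decision logic
used with this model:

```python
def detect_speech_end(posterior_frames, final_nonspeech_idx=3,
                      threshold=0.8, min_frames=3):
    """Return the first frame index at which the final-non-speech posterior
    stays above `threshold` for `min_frames` consecutive frames, else None.
    All three parameters are illustrative choices, not the model's own."""
    run = 0
    for i, frame in enumerate(posterior_frames):
        if frame[final_nonspeech_idx] > threshold:
            run += 1
            if run >= min_frames:
                return i - min_frames + 1
        else:
            run = 0
    return None
```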


## Speech models test input/output generation

As mentioned above, the inputs to the models are generated from a
pre-processing module (the output of a log-mel filterbank, or linguistic
features), and the expected outputs are generated by running the equivalent
TensorFlow model with the same inputs.
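
This golden-data pattern, generating (input, expected output) pairs from the
reference TensorFlow model and checking the TensorFlow Lite model against them
within a tolerance, can be sketched as follows (the function names and the
tolerance are illustrative):

```python
import numpy as np

def generate_golden_data(reference_model, inputs):
    """Produce (input, expected_output) pairs from a reference model."""
    return [(x, reference_model(x)) for x in inputs]

def check_against_golden(candidate_model, golden, atol=1e-5):
    """Verify a candidate (e.g. TFLite) model reproduces the golden outputs
    within an absolute tolerance."""
    return all(np.allclose(candidate_model(x), y, atol=atol)
               for x, y in golden)
```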

## Link to the open source code

### Models

[Speech hotword model (Svdf
rank=1)](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_hotword_model_rank1_2017_11_14.tflite)

[Speech hotword model (Svdf
rank=2)](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_hotword_model_rank2_2017_11_14.tflite)

[Speaker-id
model](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_speakerid_model_2017_11_14.tflite)

[TTS
model](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_tts_model_2017_11_14.tflite)

[ASR AM
model](https://storage.googleapis.com/download.tensorflow.org/models/tflite/speech_terse_am_model_2017_11_14.tflite)

### Test benches

[Speech hotword model
test](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/models/speech_hotword_model_test.cc)

[Speaker-id model
test](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/models/speech_speakerid_model_test.cc)

[TTS model
test](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/models/speech_tts_model_test.cc)

[ASR AM model
test](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/models/speech_asr_am_model_test.cc)

[ASR LM model
test](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/models/speech_asr_lm_model_test.cc)

[Endpointer model
test](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/models/speech_endpointer_model_test.cc)

## Android Support
The models have been tested on Android phones, using the following tests:

[Hotword](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/android/BUILD?rcl=172930882&l=25)

[Speaker-id](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/android/BUILD?rcl=172930882&l=36)