# Signal Processing (contrib)
[TOC]

@{tf.contrib.signal} is a module for signal processing primitives. All
operations have GPU support and are differentiable. This module is especially
helpful for building TensorFlow models that process or generate audio, though
the techniques are useful in many domains.

## Framing variable length sequences

When dealing with variable-length signals (e.g. audio), it is common to "frame"
them into multiple fixed-length windows. These windows can overlap if the `step`
of the frame is less than the frame length. @{tf.contrib.signal.frame} does
exactly this. For example:

```python
# A batch of float32 time-domain signals in the range [-1, 1] with shape
# [batch_size, signal_length]. Both batch_size and signal_length may be unknown.
signals = tf.placeholder(tf.float32, [None, None])

# Compute a [batch_size, ?, 128] tensor of fixed length, overlapping windows
# where each window overlaps the previous by 75% (frame_length - frame_step
# samples of overlap).
frames = tf.contrib.signal.frame(signals, frame_length=128, frame_step=32)
```
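
As a rough sanity check (a minimal sketch, assuming TensorFlow 1.x graph mode and NumPy; the signal values are arbitrary), an un-padded signal of length `signal_length` yields `1 + (signal_length - frame_length) // frame_step` frames:

```python
import numpy as np

with tf.Session() as sess:
  example = np.random.uniform(-1.0, 1.0, size=(1, 1024)).astype(np.float32)
  result = sess.run(frames, feed_dict={signals: example})
  # 1 + (1024 - 128) // 32 = 29 frames of 128 samples each.
  print(result.shape)  # (1, 29, 128)
```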

The `axis` parameter to @{tf.contrib.signal.frame} allows you to frame tensors
with inner structure (e.g. a spectrogram):

```python
# `magnitude_spectrograms` is a [batch_size, ?, 129] tensor of spectrograms. We
# would like to produce overlapping, fixed-size spectrogram patches, for example
# to feed a model that requires fixed-size inputs.
magnitude_spectrograms = tf.abs(tf.contrib.signal.stft(
    signals, frame_length=256, frame_step=64, fft_length=256))

# `spectrogram_patches` is a [batch_size, ?, 64, 129] tensor containing a
# variable number of [64, 129] spectrogram patches per batch item.
spectrogram_patches = tf.contrib.signal.frame(
    magnitude_spectrograms, frame_length=64, frame_step=16, axis=1)
```

## Reconstructing framed sequences and applying a tapering window

@{tf.contrib.signal.overlap_and_add} can be used to reconstruct a signal from a
framed representation. For example, the following code reconstructs the signal
produced in the preceding example:

```python
# Reconstructs `signals` from `frames` produced in the above example. However,
# the magnitude of `reconstructed_signals` will be greater than `signals`.
reconstructed_signals = tf.contrib.signal.overlap_and_add(frames, frame_step=32)
```

Note that because `frame_step` is 25% of `frame_length` in the above example,
each sample is covered by four overlapping frames, so the resulting
reconstruction has a greater magnitude than the original `signals`. To
compensate for this, we can use a tapering window function. If the window
function satisfies the Constant Overlap-Add (COLA) property for the given frame
step, then overlap-adding the windowed frames recovers the original `signals` up
to a constant scale factor.

@{tf.contrib.signal.hamming_window} and @{tf.contrib.signal.hann_window} both
satisfy the COLA property for a 75% overlap.

```python
frame_length = 128
frame_step = 32
windowed_frames = frames * tf.contrib.signal.hann_window(frame_length)
reconstructed_signals = tf.contrib.signal.overlap_and_add(
    windowed_frames, frame_step)
```
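
To check that a window satisfies COLA for a given step (a minimal sketch using the `frame_length` and `frame_step` defined above), you can overlap-add shifted copies of the window itself; if it does, the sum is constant away from the boundary frames, and that constant is the scale factor mentioned above:

```python
# Overlap-add `num_frames` copies of the window. For a COLA window, the
# interior of `window_sum` is constant; dividing the reconstruction by this
# constant undoes the scaling introduced by windowing.
window = tf.contrib.signal.hann_window(frame_length)
num_frames = 16  # Arbitrary; just enough frames to expose the constant interior.
window_sum = tf.contrib.signal.overlap_and_add(
    tf.tile(window[tf.newaxis, :], [num_frames, 1]), frame_step)
```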

## Computing spectrograms

A spectrogram is a time-frequency decomposition of a signal that indicates its
frequency content over time. The most common approach to computing spectrograms
is to take the magnitude of the [Short-time Fourier Transform][stft] (STFT),
which @{tf.contrib.signal.stft} can compute as follows:

```python
# A batch of float32 time-domain signals in the range [-1, 1] with shape
# [batch_size, signal_length]. Both batch_size and signal_length may be unknown.
signals = tf.placeholder(tf.float32, [None, None])

# `stfts` is a complex64 Tensor representing the Short-time Fourier Transform of
# each signal in `signals`. Its shape is [batch_size, ?, fft_unique_bins]
# where fft_unique_bins = fft_length // 2 + 1 = 513.
stfts = tf.contrib.signal.stft(signals, frame_length=1024, frame_step=512,
                               fft_length=1024)

# A power spectrogram is the squared magnitude of the complex-valued STFT.
# A float32 Tensor of shape [batch_size, ?, 513].
power_spectrograms = tf.real(stfts * tf.conj(stfts))

# A magnitude spectrogram is the magnitude of the complex-valued STFT.
# A float32 Tensor of shape [batch_size, ?, 513].
magnitude_spectrograms = tf.abs(stfts)
```

You may use either a power spectrogram or a magnitude spectrogram; each has its
advantages. Note that if you apply logarithmic compression, the log power
spectrogram and the log magnitude spectrogram differ only by a constant factor
of 2.
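
This follows from `log(x**2) = 2 * log(x)`. As a minimal sketch (the stabilizing offset discussed in the next section is omitted here for clarity):

```python
# Elementwise, wherever the magnitude is nonzero, these two tensors are equal.
log_power = tf.log(power_spectrograms)
two_log_magnitude = 2.0 * tf.log(magnitude_spectrograms)
```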

## Logarithmic compression

It is common practice to apply a compressive nonlinearity such as a logarithm or
power-law compression to spectrograms. This helps to balance the importance of
detail in low and high energy regions of the spectrum, which more closely
matches human auditory sensitivity.

When compressing with a logarithm, it's a good idea to use a stabilizing offset
to avoid high dynamic ranges caused by the singularity at zero.

```python
log_offset = 1e-6
log_magnitude_spectrograms = tf.log(magnitude_spectrograms + log_offset)
```

## Computing log-mel spectrograms

When working with spectral representations of audio, the [mel scale][mel] is a
common reweighting of the frequency dimension, which results in a
lower-dimensional and more perceptually relevant representation of the audio.

@{tf.contrib.signal.linear_to_mel_weight_matrix} produces a matrix you can use
to convert a spectrogram to the mel scale.

```python
# Warp the linear-scale magnitude spectrograms into the mel scale.
num_spectrogram_bins = magnitude_spectrograms.shape[-1].value
# `sample_rate` is the sample rate of `signals`, in Hz. The value below is only
# an assumed placeholder for this example; use the actual rate of your audio.
sample_rate = 16000.0
lower_edge_hertz, upper_edge_hertz, num_mel_bins = 80.0, 7600.0, 64
linear_to_mel_weight_matrix = tf.contrib.signal.linear_to_mel_weight_matrix(
  num_mel_bins, num_spectrogram_bins, sample_rate, lower_edge_hertz,
  upper_edge_hertz)
mel_spectrograms = tf.tensordot(
  magnitude_spectrograms, linear_to_mel_weight_matrix, 1)
# Note: Shape inference for `tf.tensordot` does not currently handle this case.
mel_spectrograms.set_shape(magnitude_spectrograms.shape[:-1].concatenate(
  linear_to_mel_weight_matrix.shape[-1:]))
```
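
For the parameters above, `mel_spectrograms` has shape `[batch_size, ?, 64]`: one `num_mel_bins`-dimensional mel spectrum per STFT frame.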

If desired, compress the mel spectrogram magnitudes. For example, you may use
logarithmic compression (as discussed in the previous section).

Order matters! Compressing the spectrogram magnitudes after
reweighting the frequencies is different from reweighting the compressed
spectrogram magnitudes. According to the perceptual justification of the mel
scale, conversion from linear scale entails summing intensity or energy among
adjacent bands, i.e. it should be applied before logarithmic compression. Taking
the weighted sum of log-compressed values amounts to multiplying the
pre-logarithm values, which rarely, if ever, makes sense.

```python
log_offset = 1e-6
log_mel_spectrograms = tf.log(mel_spectrograms + log_offset)
```
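
As a toy illustration of why the order matters (a hedged sketch with made-up numbers), consider combining two linear-scale bins, one near-silent and one strong, into a single mel band:

```python
import numpy as np

energies = np.array([0.001, 1.0])
# Sum the energies first, then compress: the result reflects the dominant bin.
log_of_sum = np.log(energies.sum())   # ~= 0.0
# Compress first, then sum: equivalent to the log of the *product* of the
# energies, which lets the near-silent bin dominate the result.
sum_of_logs = np.log(energies).sum()  # ~= -6.9 == log(0.001 * 1.0)
```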

## Computing Mel-Frequency Cepstral Coefficients (MFCCs)

Call @{tf.contrib.signal.mfccs_from_log_mel_spectrograms} to compute
[MFCCs][mfcc] from log-magnitude, mel-scale spectrograms (as computed in the
preceding example):

```python
num_mfccs = 13
# Keep the first `num_mfccs` MFCCs.
mfccs = tf.contrib.signal.mfccs_from_log_mel_spectrograms(
    log_mel_spectrograms)[..., :num_mfccs]
```
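
For the log-mel spectrograms computed above, `mfccs` has shape `[batch_size, ?, 13]`: thirteen cepstral coefficients per STFT frame.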

[stft]: https://en.wikipedia.org/wiki/Short-time_Fourier_transform
[mel]: https://en.wikipedia.org/wiki/Mel_scale
[mfcc]: https://en.wikipedia.org/wiki/Mel-frequency_cepstrum