A deep neural network (DNN) is a stacked, feed-forward artificial neural network that has more than one layer of hidden units between its inputs and its outputs. The layers are composed of nodes, which are locations where computation occurs, loosely patterned on a neuron in the human brain, which fires when it encounters sufficient stimuli. A node combines input from the data with a set of coefficients, or weights, that either amplify or dampen that input, thereby assigning significance to inputs for the task the algorithm is trying to learn. A set of items, as used herein, is a group of one or more items. For example, a set of coefficients is a group of one or more coefficients. The input-weight products are summed, and the sum is passed through the node's activation function to determine whether and to what extent that signal progresses further through the network to affect the ultimate outcome. A DNN uses a cascade of many layers of nonlinear processing units for feature extraction and transformation. Each successive layer uses the output from the previous layer as input. Higher-level features can be derived from lower-level features to form a hierarchical representation. The layers following the input layer may be convolution layers that produce feature maps by filtering the inputs to each convolution layer.
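The node computation described above can be illustrated with a minimal sketch. It assumes a logistic (sigmoid) activation and an additive bias term, neither of which is specified in the paragraph above; the function names and the toy weights are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(z):
    """Logistic activation (an assumed choice): maps the summed input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def node_output(inputs, weights, bias=0.0, activation=sigmoid):
    """Sum the input-weight products, add a bias, and apply the activation."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(weighted_sum)


def forward(inputs, layers, activation=sigmoid):
    """Feed the output of each layer into the next layer.

    `layers` is a list of (weight_rows, biases) pairs, one pair per layer;
    each row of weights belongs to one node of that layer.
    """
    signal = list(inputs)
    for weight_rows, biases in layers:
        signal = [
            node_output(signal, row, bias, activation)
            for row, bias in zip(weight_rows, biases)
        ]
    return signal


# Toy network: 2 inputs -> 2 hidden nodes -> 1 output node (made-up weights).
toy_layers = [
    ([[0.5, -0.25], [1.0, 1.0]], [0.0, -1.0]),
    ([[2.0, -2.0]], [0.0]),
]
toy_result = forward([1.0, 2.0], toy_layers)
```

Here each hidden node sees the raw inputs, while the output node sees only the hidden activations, which is the layer-to-layer cascade described above in its simplest form.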
Speech can be viewed as a continuous audio stream in which substantially stable states mix with dynamically changing states. In this sequence of states, one can define classes of sounds, or phones, which are speech segments. A phone is a speech segment that possesses distinct physical or perceptual properties, considered as a physical event without regard to its place in the phonology of a language. When a phone is considered in context, the first part of the phone depends on the preceding phone, the middle part is stable, and the last part depends on the subsequent phone; such a phone in context is called a triphone. Each triphone is represented by a hidden Markov model (HMM) with several states. Many states of the HMMs are shared (tied together) among different triphones. A tied state in the triphone HMM is called a senone. Speech recognition scientists have identified several thousand senones into which all speech may be divided. An acoustic model contains acoustic properties for each senone.
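As a rough illustration of triphones and tied states, the following sketch expands a phone sequence into triphones and assigns shared state identifiers. It rests on several assumptions not stated above: three states per HMM, a "sil" boundary symbol, and a toy tying rule in which the first state depends on the left context, the middle state on the center phone alone, and the last state on the right context. Practical systems typically derive the tying from data (for example, with decision-tree clustering), so this is not a real senone inventory.

```python
def to_triphones(phones, boundary="sil"):
    """Return (left, center, right) tuples for each phone in the sequence."""
    padded = [boundary] + list(phones) + [boundary]
    return [
        (padded[i - 1], padded[i], padded[i + 1])
        for i in range(1, len(padded) - 1)
    ]


def tie_states(triphones):
    """Map each triphone to three tied-state (senone) ids using a toy rule.

    Returns (tied, senone_ids), where `tied` maps a triphone to its list of
    three senone ids and `senone_ids` maps each tying key to its id.
    """
    senone_ids = {}
    tied = {}
    for left, center, right in triphones:
        keys = [
            ("begin", left, center),   # shaped by the preceding phone
            ("middle", center),        # stable part, shared across contexts
            ("end", center, right),    # shaped by the following phone
        ]
        tied[(left, center, right)] = [
            senone_ids.setdefault(key, len(senone_ids)) for key in keys
        ]
    return tied, senone_ids


# "cat" and "bat" as hypothetical phone sequences.
cat = to_triphones(["k", "ae", "t"])
bat = to_triphones(["b", "ae", "t"])
tied, senone_ids = tie_states(cat + bat)
```

Under this toy rule, the "ae" triphones of the two words share their middle and final states but keep distinct initial states, so 18 HMM states collapse into fewer tied states.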
DNN-based acoustic models have been widely used in automatic speech recognition (ASR) and have achieved extraordinary performance. However, a speaker-independent (SI) acoustic model trained with speech data collected from a large number of speakers suffers a large degradation in ASR performance when tested with speakers not included in the training set. This degradation arises because inter-speaker variability introduces spectral variations in each speech unit, in addition to the phonetic variations characterized by the SI acoustic model.
A simple solution to the inherent inter-speaker variability in speech signals is to perform feature-space normalization over different speakers before estimating the acoustic model parameters, using techniques such as cepstral mean and variance normalization, vocal tract length normalization, and the metamorphic algorithm. Cepstrum analysis is a nonlinear signal processing technique with a variety of applications in areas such as speech and image processing. The complex cepstrum of a sequence x is calculated by finding the complex natural logarithm of the Fourier transform of x and then taking the inverse Fourier transform of the resulting sequence. A more sophisticated solution that generates acoustic models with reduced variance is to perform speaker-adaptive training (SAT).
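Mean and variance normalization of cepstral features can be sketched as follows. This is a minimal illustration that assumes the features for one speaker (or one utterance) arrive as a list of equal-length frames, and that statistics are computed over all of those frames at once; the function name and the `eps` floor are hypothetical choices, not taken from the text above.

```python
import math


def cmvn(frames, eps=1e-10):
    """Normalize each feature dimension to zero mean and unit variance.

    `frames` is a list of feature vectors (lists of floats) from one speaker
    or utterance. `eps` guards against division by zero for constant features.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    n = len(frames)
    dim = len(frames[0])
    means = [sum(frame[d] for frame in frames) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((frame[d] - means[d]) ** 2 for frame in frames) / n)
        for d in range(dim)
    ]
    return [
        [(frame[d] - means[d]) / max(stds[d], eps) for d in range(dim)]
        for frame in frames
    ]


# Two frames with two made-up coefficients each.
normalized = cmvn([[1.0, 10.0], [3.0, 30.0]])
```

Applying the same normalization separately to each speaker's data removes per-speaker offsets and scale differences before the acoustic model parameters are estimated.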
For a DNN acoustic model, factorized hidden layer, cluster adaptive training, and speaker code approaches have been proposed, in which the weights and/or the biases of the speaker-dependent (SD) affine transformation in each hidden layer of a DNN acoustic model are represented as a linear combination of SI bases, where the combination weights are low-dimensional speaker representations initialized with an i-vector. An i-vector framework is a factor analysis method for a compact representation of speaker characteristics. The framework maps every speaker utterance to a low-dimensional identity vector. Target and test i-vectors can then be compared using a cosine distance metric. In various implementations, i-vectors convey speaker characteristics among other information such as the transmission channel, the acoustic environment, or the phonetic content of a speech segment. The canonical SI bases with reduced variances are learned during adaptive training. The speaker representations for the test speaker are estimated using adaptation data and are used for testing. In SAT with learning hidden unit contributions (LHUC), a canonical speaker-adaptive DNN along with SD amplitude parameters for all the hidden units are learned during adaptive training.
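Two of the operations mentioned above can be sketched briefly: comparing i-vectors with a cosine measure, and forming an SD weight matrix as a linear combination of SI bases. The sketch assumes that the cosine distance is taken as one minus the cosine similarity, and that the SD matrix is a plain weighted sum of basis matrices; the individual approaches named above differ in detail (for example, in whether an additional SI term is included). All names and values are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Assumed convention: distance = 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


def combine_bases(bases, speaker_weights):
    """Return sum_k speaker_weights[k] * bases[k] as a new matrix.

    `bases` is a list of SI basis matrices of identical shape (lists of rows);
    `speaker_weights` is the low-dimensional speaker representation.
    """
    if len(bases) != len(speaker_weights):
        raise ValueError("need one combination weight per basis")
    rows, cols = len(bases[0]), len(bases[0][0])
    return [
        [
            sum(w * basis[r][c] for w, basis in zip(speaker_weights, bases))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Made-up 2x2 bases and a 2-dimensional speaker representation.
si_bases = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
sd_weights = combine_bases(si_bases, [0.5, 2.0])
```

Because only the short vector of combination weights is speaker-specific, adapting to a new speaker in this scheme means estimating a handful of values rather than a full set of layer weights.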