Unless otherwise indicated herein, the materials described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
Computerized speech recognition can be broken down into a series of procedures. One procedure is to convert a stream of “acoustic features”, or sampled and filtered speech data, to a stream of phonemes, which are then recognized as words.
Each acoustic feature can represent one or more samples of speech. For example, a fixed duration of speech can be sampled at fixed intervals of time; e.g., every 10-30 milliseconds. The sample can be transformed into a set of mel-frequency cepstral coefficients (MFCCs) using well-known techniques. The set of MFCCs corresponding to one sample of speech can be considered to be one acoustic feature.
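The transformation of one windowed sample of speech into a set of MFCCs can be sketched as follows. This is a minimal illustration of the standard pipeline (window, power spectrum, mel filterbank, log, DCT); the parameter values, function names, and windowing choices are illustrative assumptions, not details drawn from the device described above.

```python
import numpy as np

def mfcc_frame(samples, sample_rate=16000, n_mfcc=13, n_fft=512, n_mels=26):
    """Compute one set of MFCCs for a single frame of speech samples.

    Illustrative sketch only; parameter defaults are common choices,
    not values specified in the text.
    """
    # Apply a Hamming window and take the power spectrum.
    windowed = samples * np.hamming(len(samples))
    spectrum = np.abs(np.fft.rfft(windowed, n_fft)) ** 2

    # Helpers for the mel scale.
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Build a triangular mel filterbank spanning 0 Hz to the Nyquist frequency.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fbank[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i, k] = (right - k) / max(right - center, 1)

    # Log mel energies, then a type-II DCT; keep the first n_mfcc coefficients.
    energies = np.log(fbank @ spectrum + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), (2 * n + 1)) / (2 * n_mels))
    return dct @ energies
```

With a 25 millisecond frame at a 16 kHz sampling rate (400 samples), `mfcc_frame` returns 13 coefficients, matching the 13-feature sets shown in FIG. 1.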
Typically, a group of acoustic features is concatenated into a vector. For example, one sample of speech can be transformed into a set of MF MFCCs, where MF could be in the range of 10 to 15. A collection of NF acoustic vectors can be combined to form a feature vector z with MF*NF entries. A “frame” or subset of acoustic features y0 can be generated using linear combinations of the features in the feature vector z. For example, if MF=13 and NF=9, then z is a vector with 117 entries, and each entry in the frame y0 can be selected from linear combinations of the features in feature vector z, perhaps using Linear Discriminant Analysis (LDA). In a typical example, the frame y0 has 39 entries representing linear combinations of the 117 entries in z.
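The concatenation and projection steps above can be sketched numerically. In this sketch the projection matrix A is a random placeholder standing in for a matrix that would, in practice, be estimated with LDA from labeled training data; the random inputs are likewise illustrative.

```python
import numpy as np

MF, NF, FRAME_DIM = 13, 9, 39   # dimensions from the example above

rng = np.random.default_rng(0)

# NF acoustic features, each a set of MF MFCCs (random placeholders here).
mfcc_sets = [rng.standard_normal(MF) for _ in range(NF)]

# Concatenate into the feature vector z with MF*NF = 117 entries.
z = np.concatenate(mfcc_sets)

# Stand-in for an LDA projection matrix; a real system would learn this
# from training data rather than draw it at random.
A = rng.standard_normal((FRAME_DIM, MF * NF))

# Frame y0: 39 entries, each a linear combination of the 117 entries in z.
y0 = A @ z
print(z.shape, y0.shape)  # (117,) (39,)
```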
FIG. 1 is a block diagram of a prior art speech recognition device 100. FIG. 1 shows speech recognition device 100 receiving sample utterance 110, which is processed by digital signal processor (DSP) 120 of speech recognition device 100 into a set of MFCCs 130. In some contexts, each coefficient of MFCCs 130 can be termed a “feature”; e.g., FIG. 1 shows MFCCs 130 as a set of 13 features.
Speech recognition device 100 can concatenate two or more sets of MFCCs to generate feature vector z 140. FIG. 1 shows that NF=9 sets of MFCCs z0 141, z1 142 . . . zNF-1 149 are concatenated to form feature vector z 140 with a total of 117 features. Then, speech recognition device 100 can form frame y0 152 with 39 features using LDA technique 150 to select certain features (shown in black) from feature vector z 140.
Experimentation has indicated that “displacing” frame y0 152 in the feature space to generate a displaced vector can lead to better performance in recognizing utterance 110. Current techniques model this displacement, shown in FIG. 1 as displacement 154, as a matrix-vector product between a displacement matrix M0 and a vector h0 that is based on frame y0 152; i.e., h0=φh(y0). Typically, φh(y0) is a radial basis function operating on frame y0 152. A radial basis function operating on a vector y0 is a function whose value depends on y0's distance, such as the Mahalanobis distance, from a designated center point c.
Displaced frame x′ 156 can be written as:

x′=y0+M0h0=y0+M0φh(y0)  (2)
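The displacement of equation (2) can be sketched as follows. The Gaussian kernel, the number and location of the radial basis function centers, the identity covariance, and the small random M0 are all illustrative placeholders; in a trained system M0 would be learned from data as described below.

```python
import numpy as np

D, K = 39, 5   # frame dimension, number of RBF centers (K is illustrative)

rng = np.random.default_rng(1)
y0 = rng.standard_normal(D)                # frame y0 (placeholder values)
centers = rng.standard_normal((K, D))      # designated center points c_i
cov_inv = np.eye(D)                        # inverse covariance for Mahalanobis distance

def phi_h(y, centers, cov_inv):
    """Gaussian radial basis functions of the Mahalanobis distance to each center."""
    diffs = centers - y                                    # (K, D)
    d2 = np.einsum('kd,de,ke->k', diffs, cov_inv, diffs)   # squared distances
    return np.exp(-0.5 * d2)

h0 = phi_h(y0, centers, cov_inv)           # h0 = phi_h(y0), one value per center
M0 = 0.01 * rng.standard_normal((D, K))    # displacement matrix (learned in practice)
x_prime = y0 + M0 @ h0                     # displaced frame x', per equation (2)
```

Because each entry of h0 depends only on the Mahalanobis distance from y0 to a center, the displacement M0 @ h0 varies smoothly over the feature space.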
The matrix M0 can be learned on training data using a procedure such as that taught in Povey, Kingsbury, Mangu, Saon, Soltau, and Zweig. Speech recognition device 100 can then provide displaced frame x′ 156 to speech recognizer (SR) 160, which can take displaced frame x′ 156 as an input and, utilizing well-known speech recognition techniques, generate recognized speech (RS) 170 as a corresponding output.