1. Field of the Invention
The present invention relates to a method of and apparatus for transforming a speech feature vector, and more particularly, to a method of and apparatus for transforming speech feature vectors using an auto-associative neural network (AANN).
2. Description of the Related Art
Although the application fields of speech recognition technology have expanded to information electronic appliances, computers, and high-density telephony servers, the variation of recognition performance with environmental factors hinders further expansion of speech recognition technology into practical use.
To address the variation of speech recognition performance caused by surrounding noise, much research has been conducted on techniques for linearly or non-linearly transforming a conventional mel-frequency cepstral coefficient (MFCC) feature vector based on the temporal characteristics of a speech feature vector during a speech feature vector extraction process, which is the first stage in speech recognition.
For example, conventional transformation algorithms based on the temporal characteristics of a feature vector, such as cepstral mean subtraction and mean-variance normalization, were disclosed in “On Real-Time Mean-Variance Normalization of Speech Recognition Features (ICASSP, 2006, pp. 773-776)” by P. Pujol, D. Macho, and C. Nadeu. A relative spectral algorithm (RASTA) was disclosed in “Data-Driven RASTA Filters in Reverberation (ICASSP, 2000, pp. 1627-1630)” by M. L. Shire et al. Histogram normalization was disclosed in “Quantile Based Histogram Equalization for Noise Robust Large Vocabulary Speech Recognition (IEEE Trans. Audio, Speech, Language Processing, vol. 14, no. 3, pp. 845-854)” by F. Hilger and H. Ney. An augmenting delta feature algorithm was disclosed in “On the Use of High Order Derivatives for High Performance Alphabet Recognition (ICASSP, 2002, pp. 953-956)” by J. di Martino.
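As an illustration of the first two techniques named above, the following is a minimal sketch of cepstral mean subtraction and mean-variance normalization. It assumes the statistics are computed over one complete utterance, which is a simplification and not the real-time procedure of the cited paper. The function names and the `eps` guard are hypothetical and chosen only for this example.

```python
import math


def cepstral_mean_subtract(frames):
    """Subtract the per-coefficient mean, computed over the utterance.

    frames: list of feature vectors (lists of floats), one per time frame.
    """
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]


def mean_variance_normalize(frames, eps=1e-8):
    """Scale each coefficient to zero mean and (approximately) unit variance.

    eps is an assumed guard against division by zero for constant coefficients.
    """
    n = len(frames)
    dim = len(frames[0])
    centered = cepstral_mean_subtract(frames)
    # Population standard deviation of each coefficient over the utterance.
    stds = [
        math.sqrt(sum(f[d] ** 2 for f in centered) / n) + eps
        for d in range(dim)
    ]
    return [[f[d] / stds[d] for d in range(dim)] for f in centered]
```

Both functions operate on each coefficient independently along the time axis, which is why they are grouped with the transformations based on temporal characteristics.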
Techniques for linearly transforming feature vectors, such as methods for transforming feature data in a temporal frame using linear discriminant analysis (LDA) and principal component analysis (PCA), were disclosed in “Optimization of Temporal Filters for Constructing Robust Features in Speech Recognition (IEEE Trans. Audio, Speech and Language Processing, vol. 14, No. 3, 2006, pp. 808-832)” by Jeih-Weih Hung et al.
Techniques using a non-linear neural network, such as a temporal pattern (TRAP) algorithm, were disclosed in “Temporal Patterns in ASR of Noisy Speech (ICASSP, 1999, pp. 289-292)” by H. Hermansky and S. Sharma, and automatic speech attribute transcription (ASAT) was disclosed in “A Study on Knowledge Source Integration for Candidate Rescoring in Automatic Speech Recognition (ICASSP, 2006, pp. 837-840)” by Jinyu Li, Yu Tsao, and Chin-Hui Lee.
FIG. 1 is a block diagram of feature vector transformation using a TRAP algorithm according to the prior art.
Referring to FIG. 1, a log critical-band energy feature vector 100 extracted from a speech signal passes through preprocessors 105 and 115, one for each band, and then through a nonlinear network of multi-layer perceptrons (MLPs) 110 and 120, thereby being transformed into a feature vector used in a recognition apparatus. The target value of each of the MLPs 110 and 120 is given by the phonemic class of the transformation frame. For example, when 39 phonemes are recognized, the target value is set to 1 only for the output neuron whose phonemic class corresponds to the transformation frame and is set to −1 for the other 38 output neurons. In other words, phonemic class information for each frame is given as the target value. Each of the MLPs 110 and 120 has 39 output values, which are input to an integrated MLP 130. The target value of the integrated MLP 130 is likewise set using the phonemic class information of the transformation frame. Such TRAP speech feature transformation reflects longer temporal correlations than a conventional MFCC feature vector does, but it can be used only when phonemic class information for each frame is provided in advance.
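The target encoding and the band-wise input described for FIG. 1 can be sketched as follows. This is an illustrative sketch only: the function names, the `half_width` context parameter, and the clamping of frame indices at the utterance edges are assumptions that do not come from the cited TRAP reference.

```python
def trap_targets(phoneme_index, num_phonemes=39):
    """Return the per-frame target vector: +1 for the frame's phoneme, -1 elsewhere."""
    if not 0 <= phoneme_index < num_phonemes:
        raise ValueError("phoneme_index out of range")
    return [1.0 if k == phoneme_index else -1.0 for k in range(num_phonemes)]


def band_trajectory(log_energies, band, center, half_width):
    """Collect one band's log energy over the frames around `center`.

    log_energies: list of frames, each a list of log critical-band energies.
    Frame indices beyond the utterance edges are clamped (an assumption).
    """
    last = len(log_energies) - 1
    return [
        log_energies[min(max(t, 0), last)][band]
        for t in range(center - half_width, center + half_width + 1)
    ]
```

Each band-specific MLP would be trained on pairs of a trajectory from `band_trajectory` and a target from `trap_targets`, which makes clear why a phoneme label is needed for every frame.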
In general, a speech database has no phonemic class information for each frame. As a result, to apply the TRAP algorithm, a recognition model is first formed using a conventional MFCC feature vector, and phoneme transcription is then performed for each frame by forced alignment. However, the recognition model may contain errors, so errors are likely to occur in the per-frame phoneme transcription. For this reason, there is a high probability that a neural network of the TRAP algorithm learns wrong target values.
FIG. 2 is a block diagram of a feature vector transformation using an ASAT algorithm according to the prior art.
Like the TRAP algorithm illustrated in FIG. 1, the ASAT algorithm generates a feature vector through a two-stage non-linear neural network of MLPs. Unlike the TRAP algorithm, which uses an MLP corresponding to each band, the ASAT algorithm inputs adjacent frame vectors 200 to the neural network, and the target value of first-stage MLPs 210 and 215 is given by phonemic class information of a transformation frame. Specifically, phonemes are classified into 15 classes (vowel, stop, fricative, approximant, nasal, low, mid, high, dental, labial, coronal, palatal, velar, glottal, and silence), and a target value corresponding to the phonemic class of the transformation frame is set. A result from the first-stage MLPs 210 and 215 is input to a second-stage integrated MLP 220, and a target value of the integrated MLP 220 is set to a phoneme of the transformation frame.
Thus, like the TRAP algorithm illustrated in FIG. 1, the ASAT algorithm uses a phoneme target value for each frame, and phonemic information of each frame is required for learning of the neural network.
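The ASAT input described with reference to FIG. 2, in which adjacent frame vectors are fed to the first-stage MLPs, can be sketched as a simple frame-stacking step. This is only a sketch under stated assumptions: the function name, the `context` parameter, and the clamping of indices at the utterance edges are hypothetical, since the number of adjacent frames is not specified here.

```python
def stack_adjacent_frames(frames, center, context):
    """Concatenate the frame vectors from center-context to center+context.

    frames: list of feature vectors (lists of floats), one per time frame.
    Indices outside the utterance are clamped to the first or last frame
    (an assumed edge-handling policy).
    """
    last = len(frames) - 1
    stacked = []
    for t in range(center - context, center + context + 1):
        stacked.extend(frames[min(max(t, 0), last)])
    return stacked
```

In contrast to the band-wise TRAP input, the stacked vector keeps all coefficients of each neighboring frame, so its length is the frame dimension multiplied by `2 * context + 1`.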