This application claims the priority of Korean Patent Application No. 10-2007-0018666, filed on Feb. 23, 2007, in the Korean Intellectual Property Office, the disclosure of which is incorporated herein in its entirety by reference.
1. Field of the Invention
The present invention relates to speech recognition, and more particularly, to a multi-stage speech recognition apparatus and method, which rescore a plurality of candidate words obtained from initial recognition using a temporal posterior feature vector.
2. Description of the Related Art
Currently, speech recognition technology is gradually expanding its application range from personal mobile terminals to information electronic appliances, computers, and high-capacity telephony servers. However, recognition performance that varies unstably with the surrounding environment is the biggest obstacle to applying speech recognition technology to a wider range of real-life products.
In order to reduce the instability of speech recognition performance caused by, for example, noise generated in the surrounding environment, diverse studies are being conducted on technologies that linearly or non-linearly convert conventional mel-frequency cepstral coefficient (MFCC) feature vectors, in consideration of their temporal features, during the speech feature vector extraction process, which is the first stage of speech recognition.
Conventional conversion algorithms, which take into consideration temporal features of feature vectors, include cepstral mean subtraction, mean-variance normalization disclosed in “On Real-Time Mean-Variance Normalization of Speech Recognition Features,” P. Pujol, D. Macho and C. Nadeu, ICASSP, 2006, pp. 773-776, a RelAtive SpecTrAl (RASTA) algorithm disclosed in “Data-Driven RASTA Filters in Reverberation,” M. L. Shire et al., ICASSP, 2000, pp. 1627-1630, histogram normalization disclosed in “Quantile Based Histogram Equalization for Noise Robust Large Vocabulary Speech Recognition,” F. Hilger and H. Ney, IEEE Trans. Audio, Speech, Language Processing, vol. 14, no. 3, pp. 845-854, and an augmenting delta feature disclosed in “On the Use of High Order Derivatives for High Performance Alphabet Recognition,” J. di Martino, ICASSP, 2002, pp. 953-956.
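As a purely illustrative aid, the two simplest of these conversions, cepstral mean subtraction and mean-variance normalization, can be sketched as follows. This is a minimal sketch and not the implementation of any cited reference: the function names, the per-utterance (non-real-time) formulation, the small constant eps, and the toy feature values are all assumptions introduced here for illustration.

```python
from statistics import fmean, pstdev

def cepstral_mean_subtraction(frames):
    """Subtract each coefficient's mean over time.

    frames: list of feature vectors (one list of floats per frame).
    """
    means = [fmean(column) for column in zip(*frames)]
    return [[x - m for x, m in zip(frame, means)] for frame in frames]

def mean_variance_normalization(frames, eps=1e-8):
    """Normalize each coefficient to zero mean and (about) unit variance over time.

    eps is a hypothetical guard against division by zero for constant coefficients.
    """
    columns = list(zip(*frames))
    means = [fmean(column) for column in columns]
    stds = [pstdev(column) for column in columns]
    return [
        [(x - m) / (s + eps) for x, m, s in zip(frame, means, stds)]
        for frame in frames
    ]

# Hypothetical toy input: 4 frames of 2 cepstral coefficients each.
toy_features = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0], [4.0, 16.0]]

cms = cepstral_mean_subtraction(toy_features)
mvn = mean_variance_normalization(toy_features)
```

In this sketch both operations use statistics gathered over the whole utterance, so each coefficient trajectory is shifted (and, for mean-variance normalization, also scaled) according to its own temporal behavior rather than frame by frame.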
Conventional technologies for linearly converting feature vectors include methods of converting feature data in temporal frames using linear discriminant analysis (LDA) and principal component analysis (PCA) disclosed in “Optimization of Temporal Filters for Constructing Robust Features in Speech Recognition,” Jeih-Weih Hung et al., IEEE Trans. Audio, Speech, and Language Processing, vol. 14, no. 3, 2006, pp. 808-832.
Conventional conversion methods using non-linear neural networks include a TempoRAl Pattern (TRAP) algorithm disclosed in “Temporal Patterns (TRAPS) in ASR of Noisy Speech,” H. Hermansky and S. Sharma, ICASSP, 1999, pp. 289-292, and automatic speech attribute transcription (ASAT) disclosed in “A Study on Knowledge Source Integration for Candidate Rescoring in Automatic Speech Recognition,” Jinyu Li, Yu Tsao and Chin-Hui Lee, ICASSP, 2005, pp. 837-840.