Speech recognition is the process of converting an acoustic signal to speech elements (e.g., phones, words and sentences). Speech recognition has found application in various areas ranging from telephony to vehicle navigation. In a speech recognition system, the audio signal is collected by input devices (e.g., a microphone), converted to a digital signal, and then processed using one or more algorithms to output speech elements contained in the audio signal. Depending on the field of application, the recognized speech elements can be the final results of speech recognition or intermediate information used for further processing.
Some speech recognition algorithms use acoustic models that statistically represent the sounds corresponding to each speech element. The acoustic models may be created, for example, by correlating audio samples of speech with their corresponding text scripts, a process also known as “compiling” or “acoustic model training.” To improve recognition accuracy, a language model or a grammar file may be used to constrain the words to be recognized.
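As a rough illustration of what "statistically representing sounds" can mean, the following minimal sketch fits one diagonal-covariance Gaussian per speech element from labeled feature vectors and then picks the most likely element for a new frame. The phone labels, the two-dimensional feature values, and the function names are all hypothetical; practical systems typically use richer models (for example, mixtures or neural networks over sequences of frames).

```python
import math

def fit_gaussian(frames):
    """Estimate a per-dimension mean and variance from feature vectors."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Floor the variance so the log-likelihood stays finite.
    var = [max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, 1e-6)
           for d in range(dim)]
    return mean, var

def log_likelihood(frame, model):
    """Log density of a frame under a diagonal-covariance Gaussian."""
    mean, var = model
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(frame, mean, var))

def train(labeled_frames):
    """labeled_frames maps a speech element label to its feature vectors."""
    return {label: fit_gaussian(frames)
            for label, frames in labeled_frames.items()}

def recognize(frame, models):
    """Return the label whose model gives the frame the highest likelihood."""
    return max(models, key=lambda label: log_likelihood(frame, models[label]))

# Hypothetical training data: two phones with made-up 2-D features.
training_data = {
    "aa": [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]],
    "iy": [[4.0, -1.0], [4.2, -0.8], [3.8, -1.2]],
}
models = train(training_data)
best = recognize([1.1, 2.1], models)
```

Here the labeled frames play the role of the audio samples paired with text scripts, and `train` stands in for the correlation step described above.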
During speech recognition, the acoustic models may be adapted to increase recognition accuracy. Adaptation may increase accuracy considerably, especially when there are significant mismatches between the training conditions and the conditions under which speech recognition is performed. Techniques for adapting the acoustic models include, for example, maximum likelihood linear regression (MLLR), maximum a posteriori (MAP) adaptation, maximum a posteriori linear regression (MAPLR), and Eigenvoices. Additionally, methods for normalizing the acoustic features before matching them to the acoustic models have been developed. Such methods include feature space maximum likelihood linear regression (fMLLR), feature space maximum a posteriori linear regression (fMAPLR), and vocal tract length normalization (VTLN).
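To make one of the listed techniques concrete, the sketch below shows MAP adaptation of a single Gaussian mean in its commonly used form: the adapted mean is a weighted average of the prior (trained) mean and the adaptation data, with a prior weight `tau` controlling how quickly the data overrides the prior. The function name, the value of `tau`, and the sample values are assumptions chosen for illustration, not taken from the text.

```python
def map_adapt_mean(prior_mean, frames, posteriors, tau=10.0):
    """MAP update of a Gaussian mean.

    prior_mean: mean from the original (trained) acoustic model
    frames:     adaptation feature vectors
    posteriors: per-frame occupation probability for this Gaussian
    tau:        prior weight (assumed value; larger means slower adaptation)
    """
    dim = len(prior_mean)
    occupancy = sum(posteriors)
    weighted_sum = [sum(g * f[d] for g, f in zip(posteriors, frames))
                    for d in range(dim)]
    return [(tau * prior_mean[d] + weighted_sum[d]) / (tau + occupancy)
            for d in range(dim)]

# Hypothetical example: ten adaptation frames fully assigned to this Gaussian.
prior = [0.0, 0.0]
frames = [[2.0, 4.0]] * 10
posteriors = [1.0] * 10
adapted = map_adapt_mean(prior, frames, posteriors, tau=10.0)
```

With the occupancy equal to `tau`, the adapted mean lands halfway between the prior mean and the data mean; with no adaptation data it stays at the prior. MLLR, by contrast, estimates an affine transform shared across many Gaussians, which lets it adapt models for which little or no data was observed.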