1. Field
The present invention relates to speech signal processing. More particularly, the present invention relates to a novel method and apparatus for distributed voice recognition using acoustic feature vector modification.
2. Background
Voice recognition represents one of the most important techniques to endow a machine with simulated intelligence to recognize user voiced commands and to facilitate human interface with the machine. Systems that employ techniques to recover a linguistic message from an acoustic speech signal are called voice recognition (VR) systems. FIG. 1 shows a basic VR system having a preemphasis filter 102, an acoustic feature extraction (AFE) unit 104, and a pattern matching engine 110. The AFE unit 104 converts a series of digital voice samples into a set of measurement values (for example, extracted frequency components) called an acoustic feature vector. The pattern matching engine 110 matches a series of acoustic feature vectors with the patterns contained in a VR acoustic model 112. VR pattern matching engines generally employ Viterbi decoding techniques that are well known in the art. When a series of patterns is recognized from the acoustic model 112, the series is analyzed to yield a desired format of output, such as an identified sequence of linguistic words corresponding to the input utterances.
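The processing chain of FIG. 1 may be illustrated with a minimal sketch. The function names, the pre-emphasis coefficient of 0.97, the frame length, and the use of log DFT magnitudes as features are assumptions chosen for illustration only; they are not taken from the disclosure. The decoder is a generic textbook Viterbi search over log probabilities, not the specific engine 110.

```python
import cmath
import math


def preemphasize(samples, alpha=0.97):
    """First-order pre-emphasis: y[n] = x[n] - alpha * x[n-1] (alpha is assumed)."""
    out = [samples[0]] if samples else []
    for n in range(1, len(samples)):
        out.append(samples[n] - alpha * samples[n - 1])
    return out


def extract_features(samples, frame_len=8, num_bins=4):
    """Return one feature vector per non-overlapping frame.

    Each vector holds the log magnitudes of the first num_bins DFT bins,
    a stand-in for the "extracted frequency components" of an AFE unit.
    """
    vectors = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        vec = []
        for k in range(num_bins):
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n, x in enumerate(frame))
            vec.append(math.log(abs(acc) + 1e-10))  # small floor avoids log(0)
        vectors.append(vec)
    return vectors


def viterbi(log_emit, log_trans, log_init):
    """Return the most likely state sequence.

    log_emit[t][s]  : log-likelihood of frame t under state s
    log_trans[i][j] : log-probability of moving from state i to state j
    log_init[s]     : log-probability of starting in state s
    """
    num_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(num_states)]
    back = []  # back[t-1][j] = best predecessor of state j at frame t
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for j in range(num_states):
            best_i = max(range(num_states),
                         key=lambda i: score[i] + log_trans[i][j])
            new_score.append(score[best_i] + log_trans[best_i][j] + log_emit[t][j])
            pointers.append(best_i)
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(num_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

In such a sketch, the output of preemphasize would be passed to extract_features, and the per-frame log-likelihoods of the resulting vectors under each model state would form the log_emit input of viterbi.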
The acoustic model 112 may be described as a database of acoustic feature vectors extracted from various speech sounds, together with associated statistical distribution information. These acoustic feature vector patterns correspond to short speech segments such as phonemes, tri-phones and whole-word models. “Training” refers to the process of collecting speech samples of a particular speech segment or syllable from one or more speakers in order to generate patterns in the acoustic model 112. “Testing” refers to the process of correlating a series of acoustic feature vectors extracted from end-user speech samples to the contents of the acoustic model 112. The performance of a given system depends largely upon the degree of correlation between the speech of the end-user and the contents of the database.
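The distinction between training and testing may be illustrated as follows. This sketch assumes, purely for illustration, that each pattern in the model is a single diagonal-covariance Gaussian (a mean and a variance per feature dimension); practical acoustic models are considerably richer. The names train_pattern, log_likelihood, and best_match are hypothetical.

```python
import math


def train_pattern(vectors):
    """Training: estimate a per-dimension mean and variance from sample vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    # Floor the variance so a degenerate sample set cannot cause division by zero.
    var = [max(sum((v[d] - mean[d]) ** 2 for v in vectors) / n, 1e-6)
           for d in range(dim)]
    return mean, var


def log_likelihood(vector, pattern):
    """Log-density of a feature vector under a diagonal Gaussian pattern."""
    mean, var = pattern
    return sum(-0.5 * (math.log(2 * math.pi * s) + (x - m) ** 2 / s)
               for x, m, s in zip(vector, mean, var))


def best_match(vector, model):
    """Testing: return the label of the pattern that best explains the vector."""
    return max(model, key=lambda label: log_likelihood(vector, model[label]))
```

A model built from two sets of training vectors could then be queried with best_match for each feature vector extracted from end-user speech.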
Optimally, the end-user provides speech acoustic feature vectors during both training and testing so that the acoustic model 112 will match strongly with the speech of the end-user. However, because an acoustic model 112 must generally represent patterns for a large number of speech segments, it often occupies a large amount of memory. Moreover, it is not practical to collect all the data necessary to train the acoustic models from all possible speakers. Hence, many existing VR systems use acoustic models that are trained using the speech of many representative speakers. Such acoustic models are designed to have the best performance over a broad range of users, but are not optimized for any single user. In a VR system that uses such an acoustic model, the ability to recognize the speech of a particular user will be inferior to that of a VR system using an acoustic model optimized for the particular user. For some users, such as users having a strong foreign accent, the performance of a VR system using a shared acoustic model can be so poor that they cannot effectively use VR services at all.
Adaptation is an effective method to alleviate degradations in recognition performance caused by a mismatch between training and test conditions. Adaptation modifies the VR acoustic models during testing to match the testing environment more closely. Several such adaptation schemes, such as maximum likelihood linear regression and Bayesian adaptation, are well known in the art.
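The general idea of model adaptation may be sketched as an affine transform applied to the model means, in the spirit of maximum likelihood linear regression. The sketch below is a deliberate simplification and not a full implementation of that technique: it assumes the transform matrix is given (here the identity), and it estimates only a global bias as the plain average offset between adaptation vectors and the model means to which they have already been aligned, which corresponds to the maximum likelihood estimate only under the assumption of equal variances. All names are hypothetical.

```python
def estimate_bias(adaptation_vectors, aligned_means):
    """Average offset between observed vectors and their aligned model means."""
    n = len(adaptation_vectors)
    dim = len(adaptation_vectors[0])
    return [sum(x[d] - m[d] for x, m in zip(adaptation_vectors, aligned_means)) / n
            for d in range(dim)]


def adapt_means(means, matrix, bias):
    """Apply mu' = A * mu + b to every model mean."""
    return [[sum(matrix[r][c] * mu[c] for c in range(len(mu))) + bias[r]
             for r in range(len(bias))]
            for mu in means]


# Example: shift two 2-dimensional means toward a speaker's observed vectors.
model_means = [[0.0, 0.0], [1.0, 1.0]]
observed = [[0.5, -0.5], [1.5, 0.5]]      # adaptation data, aligned one-to-one
identity = [[1.0, 0.0], [0.0, 1.0]]       # assumed transform matrix A
bias = estimate_bias(observed, model_means)
adapted_means = adapt_means(model_means, identity, bias)
```

The point of the sketch is that adaptation of this kind operates on the acoustic model itself, so every adapted mean must be stored wherever the model resides.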
As the complexity of the speech recognition task increases, it becomes increasingly difficult to accommodate the entire recognition system in a wireless device. Hence, distributed VR systems place a shared acoustic model in a central communications center, where it serves all users. The central base station is also responsible for the computationally expensive acoustic matching. Because the acoustic models in distributed VR systems are shared by many speakers, they cannot be optimized for any individual speaker. There is therefore a need in the art for a VR system that has improved performance for multiple individual users while minimizing the required computational resources.