1. Field of the Invention
The present invention relates to speech recognition and, more particularly, to a system and method of speech recognition based on pre-clustering of training models for continuous speech recognition.
2. Description of the Related Art
In an effort to provide a more usable, convenient and rapid interface for speech recognition, numerous approaches to voice and sound recognition have been attempted. However, variations in acoustic signals, even from a single speaker, present substantial signal processing difficulties and raise the possibility of errors or ambiguity in the system's understanding of commands. These errors may only be partially avoided, and then only by a substantial increase in processing complexity and response time.
One proposed technique that addresses the above-mentioned issues is described in an article by M. Padmanabhan, et al., "Speaker Clustering Transformation for Speaker Adaptation in Large-Vocabulary Speech Recognition Systems", ICASSP '96. The technique includes a speaker adaptation scheme based on a speech training corpus containing a number of training speakers, some of whom are acoustically closer to a test speaker than others. Given a test speaker, if the acoustic models are re-estimated from a subset of the training speakers who are acoustically close to the test speaker, the re-estimated models should provide a better match to the test data of the speaker. A further improvement could be obtained if the acoustic space of each of these selected speakers is transformed to come closer to the test speaker.
Given a test speaker, the adaptation procedure used in the proposed technique is: (1) find a subset of speakers from the training corpus who are acoustically close to the test speaker; (2) transform the data of each of these speakers to bring it closer to the test speaker; and (3) use only the (transformed) data from these selected speakers, rather than the complete training corpus, to re-estimate the model (Gaussian) parameters. This scheme was shown to produce better speaker adaptation performance than other algorithms, for example maximum likelihood linear regression (MLLR) or maximum a posteriori (MAP) adaptation, when only a small amount of adaptation data was available.
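The three steps above can be illustrated with a minimal sketch. It is not the implementation from the cited article: the closeness measure (squared distance between mean feature vectors), the transformation (a simple mean shift), the single diagonal Gaussian, and all names such as `adapt` and `n_select` are assumptions made here for illustration only.

```python
from statistics import fmean


def mean_vector(frames):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [fmean(col) for col in zip(*frames)]


def sq_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def adapt(training, test_frames, n_select):
    """training: dict of speaker id -> list of feature vectors (hypothetical layout)."""
    test_mean = mean_vector(test_frames)

    # Step 1: pick the training speakers closest to the test speaker.
    # Assumption: closeness is the distance between mean feature vectors.
    ranked = sorted(
        training,
        key=lambda spk: sq_distance(mean_vector(training[spk]), test_mean),
    )
    selected = ranked[:n_select]

    # Step 2: transform each selected speaker's data toward the test speaker.
    # Assumption: a mean shift stands in for the unspecified transformation.
    pooled = []
    for spk in selected:
        spk_mean = mean_vector(training[spk])
        shift = [t - m for t, m in zip(test_mean, spk_mean)]
        for frame in training[spk]:
            pooled.append([x + d for x, d in zip(frame, shift)])

    # Step 3: re-estimate Gaussian parameters (mean, diagonal variance)
    # from the transformed data of the selected speakers only.
    mean = mean_vector(pooled)
    var = [
        fmean([(x - m) ** 2 for x in col])
        for col, m in zip(zip(*pooled), mean)
    ]
    return selected, mean, var


training = {
    "a": [[0.0, 0.0], [2.0, 2.0]],
    "b": [[10.0, 10.0], [12.0, 12.0]],
    "c": [[1.0, 1.0], [3.0, 3.0]],
}
test_frames = [[1.5, 1.5], [2.5, 2.5]]
selected, mean, var = adapt(training, test_frames, n_select=2)
```

In this toy data, speaker "b" lies far from the test speaker and is excluded, so only the shifted data of "c" and "a" contribute to the re-estimated parameters.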
The implementation of the proposed technique uses the transformed training data of each selected training speaker to re-estimate the system parameters. This requires the entire training corpus to be available on-line for the adaptation process, and is not practical in many situations. This problem can be circumvented if a model is stored for each of the training speakers, and the transformation (to bring the training speaker closer to the test speaker) is applied to the model. The transformed models are then combined to produce the speaker-adapted model. However, due to the large number of training speakers, storing the models of each training speaker would require a prohibitively large amount of storage. Also, sufficient data from each training speaker to robustly estimate the parameters of the speaker-dependent model for the training speaker may not be available.
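The model-based alternative described above, in which a transformation is applied to each stored speaker model and the transformed models are then combined, might look roughly like the following sketch. The per-dimension scale and offset transform, the count-weighted averaging rule, and the names `transform_mean` and `combine_models` are hypothetical; the text above does not specify the form of the transformation or how the models are combined.

```python
def transform_mean(mean, scale, offset):
    """Apply a per-dimension affine transform to a stored model mean."""
    return [s * m + o for m, s, o in zip(mean, scale, offset)]


def combine_models(models, transforms):
    """Combine transformed per-speaker means into one adapted mean.

    models:     speaker id -> (mean vector, frame count)   (hypothetical layout)
    transforms: speaker id -> (scale vector, offset vector) (hypothetical layout)
    Assumption: models are combined by a frame-count-weighted average.
    """
    total = sum(count for _, count in models.values())
    dim = len(next(iter(models.values()))[0])
    combined = [0.0] * dim
    for spk, (mean, count) in models.items():
        scale, offset = transforms[spk]
        moved = transform_mean(mean, scale, offset)
        for i, value in enumerate(moved):
            combined[i] += count * value / total
    return combined


models = {"a": ([1.0, 1.0], 100), "c": ([2.0, 2.0], 300)}
transforms = {
    "a": ([1.0, 1.0], [1.0, 1.0]),  # shift speaker "a" toward the test speaker
    "c": ([1.0, 1.0], [0.0, 0.0]),  # speaker "c" is left unchanged
}
adapted_mean = combine_models(models, transforms)
```

Only the stored means and counts are needed at adaptation time, not the training corpus itself, but one such model must be kept for every training speaker, which is the storage cost noted above.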
Therefore, a need exists for an improved system and method of speech recognition which adapts to different speakers. A further need exists for reducing storage space needed to store training speaker models.