In recent years, increasingly powerful speech recognition systems have begun to appear. For example, the assignee of the present invention markets a speech recognition system capable of recognizing active vocabularies of up to 5,000 words in real time on a personal computer. To get the best possible performance on difficult tasks such as recognition of a vocabulary of 5,000 words or more, it is desirable to train the system to the voice of each individual user. It is also desirable to customize the vocabulary to the particular subject matter and to the vocabulary usage of the individual user. To get the best results in recognizing unconstrained natural text, it is desirable to use higher-level linguistic knowledge that predicts, based on the context, which words are most probable.
However, to train the system to the voice of the individual user, prior art speech recognition systems have required the user to record the entire vocabulary, or else to separately record a special training or enrollment vocabulary, before any words can be recognized. To customize the vocabulary and the higher-level linguistic knowledge, it has been necessary for the user to supply large amounts of sample text from which the statistical linguistic characteristics can be derived. Significant time, effort, and expense are therefore required before such a large-vocabulary, natural-language recognition system is ready to be used by a new speaker or in a new area of subject matter.
Among the techniques used in prior art speech recognition to limit or reduce the training and customization time is the use of speaker-independent acoustic models. Unfortunately, speaker-independent acoustic models must allow for the total amount of variability in the way a given word can be pronounced by many different speakers. This tends to result in fuzzy, or poorly defined, word models. With this variability, or fuzziness, there will be a greater amount of overlap between the region of acoustic parameter space that represents a given sound and the region that represents a different but similar sound. Thus, other things being equal, there will be a higher error rate in a speech recognition system using speaker-independent acoustic models than in a speech recognition system using speaker-dependent acoustic models. For large-vocabulary, natural-language speech recognition tasks, it is important to get the highest possible performance in the acoustic recognition, because the overall recognition task is very difficult.
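The relationship between model fuzziness and error rate can be illustrated with a minimal numerical sketch. It is not taken from any described system: it assumes, purely for illustration, that two similar sounds are each modeled by a one-dimensional Gaussian over a single acoustic parameter, with equal prior probability and a shared standard deviation. The function name `bayes_error` and all numeric values are hypothetical. Under those assumptions the lowest achievable confusion rate between the two sounds depends only on the ratio of the separation of the means to the spread, so widening the spread to cover many speakers raises the error.

```python
import math


def bayes_error(mean_a: float, mean_b: float, sigma: float) -> float:
    """Minimum confusion rate between two equally likely sounds.

    Each sound is assumed to be a 1-D Gaussian with the same standard
    deviation `sigma`. The optimal decision boundary is the midpoint of
    the two means, and the error is the Gaussian tail mass beyond it:
    Phi(-d / (2 * sigma)), where d is the distance between the means.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    distance = abs(mean_a - mean_b)
    # Phi(-x) = 0.5 * erfc(x / sqrt(2)), with x = distance / (2 * sigma)
    return 0.5 * math.erfc(distance / (2.0 * sigma * math.sqrt(2.0)))


# Hypothetical values: the same pair of sounds, modeled tightly for one
# speaker and loosely for a population of speakers.
speaker_dependent = bayes_error(0.0, 2.0, sigma=1.0)
speaker_independent = bayes_error(0.0, 2.0, sigma=2.0)

print(f"speaker-dependent error:   {speaker_dependent:.3f}")
print(f"speaker-independent error: {speaker_independent:.3f}")
```

In this toy setting, doubling the spread while holding the means fixed roughly doubles the confusion rate between the two sounds. Real acoustic models are multidimensional and far more complex, but the same qualitative effect applies.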
To get the best possible performance, therefore, the acoustic models and the linguistic model should be customized. The disadvantage of customization, however, is the time it requires and the resulting delay before the speech recognition system can be used for productive work.