The present invention relates to methods of speech recognition based on hidden Markov models and to corresponding speech recognizers.
Speech recognition systems are becoming increasingly widespread in various technical fields and in private use. Applications include processing orders and other customer services over the telephone at banks, dispatch companies, etc., speech-controlled word processing in offices and at home, and voice control of technical equipment of all types.
Effective and flexible adaptation to speakers, i.e., highly accurate recognition of words and sentences spoken by different speakers, is essential for the practical use of speech recognition systems. Speech recognition systems for the commercial applications mentioned above are subject to particularly stringent requirements in this respect: they cannot be trained to one speaker or to a small number of speakers, but must instead process the speech input of a large number of speakers with a wide variety of speech properties and linguistic idiosyncrasies as reliably as possible.
Speech recognition systems based on hidden Markov models (HMMs) are known in which the words to be recognized are modeled as chains of states that are trained with predefined speech data material. Two fundamentally different approaches can be taken:
In whole-word models, words are modeled by states that correspond to parts of a specific word and apply only to that word. This modeling yields good recognition results, but it can recognize only words that were part of the speech data material used for training. Moreover, it is suitable only for small vocabularies, because for larger vocabularies it becomes too computationally expensive and therefore too slow. Whole-word models are typically used in applications where only numbers or sequences of numbers are to be recognized.
In phoneme-based models, words are modeled by states that correspond to phonemes or parts of phonemes. Because this modeling is independent of the specific vocabulary of the speech data material, any additional words can be added to the speech recognizer later, during practical use. This advantage, however, comes at the cost of lower recognition accuracy.
Nevertheless, speech recognition systems that must cope with large vocabularies which can be expanded flexibly use phoneme-based models exclusively.
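The phoneme-based approach can be illustrated with a minimal Python sketch. This is not the method of the invention, only a toy model under stated assumptions. Word models are built by concatenating left-to-right state chains taken from a shared phoneme inventory and are scored with the Viterbi algorithm. The phoneme inventory, the one-dimensional Gaussian emissions, the lexicon entries, and all names (`PHONEMES`, `LEXICON`, `build_word_model`, `viterbi_log_score`, `recognize`) are hypothetical. A practical recognizer would use trained, typically mixture, densities over multidimensional acoustic feature vectors.

```python
import math
from dataclasses import dataclass


@dataclass
class State:
    """One HMM state with a 1-D Gaussian emission (hypothetical toy feature)."""
    label: str
    mean: float
    var: float
    self_loop: float  # probability of staying in this state for the next frame

    def log_emit(self, x):
        # Log of the Gaussian density N(x; mean, var).
        return -0.5 * (math.log(2 * math.pi * self.var) + (x - self.mean) ** 2 / self.var)


# Hypothetical phoneme inventory: each phoneme is a short chain of states
# whose parameters are shared by every word that contains the phoneme.
PHONEMES = {
    "n": [State("n.1", 1.0, 0.5, 0.6), State("n.2", 1.5, 0.5, 0.6)],
    "o": [State("o.1", 3.0, 0.5, 0.6)],
    "ai": [State("ai.1", 5.0, 0.5, 0.6), State("ai.2", 6.0, 0.5, 0.6)],
}

# Hypothetical lexicon: word -> phoneme transcription.
LEXICON = {"no": ["n", "o"], "nine": ["n", "ai", "n"]}


def build_word_model(transcription):
    """Concatenate the phoneme state chains into one left-to-right word chain."""
    return [state for phoneme in transcription for state in PHONEMES[phoneme]]


def viterbi_log_score(chain, frames):
    """Best-path log-likelihood of the frames through a left-to-right chain.

    The path starts in the first state and must end in the last one; at each
    frame it either stays in its state or advances by exactly one state.
    """
    neg_inf = float("-inf")
    delta = [neg_inf] * len(chain)
    delta[0] = chain[0].log_emit(frames[0])
    for x in frames[1:]:
        new = [neg_inf] * len(chain)
        for j, state in enumerate(chain):
            stay = delta[j] + math.log(state.self_loop)
            move = delta[j - 1] + math.log(1 - chain[j - 1].self_loop) if j > 0 else neg_inf
            best = max(stay, move)
            if best > neg_inf:
                new[j] = best + state.log_emit(x)
        delta = new
    return delta[-1]


def recognize(frames, lexicon):
    """Return the lexicon word whose concatenated model scores best."""
    scores = {w: viterbi_log_score(build_word_model(p), frames) for w, p in lexicon.items()}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    frames = [3.0, 3.0, 1.0, 1.5]  # toy feature frames resembling "o" then "n"
    print(recognize(frames, LEXICON))  # prints: no  ("on" is not yet in the lexicon)
    # Adding a word needs only its transcription; no new states are trained.
    extended = {**LEXICON, "on": ["o", "n"]}
    print(recognize(frames, extended))  # prints: on
```

Because a word model here is only a concatenation of shared phoneme states, adding "on" requires nothing but its transcription. A whole-word recognizer would instead need states trained on recordings of that specific word.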