Speech recognition involves the identification of words or phrases in speech. It generally involves using a speech recognition system including, e.g., a computer, that analyzes the speech according to one or more speech recognition methods to identify the words or phrases included in the speech. Speech recognition may be either speaker dependent, speaker independent or a combination of both.
Speaker dependent speech recognition normally uses a computer that has been "trained" to respond to the manner in which a particular person speaks. In general, the training involves the particular person speaking a word or phrase, converting the speech input into digital signal data, and then generating a template or model of the speech which includes information about various characteristics of the speech. Because templates and models include speech characteristic information, e.g., energy level, duration, and other types of information, templates and models are well suited for speech recognition applications where such characteristics can be measured in received speech and compared to the information included in the templates or models. For purposes of this application, the words templates and models are used interchangeably to refer to sets of speech characteristic information used for speech recognition purposes.
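By way of illustration only, the training steps described above (digitizing an utterance and deriving characteristic information such as energy level and duration) might be sketched as follows. The function name `make_template`, the 8000 Hz sample rate, the 80-sample frame length, and the synthetic tone standing in for a digitized utterance are all assumptions made for this example and are not drawn from any particular system.

```python
import math

def make_template(samples, sample_rate=8000, frame_len=80):
    """Build a toy template: utterance duration plus per-frame log energy."""
    # Split the digitized signal into consecutive, non-overlapping frames.
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    # Log of the mean squared amplitude of each frame (small floor avoids log(0)).
    log_energy = [math.log(sum(s * s for s in f) / frame_len + 1e-10)
                  for f in frames]
    return {"duration_s": len(samples) / sample_rate, "log_energy": log_energy}

# Hypothetical digitized utterance: 0.1 s of a 440 Hz tone sampled at 8000 Hz.
samples = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(800)]
template = make_template(samples)
```

A practical template or model would normally include many more characteristics than the two shown here; the sketch is intended only to make concrete what a set of speech characteristic information can look like.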
Models generated during a speech recognition training process are normally stored in a database for future use during a speech recognition operation. During real time speech recognition applications, input speech is processed in a manner similar to that used to generate models during training. The signal characteristic information or data generated by processing the speech upon which a recognition operation is to be performed is then normally compared to a user's set of speaker dependent models and/or speaker independent models. A match between the input speech and the models is determined in an attempt to identify the speech input. Upon recognition of a particular word or phrase, an appropriate response is normally performed.
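The comparison of input speech to stored models may be pictured with a deliberately simple sketch. It assumes, purely for illustration, that each stored model is a short fixed-length feature vector and that the match is the model at the smallest Euclidean distance; practical systems generally use statistical models rather than this nearest-template rule. The names `model_db`, `distance`, and `recognize`, and all numeric values, are hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognize(features, model_db):
    """Return the stored word whose model is closest to the input features."""
    best = min(model_db, key=lambda word: distance(features, model_db[word]))
    return best, distance(features, model_db[best])

# Hypothetical database of models produced during training.
model_db = {
    "call home": [0.9, 0.2, 0.4],
    "call office": [0.1, 0.8, 0.5],
}

# Features extracted from input speech in the same way as during training.
word, dist = recognize([0.85, 0.25, 0.4], model_db)
```

In this sketch the returned word would then drive the appropriate response, e.g., dialing the number associated with the recognized name.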
Speaker independent speech recognition normally uses composite templates or models, or clusters thereof, that represent the same sound, word, or phrase spoken by a number of different persons. Speaker independent models are normally derived from numerous samples of signal data to represent a wide range of pronunciations. Accordingly, speaker independent speech recognition normally involves the training of models from a large database of utterances, e.g., of the same word or phrase, collected from a large number of different speakers. Speaker independent speech recognition models are designed to recognize the same word or phrase regardless of the identity of the speaker and the potential specific speech characteristics unique to an individual speaker.
Because speaker independent speech recognition models are normally trained from relatively large databases of information, it is often possible to generate meaningful information regarding a relatively large number of speech characteristics for use as a speaker independent speech recognition model. In contrast, speaker dependent speech recognition models are often generated from relatively few utterances, e.g., three utterances. Because of this, the number of meaningful speech characteristics that can normally be statistically modeled to a useful degree during speaker dependent model training is often lower than the number of characteristics that can be modeled in a meaningful way during speaker independent training. As a result, speaker independent speech recognition systems often use more complicated, higher resolution (e.g., more detailed) models than speaker dependent speech recognition systems.
The use of Hidden Markov Models ("HMMs") is one common approach to generating both speaker independent and speaker dependent speech recognition models. Other modeling techniques are also known. Various training techniques for generating HMMs from speech, e.g., utterances of a word or phrase, are also known.
One known technique for using a speech recognition model to recognize speech is referred to as a Viterbi search. In such a search, speech characteristic information generated from a received utterance is compared, as a function of time, to a plurality of potential paths where each path represents one potential speech recognition result. Each path may include a plurality of speech recognition models arranged according to grammar rules. Normally, where the goal is to recognize a single word, one distinct path will be used to represent each word which may be recognized by the speech recognition system. Such a path may include, e.g., silence models, preceding the model of the word to be recognized. The arrangement of speech recognition models (paths) formed by the grammar rules of a given system will simply be referred to as the system's grammar. In some embodiments, recognition of a word or phrase corresponding to a particular path formed by the grammar of a system is declared when a score for the path exceeds a predetermined threshold. Ideally, the path that most closely matches the received speech will be the first to exceed the recognition threshold and be recognized.
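A minimal sketch of path scoring in the spirit of the Viterbi search described above is given below. It is a simplification made under several assumptions: features are one-dimensional, each state emits according to a unit-variance Gaussian, stay and advance transitions are equally likely, and each path is scored over the whole utterance before the threshold test rather than being compared to the threshold as a function of time. The names `GRAMMAR`, `THRESHOLD`, `viterbi_score`, and `log_gauss`, and all numeric values, are hypothetical.

```python
import math

NEG_INF = float("-inf")

def log_gauss(x, mean, var=1.0):
    """Log density of a 1-D Gaussian; stands in for a real emission model."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi_score(obs, state_means,
                  log_stay=math.log(0.5), log_next=math.log(0.5)):
    """Best log score for aligning obs to a left-to-right chain of states.

    The alignment must start in the first state and end in the last one.
    """
    n = len(state_means)
    score = [NEG_INF] * n
    score[0] = log_gauss(obs[0], state_means[0])
    for x in obs[1:]:
        new = [NEG_INF] * n
        for j in range(n):
            stay = score[j] + log_stay
            move = score[j - 1] + log_next if j > 0 else NEG_INF
            best = max(stay, move)  # keep only the better predecessor
            if best > NEG_INF:
                new[j] = best + log_gauss(x, state_means[j])
        score = new
    return score[-1]

# Hypothetical grammar: one path per word, with a silence state
# (mean 0.0) before and after the word's own states.
GRAMMAR = {
    "john": [0.0, 2.0, 4.0, 0.0],
    "joan": [0.0, 2.0, 1.0, 0.0],
}
THRESHOLD = -15.0  # assumed recognition threshold on the total log score

obs = [0.1, 0.0, 2.1, 1.9, 3.8, 4.2, 0.1]  # made-up feature values
scores = {word: viterbi_score(obs, means) for word, means in GRAMMAR.items()}
best_word = max(scores, key=scores.get)
recognized = best_word if scores[best_word] > THRESHOLD else None
```

With these made-up values, the path for "john" matches the observations more closely than the path for the similar sounding "joan", and only the former exceeds the assumed threshold.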
Existing speech modeling and recognition techniques have been applied to a wide variety of applications including, e.g., telephone voice dialing services. Despite improvements in speech recognition modeling techniques in recent years, including the use of HMMs, errors in recognition results still occur. In order to reduce the number of false positive recognitions, i.e., where a speech recognition system incorrectly determines that an utterance includes a modeled word or phrase, some systems have incorporated the use of speaker independent garbage models. Garbage models are sometimes also referred to as out of vocabulary (OOV) models. Garbage models have been used to identify and reject certain words, phrases or sounds which are not included in the speech recognition system's vocabulary.
Speaker independent garbage models are normally created from a large database of utterances and are trained to model sound characteristics which distinguish sounds, utterances or words not included in a speech recognition system's word or phrase vocabulary from those that are. Speaker independent garbage models have been used to model, e.g., the sound of coughs, uhm's and other non-significant sounds or words.
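One way to picture the use of a garbage model is as an additional path that competes with the vocabulary paths, with the utterance being rejected whenever the garbage path scores best. The following sketch assumes that path scores (e.g., log scores in which higher is better) have already been computed; the label `<garbage>`, the function `decide`, and all scores are hypothetical.

```python
GARBAGE = "<garbage>"  # hypothetical label for the garbage (OOV) path

def decide(path_scores):
    """Return the best-scoring vocabulary word, or None if garbage wins."""
    best = max(path_scores, key=path_scores.get)
    return None if best == GARBAGE else best

# Made-up scores for an in-vocabulary name and for a cough.
name_scores = {"john": -10.6, "joan": -17.3, GARBAGE: -12.0}
cough_scores = {"john": -25.1, "joan": -24.0, GARBAGE: -14.2}

accepted = decide(name_scores)   # vocabulary path outscores the garbage path
rejected = decide(cough_scores)  # garbage path scores best, so reject
```

In this sketch, a false positive is avoided for the cough because the garbage path explains the sound better than any vocabulary path does.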
While the known modeling and speech recognition techniques have provided satisfactory recognition rates for many applications, undesirable erroneous recognition results still occur.
Erroneous recognition results are particularly a problem in the case of speaker dependent speech recognition applications where similar sounding words or names are to be recognized. This is because, as discussed above, speaker dependent speech recognition models are often trained using a rather limited number of training utterances resulting in somewhat less discriminating models than could be generated from a larger number of utterances. Speaker dependent recognition is often further complicated by the lack of useful speaker dependent garbage models. Unfortunately, in the case of speaker dependent speech recognition applications, the limited amount of data (utterances) available for training speaker dependent speech recognition models also makes the training of useful speaker dependent garbage models difficult to achieve.
Voice dialing is one example of a speech recognition application where a single user may have a plurality of like sounding names, or names that sound similar to speaker independent commands, in a single directory. Inasmuch as costs and wasted time may be associated with misidentifications of names and/or voice dialing commands, e.g., in terms of misrouted calls, it is desirable to minimize erroneous speech recognition results.
In view of the above, it is apparent that there is a need for methods and apparatus which can be used to improve speech recognition systems in general and to reduce erroneous recognition results in particular. In addition, there is a need for methods and apparatus for generating garbage models from a limited set of utterances so that useful speaker dependent garbage models can be created from a limited set of available information, e.g., the same information used to create speaker dependent speech recognition models.