1. Field of the Invention
The invention relates generally to speech recognition systems, and more particularly to an improved method of developing and employing a garbage model in a speaker-dependent speech recognition system that has limited resources, such as the system in a cellular telephone.
2. Description of Related Art
The user interfaces of many electronic systems now involve speech recognition technology. There are two general types of speech recognition systems: (1) "speaker independent" (SI) systems; and (2) "speaker dependent" (SD) systems. Some phone companies, for example, have used SI speech recognition technology to create directory assistance mechanisms whereby any user may say the name of the city for which directory assistance is desired. Likewise, some cellular telephones feature SD speech recognition technology so that a particular user may "train" the phone to recognize "call home" and then automatically dial the appropriate number.
Unlike SI systems, SD systems require training. SD systems, however, are normally hampered by limited training data, because a user would find it annoying to provide extensive training data. Moreover, SD systems are often used in a portable device, such as a cellular phone, which tends to have severely limited memory and/or computing power because it is necessarily designed within certain size, memory, cost and power constraints. The solutions suitable for implementation in an SI system, therefore, are not generally applicable to an SD system having limited training data, particularly where such an SD system is used in a portable device having limited resources.
Speech recognition systems generally attempt to match an incoming "utterance" with one of a plurality of predetermined "vocabulary" words. In a typical implementation, the acoustic utterance is converted to a digital token, analyzed or decomposed in terms of characteristic "features," and then simultaneously compared, feature-by-feature, with one or more word models that each represent a vocabulary word.
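The matching step described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `frame_distance`, `score`, `recognize` and `WORD_MODELS`, the two-dimensional feature vectors, and the use of an average Euclidean distance over equal-length templates. Practical recognizers typically use hidden Markov models or dynamic time warping rather than this fixed-length comparison.

```python
import math

def frame_distance(a, b):
    """Euclidean distance between two feature vectors (one frame each)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def score(features, model):
    """Average frame-by-frame distance; a lower score is a closer match.

    Assumption for this sketch: the utterance and the word model have
    the same number of frames, so no time alignment is performed.
    """
    if len(features) != len(model):
        raise ValueError("sketch assumes equal frame counts")
    total = sum(frame_distance(f, m) for f, m in zip(features, model))
    return total / len(model)

def recognize(features, word_models):
    """Compare the utterance with every word model and return the best one."""
    scores = {word: score(features, model) for word, model in word_models.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]

# Hypothetical word models: one short feature template per vocabulary word.
WORD_MODELS = {
    "call home": [[1.0, 0.0], [0.8, 0.2], [0.1, 0.9]],
    "call office": [[1.0, 0.0], [0.2, 0.8], [0.9, 0.1]],
}

# Hypothetical features extracted from an incoming utterance.
utterance = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8]]
best_word, best_score = recognize(utterance, WORD_MODELS)
print(best_word, round(best_score, 4))
```

Note that `recognize` in this sketch always returns some vocabulary word, however poor the best score is, because it has no way to reject an utterance.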
FIG. 1, for example, shows a simplified network 20 that assigns an input utterance to one of N predetermined vocabulary words WORD_1 to WORD_N by finding the best match between certain "features" 200 of the input utterance and one of a plurality of "word models" 20-1 to 20-N. The FIG. 1 system, however, is subject to "mismatches" and "false acceptances":