The present invention relates generally to speech recognizers. More particularly, the invention relates to a small memory footprint recognizer suitable for embedded applications where available memory and processor resources are limited. New words are added to the recognizer lexicon by entry as spelled words that are then converted into phonetic transcriptions and subsequently into syllabic transcriptions for storage in the lexicon.
The trend in consumer products today is to incorporate speech technology to make these products easier to use. Many consumer products, such as cellular telephones, offer ideal opportunities to exploit speech technology; however, they also present a challenge in that memory and processing power are often limited. Considering the particular case of using speech recognition technology for voice dialing of cellular telephones, the embedded recognizer will need to fit into a relatively small amount of non-volatile memory, and the random access memory used by the recognizer in operation is also fairly limited.
To economize memory usage, the typical embedded recognizer system will have a very limited, often static, vocabulary. The more flexible large vocabulary recognizers that employ a phonetic approach combined with statistical techniques, such as Hidden Markov Models (HMMs), use far too much memory for many embedded system applications. Moreover, the more powerful, general purpose recognizers model words on subword units, such as phonemes, that are concatenated to define the word models. Frequently these models are context-dependent. They store different versions of each phoneme according to what neighboring phonemes precede and follow (typically stored as triphones). For most embedded applications there are simply too many triphones to be stored in a small amount of memory.
Related to the memory constraint issue, many embedded systems have difficulty accommodating a user who wishes to add new words to the lexicon of recognized words. Not only is lexicon storage space limited, but the temporary storage space needed to perform the word addition process is also limited. Moreover, in embedded systems, such as the cellular telephone, where the processor needs to handle other tasks, conventional lexicon updating procedures may not be possible within a reasonable length of time. User interaction features common to conventional recognizer technology are also restricted. For example, in a conventional recognizer system, a guidance prompt is typically employed to confirm that a word uttered by the user was correctly recognized. In conventional systems the guidance prompt may be an encoded version of the user's recorded speech. In some highly constrained embedded systems, such guidance prompts may not be practical because the encoded version of the recorded speech (guidance voice) requires too much memory.
The present invention addresses the above problems by providing a small memory footprint recognizer that may be trained quickly and without large memory consumption by entry of new words through spelling. The user enters characters, such as through a keyboard or a touch-tone pad of a telephone, and these characters are processed by a phoneticizer that uses decision trees or the like to generate a phonetic transcription of the spelled word. If desired, multiple transcriptions can be generated by the phoneticizer, yielding the n-best transcriptions. Where memory is highly constrained, the n-best transcriptions can be generated using a confusion matrix that calculates the n-best transcriptions based on the one transcription produced by the phoneticizer. These transcriptions are then converted into another form based on hybrid sound units described next.
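By way of illustration only, the confusion-matrix expansion described above might be sketched as follows. The matrix contents, the phoneme symbols, and the function name n_best are hypothetical and are not taken from the disclosure; the sketch assumes that each phoneme is confused independently of its neighbors, so that a variant's score is the product of the per-phoneme probabilities.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical confusion matrix: for each phoneme emitted by the phoneticizer,
# the assumed probability of each alternative phoneme (illustrative values only).
CONFUSION: Dict[str, Dict[str, float]] = {
    "ae": {"ae": 0.8, "eh": 0.2},
    "t": {"t": 0.9, "d": 0.1},
}


def n_best(transcription: List[str], n: int) -> List[Tuple[float, List[str]]]:
    """Expand one phonetic transcription into its n highest-scoring variants."""
    beam: List[Tuple[float, List[str]]] = [(1.0, [])]
    for phoneme in transcription:
        # Assumption: a phoneme absent from the matrix is never confused.
        alternatives = CONFUSION.get(phoneme, {phoneme: 1.0})
        candidates = [
            (score * prob, prefix + [alt])
            for score, prefix in beam
            for alt, prob in alternatives.items()
        ]
        # Keep only the n best partial transcriptions at each position.
        beam = heapq.nlargest(n, candidates, key=lambda c: c[0])
    return beam
```

Because only n partial transcriptions are retained at each position, the working memory of such a procedure grows with n rather than with the number of possible variants, which is in keeping with the memory constraint noted above.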
The system employs a hybrid sound unit for representing words in the lexicon. The transcriptions produced by the phoneticizer are converted into these hybrid sound units for compact storage in the lexicon. The hybrid units can comprise a mixture of several different sound units, including syllables, demi-syllables, phonemes and the like. Preferably the hybrid units are selected so that the class of larger sound units (e.g., syllables) represents the most frequently used sounds in the lexicon, and so that one or more classes of smaller sound units (e.g., demi-syllables and phonemes) represent the less frequently used sounds. Such a mixture gives the high recognition quality associated with larger sound units without the large memory requirement. Co-articulated sounds are handled better by the larger sound units, for example.
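As a minimal sketch of the frequency-based selection just described, and not as the selection procedure of the invention, the most frequent syllables of a training word list may be retained as whole units, with all remaining sounds left to the smaller unit classes. The function name select_syllable_units, the representation of a syllable as a tuple of phoneme symbols, and the fixed unit budget max_units are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

Syllable = Tuple[str, ...]  # one syllable, written as a tuple of phoneme symbols


def select_syllable_units(
    syllabified_words: Iterable[List[Syllable]], max_units: int
) -> Set[Syllable]:
    """Return the max_units most frequent syllables in the training words."""
    counts = Counter(
        syllable for word in syllabified_words for syllable in word
    )
    # Syllables outside this set are assumed to be covered by smaller units.
    return {syllable for syllable, _ in counts.most_common(max_units)}
```

The budget max_units is the parameter that trades memory against recognition quality: a larger budget stores more whole syllables, while a smaller one relies more heavily on the smaller unit classes.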
Using a dictionary of hybrid sound units, the transcriptions produced by phonetic transcription are converted to yield the n-best hybrid unit transcriptions. If desired, the transcriptions can be rescored at this stage, using decision trees or the like. Alternatively, the best transcription (or set of n-best transcriptions) is extracted through user interaction or by comparison to the voice input supplied by the user (e.g., through the microphone of a cellular telephone).
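The conversion of a phonetic transcription into hybrid units may be illustrated by the following sketch. It assumes that the transcription has already been divided into syllables and that the hybrid unit dictionary is available as a set of known syllables; demi-syllables are omitted for brevity, so an unknown syllable falls back directly to phonemes. The function name to_hybrid_units and the "syl:" and "ph:" labels are hypothetical.

```python
from typing import List, Set, Tuple

Syllable = Tuple[str, ...]  # one syllable, written as a tuple of phoneme symbols


def to_hybrid_units(
    syllables: List[Syllable], syllable_units: Set[Syllable]
) -> List[str]:
    """Map a syllabified transcription onto a string of hybrid unit names."""
    units: List[str] = []
    for syllable in syllables:
        if syllable in syllable_units:
            # Frequent syllable: emit one large unit.
            units.append("syl:" + "-".join(syllable))
        else:
            # Infrequent syllable: fall back to its individual phonemes.
            units.extend("ph:" + phoneme for phoneme in syllable)
    return units
```

Applying such a function to each of the n-best phonetic transcriptions would yield a corresponding set of hybrid unit transcriptions, which could then be rescored or presented for selection as described above.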
A word template is then constructed from the extracted best or n-best transcriptions by selecting previously stored hybrid units from the hybrid unit dictionary and concatenating them to form a hybrid unit string representing the word. Preferably the hybrid units are represented using a suitable speaker-independent representation; a phone similarity representation is presently preferred, although other representations can be used. The spelled word (letters) and the hybrid unit string (concatenated hybrid units) are stored in the lexicon as a new entry. If desired, the stored spelled word can be used as a guidance prompt by displaying it on the LCD display of the consumer product.
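A lexicon entry and the construction of its word template might be sketched as follows. The names LexiconEntry and build_template, the use of unit names as dictionary keys, and the modeling of each stored unit as a list of numeric frames are illustrative assumptions; the disclosure states only that a speaker-independent representation such as phone similarity is preferred.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Frame = Tuple[float, ...]  # one frame of an assumed speaker-independent representation


@dataclass(frozen=True)
class LexiconEntry:
    spelling: str                # letters entered by the user; reusable as a text prompt
    unit_names: Tuple[str, ...]  # hybrid unit string, stored as unit names


def build_template(
    entry: LexiconEntry, unit_dictionary: Dict[str, List[Frame]]
) -> List[Frame]:
    """Concatenate the stored frames of each hybrid unit, in order."""
    template: List[Frame] = []
    for name in entry.unit_names:
        # Raises KeyError if the entry names a unit absent from the dictionary.
        template.extend(unit_dictionary[name])
    return template
```

In this sketch the entry itself holds only the spelling and a short sequence of unit names, while the frame data resides once in the shared hybrid unit dictionary and is assembled into a template on demand.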
The recognizer of the invention is highly memory efficient. In contrast with the large lexicon of HMM parameters found in conventional systems, the lexicon of the invention is quite compact. Only a few bytes are needed to store the spelled word letters and the associated hybrid unit string. Being based on hybrid units, the word model representation is highly compact, and the hybrid unit dictionary used in word template construction is also significantly smaller than dictionaries found in conventional systems.