In a speaker dependent speech recognition system, users must enroll the vocabulary words they wish to have available when using the system. A vocabulary "word" can be a single spoken word or a short phrase, and the vocabulary words chosen depend on the particular application. For example, a speech recognition implementation for a portable radiotelephone might require the user to provide the names and locations of frequently called people (e.g., "Fred's office"), or commands for frequently used features usually available in a user interface (e.g., "battery meter", "messages", or "phone lock").
During an enrollment procedure, a speech recognition system is responsive to the user's input to derive a representative template for each vocabulary word. In some systems, this template can be represented by a hidden Markov model (HMM), which consists of a series of states. Each state represents a finite section of a speech utterance, where "utterance" as used herein refers to a "vocabulary word" which may comprise one or more words. A statistical representation of each state of an HMM is calculated using one or more enrollment speech samples of a particular vocabulary word uttered by the user. This is accomplished through frame-to-state assignments.
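As a minimal sketch of how frame-to-state assignments can yield a statistical representation, the following Python example averages the feature vectors assigned to each state. The choice of a per-state mean, the function name `state_means`, and the toy feature values are assumptions made for illustration; the text above does not specify which statistics a given system computes.

```python
from collections import defaultdict


def state_means(frames, assignment):
    """Average the feature vectors assigned to each state.

    frames:     list of equal-length feature vectors, one per speech frame
    assignment: list of state indices, one per frame (hypothetical input)
    Returns a dict mapping state index -> mean feature vector.
    """
    sums = {}
    counts = defaultdict(int)
    for frame, state in zip(frames, assignment):
        if state not in sums:
            sums[state] = [0.0] * len(frame)
        for i, value in enumerate(frame):
            sums[state][i] += value
        counts[state] += 1
    return {s: [v / counts[s] for v in vec] for s, vec in sums.items()}


# Toy example: three 2-dimensional frames, the first two assigned to state 0.
frames = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]]
assignment = [0, 0, 1]
means = state_means(frames, assignment)
print(means)  # {0: [2.0, 3.0], 1: [10.0, 0.0]}
```

Frames from several enrollment samples of the same vocabulary word could be pooled by concatenating their frame lists and assignments before calling the function.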
Such state assignments are used in both the training and voice recognition modes of operation. In the training mode, the assigned states are used to create models that serve as a comparison reference during the voice recognition mode. In the voice recognition mode, the assignments for input utterances are used to compare those utterances to the stored reference models.
An alignment algorithm, such as a Viterbi algorithm, is used for frame-to-state alignment of an utterance. This alignment algorithm, which provides the best match of the speech utterance onto the model, is used to assign each frame of the vocabulary word utterance to individual states of the model. Using this assignment, the statistical representations for each state can be refined.
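The alignment step can be illustrated with a small Viterbi-style sketch in Python. It rests on several assumptions that are not stated in the text: the model is left-to-right (each frame either stays in the current state or advances by one), the path starts in the first state and ends in the last, each state is represented only by a mean vector, the frame cost is the squared Euclidean distance to that mean, and transitions carry no penalty. The function name `viterbi_align` is hypothetical.

```python
def viterbi_align(frames, means):
    """Assign each frame to a state of a left-to-right model.

    frames: list of feature vectors, one per frame
    means:  list of per-state mean vectors (assumed state representation)
    Returns the lowest-cost list of state indices, one per frame.
    """
    n_frames, n_states = len(frames), len(means)
    if n_frames < n_states:
        raise ValueError("need at least one frame per state")

    def dist(a, b):
        # Squared Euclidean distance, used here as the frame cost.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    inf = float("inf")
    cost = [[inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    cost[0][0] = dist(frames[0], means[0])  # path must start in state 0

    for t in range(1, n_frames):
        for s in range(n_states):
            stay = cost[t - 1][s]
            advance = cost[t - 1][s - 1] if s > 0 else inf
            if advance < stay:
                best, prev = advance, s - 1
            else:
                best, prev = stay, s
            if best < inf:
                cost[t][s] = best + dist(frames[t], means[s])
                back[t][s] = prev

    # Trace back from the final state at the last frame.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


frames = [[0.0], [0.1], [5.0], [5.2], [9.9]]
means = [[0.0], [5.0], [10.0]]
print(viterbi_align(frames, means))  # [0, 0, 1, 1, 2]
```

Once each frame has a state, the per-state statistics can be recomputed from the frames assigned to that state and the alignment repeated, which is one way the refinement described above could proceed.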
Because of the amount of information involved, most speech recognition systems require large amounts of both volatile memory, such as random access memory (RAM), and non-volatile memory (NVM), such as flash ROM or electrically erasable programmable read only memory (EEPROM). These memory requirements can be prohibitively expensive for cost-sensitive applications such as portable wireless communication devices. Additionally, speech recognition systems have significant computational requirements, measured in millions of instructions per second (MIPS). A large number of MIPS is required for both training and voice recognition. This large MIPS requirement can negatively impact the performance of the device in which voice recognition is employed by using valuable resources and slowing down operating speeds.
In order to implement a speaker dependent training and recognition algorithm on a portable device, such as a wireless communication device where very little random access memory (RAM) is available, there is a need for a method that requires less memory and fewer MIPS without significantly degrading recognition performance in any environment.