Under present technology, human input into computers usually involves typing instructions or data on a keyboard or pointing to meaningful areas of the computer display unit with a pointing device such as a mouse. However, most people are much more comfortable communicating by speech. Speech communication is particularly important where the written language is very difficult to input into a computer, as with Chinese characters or Japanese Kanji. Consequently, the development of a speech recognizer for a computer would greatly expand the usability and usefulness of computers.
Many computer speech recognizers have been developed with varying success. Difficulties in recognizing speech arise from acoustic differences occurring each time a word is spoken. These acoustic differences can be caused by differences in each speaker's accent, speaking speed, loudness and vocal quality. Even a single word spoken by a single speaker varies acoustically due to changes in the speaker, such as health or stress levels, or changes in the speaker's environment, such as background noise and room acoustics.
One prior art speech recognition system that has achieved some success is the hidden Markov model (HMM) system. In general, HMM systems model each speech signal with well-defined statistical algorithms that can automatically extract knowledge from speech data. The data needed for the HMM statistical models are obtained by having numerous training words spoken, each accompanied by a typed indication of the word being spoken. As with all statistical methods, the accuracy of HMM speech recognition systems improves greatly as the number of spoken training words increases. Given the large number of acoustic variables, the number of spoken training words needed to accurately model the spoken words can be quite large. Consequently, the memory needed to store the models necessary to recognize a large vocabulary of words is extensive, e.g., approximately 28 megabytes. As a result, a system for compressing the modeling data would allow more words to be modeled and stored in less space.
In general, HMM speech recognition systems model each word according to output and transitional probability distributions. The output probability distribution refers to the probabilities of the word having each acoustic feature in a set of predefined acoustic features. The transitional probability distribution for the word refers to the probabilities of a predefined portion of the word, known as a state or frame, being followed by either a new state or a repetition of the current state. For example, the word "dog" has a relatively high output probability of including a hard "d" sound. The transitional probability distribution for "dog" includes the probability of the "o" sound being repeated in the next frame and the probability of the "g" sound occurring in the next state. It should be recognized that a word typically includes 40-50 states, many more than the three phones of the "dog" example, but the transitional and output probability distribution concepts are the same.
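The two kinds of distribution can be illustrated with a minimal sketch. It assumes a hypothetical three-state model of "dog" with one state per phone; the probability values, the codeword labels `c1` to `c3`, and the function name `path_probability` are invented for illustration and are not taken from any actual recognizer.

```python
# Hypothetical three-state, left-to-right model of the word "dog".
# All numbers below are illustrative assumptions, not measured values.

# Transitional probabilities: each state either repeats or advances.
TRANSITIONS = {
    "d": {"d": 0.4, "o": 0.6},
    "o": {"o": 0.7, "g": 0.3},
    "g": {"g": 0.5, "END": 0.5},  # "END" marks leaving the word
}

# Output probabilities: chance of each state emitting each acoustic label.
OUTPUTS = {
    "d": {"c1": 0.9, "c2": 0.1},
    "o": {"c2": 0.8, "c3": 0.2},
    "g": {"c3": 0.7, "c1": 0.3},
}


def path_probability(path, observations):
    """Joint probability of a state path and the labels observed along it."""
    prob = 1.0
    for i, (state, label) in enumerate(zip(path, observations)):
        # Output probability of seeing this label while in this state.
        prob *= OUTPUTS[state].get(label, 0.0)
        # Transitional probability of moving to the next state in the path.
        if i + 1 < len(path):
            prob *= TRANSITIONS[state].get(path[i + 1], 0.0)
    return prob


if __name__ == "__main__":
    # The "o" state repeats once before the model advances to "g".
    print(path_probability(["d", "o", "o", "g"], ["c1", "c2", "c2", "c3"]))
```

In this sketch the repeated "o" state stands for a sound that spans two frames, and a path that skips a state, such as "d" directly to "g", receives zero probability because no such transition is defined.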
Prior art HMM speech recognition systems typically employ vector quantization to divide the acoustic space, or frequency range spoken by humans, into a predetermined number of acoustic feature models that are given labels called "codewords." The output probabilities for each word are represented by probability values for each codeword. Typically, approximately 200 codewords are chosen to allow accurate modeling of the spoken words with minimum distortion caused by an acoustic feature falling between two adjacent acoustic feature models. Because the range of codewords is chosen to represent all of the possible acoustic features of speech, it is highly unlikely that a single word will have a non-zero probability for every codeword. In fact, most words have a non-zero probability value for only a minority of the codewords. As a result, the models for most words are stored with a substantial number of repeated zero probability values. These repeated zero probability values consume considerable storage space while carrying very little relevant information.
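The effect described above can be sketched in a few lines. The example assumes a hypothetical two-dimensional codebook of five codewords and Euclidean nearest-neighbor quantization; a real system would use roughly 200 codewords over higher-dimensional acoustic features, and the names `CODEBOOK`, `quantize`, `output_probabilities`, and `zero_fraction` are invented for illustration.

```python
import math

# Hypothetical codebook: each entry is the center of one acoustic feature model.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 2.0)]


def quantize(vector):
    """Return the index of the codeword nearest to the feature vector.

    Euclidean distance is an assumption made for this sketch.
    """
    return min(range(len(CODEBOOK)), key=lambda i: math.dist(vector, CODEBOOK[i]))


def output_probabilities(frames):
    """Estimate per-codeword output probabilities from training frames."""
    counts = [0] * len(CODEBOOK)
    for frame in frames:
        counts[quantize(frame)] += 1
    return [count / len(frames) for count in counts]


def zero_fraction(probabilities):
    """Fraction of codewords whose stored probability is exactly zero."""
    return sum(1 for p in probabilities if p == 0.0) / len(probabilities)


if __name__ == "__main__":
    # Made-up feature vectors for one word; they cluster near two codewords.
    frames = [(0.1, 0.1), (0.9, 0.1), (0.2, -0.1), (1.1, 0.0)]
    probs = output_probabilities(frames)
    print(probs)                 # one value per codeword
    print(zero_fraction(probs))  # share of entries that are zero
```

Because the made-up frames fall near only two of the five codewords, the remaining entries of the stored distribution are zero, which mirrors on a small scale the repeated zero values noted above.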