1. Field of the Invention
This invention relates to speech recognition, and more particularly to a method and apparatus for dynamic beam control in a Viterbi search.
2. Description of the Related Art
Speech recognition, also called voice recognition, has become a popular way to increase work efficiency. Several techniques are used in speech recognition processes to recognize the human voice. A speech recognition system functions as a pipeline that converts digital audio signals coming from devices, such as a personal computer (PC) sound card, into recognized speech. These signals may pass through several stages, where various mathematical and statistical processes are used to determine what has actually been said.
Many speech recognition applications have databases containing thousands of frequencies or “phonemes” (also known as “phones” in speech recognition systems). A phoneme is the smallest unit of speech in a language or dialect (i.e., the smallest unit of sound that can distinguish two words in a language). The utterance of one phoneme is different from that of another. Therefore, if one phoneme replaces another in a word, the word would have a different meaning. For example, if the “B” in “bat” were replaced by the phoneme “R,” the meaning would change to “rat.” The phoneme databases are used to match the audio frequency bands that were sampled. For example, if an incoming frequency sounds like a “T,” an application will try to match it to the corresponding phoneme in the database. Also, adjacent phones, known as context, can affect pronunciation. For example, the “T” in “that” sounds different from the “T” in “truck.” A phone with a fixed left (right) context is generally known as a “left (right) biphone.” A phone with fixed left and right contexts is known as a “triphone.” The phoneme databases may contain many entries for each phoneme, corresponding to biphones or triphones. Each phoneme is tagged with a feature number, which is then assigned to the incoming signal.
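As a minimal sketch of how context-dependent units arise, the following Python fragment pairs each phone in a sequence with its left and right neighbors. The function name `to_triphones`, the tuple layout, and the ARPAbet-style phone symbols are assumptions made for illustration only; they are not taken from any particular recognizer.

```python
from typing import List, Optional, Tuple

# (left context, phone, right context); None marks a word boundary.
Triphone = Tuple[Optional[str], str, Optional[str]]


def to_triphones(phones: List[str]) -> List[Triphone]:
    """Pair each phone with its left and right neighbors."""
    units: List[Triphone] = []
    for i, phone in enumerate(phones):
        left = phones[i - 1] if i > 0 else None
        right = phones[i + 1] if i + 1 < len(phones) else None
        units.append((left, phone, right))
    return units


# Hypothetical phone transcriptions of "that" and "truck".
that = to_triphones(["DH", "AE", "T"])
truck = to_triphones(["T", "R", "AH", "K"])

# The same phone "T" becomes two distinct context-dependent units.
t_in_that = that[2]    # ("AE", "T", None)
t_in_truck = truck[0]  # (None, "T", "R")
```

Because the two “T” units differ in context, a database keyed on such units can hold a separate entry for each.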
There can be so many variations in sound due to how words are spoken that it is almost impossible to exactly match an incoming sound to an entry in the database. Moreover, different people may pronounce the same word differently. Further, the environment also adds its own share of noise. Thus, applications must use complex techniques to approximate an incoming sound and figure out which phonemes are being used.
Another problem in speech recognition involves determining when a phoneme (or a smaller unit) ends and the next one begins. For problems like this, a technique called a hidden Markov model (HMM) may be used. An HMM provides a pattern matching approach to speech recognition.
An HMM is generally defined by the following elements: first, the number of states in the model, $N$; next, a state-transition matrix $A$, where $a_{ij}$ is the probability of the process moving from state $q_i$ to state $q_j$ at time $t = 1, 2, \ldots$, given that the process is at state $q_i$ at time $t-1$; the observation probability distribution $b_i(\vec{o})$, $i = 1, \ldots, N$, for all states $q_i$; and the initial state probability $\pi_i$ for $i = 1, \ldots, N$.
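The elements listed above can be gathered into a small data structure. The following Python sketch is illustrative only: the class name `HMM`, its field names, the `validate` method, and the one-dimensional Gaussian used as an observation density are all assumptions, not part of any specific recognizer.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence

Observation = Sequence[float]  # one acoustic feature vector


@dataclass
class HMM:
    n_states: int                               # N
    trans: List[List[float]]                    # A, with trans[i][j] = a_ij
    emit: List[Callable[[Observation], float]]  # b_i(o), one density per state
    initial: List[float]                        # pi_i

    def validate(self, tol: float = 1e-9) -> None:
        """Check that the sizes match N and that probabilities sum to 1."""
        n = self.n_states
        if len(self.trans) != n or len(self.emit) != n or len(self.initial) != n:
            raise ValueError("A, b and pi must each have N entries")
        for i, row in enumerate(self.trans):
            if len(row) != n or abs(sum(row) - 1.0) > tol:
                raise ValueError(f"row {i} of A must have N entries summing to 1")
        if abs(sum(self.initial) - 1.0) > tol:
            raise ValueError("pi must sum to 1")


def gaussian(mean: float, var: float) -> Callable[[Observation], float]:
    """Hypothetical 1-D Gaussian density on the first feature component."""
    def density(o: Observation) -> float:
        x = o[0]
        return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
    return density


# A two-state, left-to-right model that always starts in state 0.
model = HMM(
    n_states=2,
    trans=[[0.7, 0.3], [0.0, 1.0]],
    emit=[gaussian(0.0, 1.0), gaussian(3.0, 1.0)],
    initial=[1.0, 0.0],
)
model.validate()
```

Each row of `trans` sums to one because, from any state $q_i$, the process must move to some state (possibly $q_i$ itself) at the next time step.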
In order to perform speech recognition using an HMM, languages are typically broken down into a limited group of phonemes. For example, the English language may be broken down into approximately 40-50 phonemes. One should note, however, that if context-dependent units are used, such as biphones or triphones, the limited group may consist of several thousand units. A stochastic model of each of the units is then created. Given an acoustical observation, the most likely phoneme corresponding to the observation can then be determined. One method for determining the most likely phoneme corresponding to the acoustical observation uses Viterbi scoring (named after A. J. Viterbi).
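As a rough illustration of Viterbi scoring, the following Python sketch computes the log-probability of the single best state sequence through one model, given per-frame observation probabilities that have already been evaluated. The function names, the argument layout, and the toy numbers are assumptions made for this example. Working in the log domain, as done here, is a common choice that avoids numerical underflow on long utterances.

```python
import math
from typing import List, Tuple

NEG_INF = float("-inf")


def safe_log(p: float) -> float:
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0.0 else NEG_INF


def viterbi(log_init: List[float],
            log_trans: List[List[float]],
            log_emit: List[List[float]]) -> Tuple[float, List[int]]:
    """Return (best log score, best state path).

    log_emit[t][i] is log b_i(o_t) for frame t and state i.
    """
    n = len(log_init)
    score = [log_init[i] + log_emit[0][i] for i in range(n)]
    back: List[List[int]] = []
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for j in range(n):
            # Best predecessor of state j at this frame.
            best_i = max(range(n), key=lambda i: score[i] + log_trans[i][j])
            new_score.append(score[best_i] + log_trans[best_i][j] + log_emit[t][j])
            pointers.append(best_i)
        back.append(pointers)
        score = new_score
    # Trace the best path backwards from the best final state.
    last = max(range(n), key=lambda i: score[i])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path


# Toy two-state model and three frames of made-up observation probabilities.
init = [safe_log(p) for p in (1.0, 0.0)]
trans = [[safe_log(p) for p in row] for row in ([0.5, 0.5], [0.0, 1.0])]
emit = [[safe_log(p) for p in row] for row in ([0.9, 0.1], [0.8, 0.2], [0.1, 0.9])]

best_score, best_path = viterbi(init, trans, emit)
```

To choose among several phoneme models, one would score the same observations against each model in this way and select the model with the highest score.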