The present invention relates to speech recognition. In particular, the present invention relates to the use of models to perform speech recognition.
In speech recognition systems, an input speech signal is converted into words that represent the verbal content of the speech signal. This conversion begins by converting the analog speech signal into a series of digital values. The digital values are then passed through a feature extraction unit, which computes a sequence of feature vectors based on the digital values. Each feature vector is typically multi-dimensional and represents a single frame of the speech signal.
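By way of illustration only, the framing and feature extraction steps described above may be sketched as follows. The frame length, the hop size, and the two-dimensional feature (log energy and zero-crossing count) are assumptions chosen for brevity. They are not features named in this document, and a practical system would typically use a richer feature set.

```python
import math


def split_into_frames(samples, frame_len, hop):
    """Cut a list of digital samples into fixed-length frames, advancing by `hop`."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames


def toy_feature_vector(frame):
    """Placeholder two-dimensional feature: log energy and zero-crossing count."""
    energy = sum(x * x for x in frame)
    log_energy = math.log(energy + 1e-10)  # small offset avoids log(0) on silence
    zero_crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
    )
    return [log_energy, float(zero_crossings)]


def extract_features(samples, frame_len=400, hop=160):
    """Return one feature vector per frame of the digitized signal."""
    return [toy_feature_vector(f) for f in split_into_frames(samples, frame_len, hop)]
```

With these hypothetical defaults, a signal of 1,000 samples yields four frames and therefore four feature vectors.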
To identify a most likely sequence of words, the feature vectors are applied to one or more models that have been trained using a training text. Typically, this involves applying the feature vectors to a frame-based acoustic model in which a single frame state is associated with a single feature vector. Recently, however, segment models have been introduced that associate multiple feature vectors with a single segment state. The segment models are thought to provide a more accurate model of large-scale transitions in human speech.
All models, both frame based and segment based, determine a probability for an acoustic unit. In initial speech recognition systems, the acoustic unit was an entire word. However, such systems required a large amount of modeling data since each word in the language had to be modeled separately. For example, if a language contained 10,000 words, the recognition system needed 10,000 models.
To reduce the number of models needed, the art began using smaller acoustic units. Examples of such smaller units include phonemes, which represent individual sounds in words, and senones, which represent individual states within phonemes. Other recognition systems used diphones, which represent an acoustic unit spanning from the center of one phoneme to the center of a neighboring phoneme.
When determining the probability of a sequence of feature vectors, speech recognition systems of the prior art did not mix different types of acoustic units. Thus, when determining a probability using a phoneme acoustic model, all of the acoustic units under consideration would be phonemes. The prior art did not use phonemes for some segments of the speech signal and senones for other parts of the speech signal. Because of this, developers had to decide between using larger units that worked well with segment models or using smaller units that were easier to train and required less data.
During speech recognition, the probability of an individual acoustic unit is often determined using a set of Gaussian distributions. At a minimum, a single Gaussian distribution is provided for each feature vector spanned by the acoustic units.
The Gaussian distributions are formed from training data and indicate the probability of a feature vector having a specific value for a specific acoustic unit. The distributions are formed by measuring the values of the feature vectors that are generated by a trainer reciting from a training text. For example, for every occurrence of the phoneme “th” in the training text, the resulting values of the feature vectors are measured and used to generate the Gaussian distribution.
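As a minimal sketch of how such a distribution might be formed, the following example estimates a Gaussian from the feature vectors collected for one acoustic unit and then evaluates the log density of a new feature vector. The diagonal covariance and the variance floor are simplifying assumptions made here, and the function names are hypothetical.

```python
import math


def fit_diagonal_gaussian(vectors, var_floor=1e-6):
    """Estimate a per-dimension mean and variance from training feature vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    var = [
        # The floor keeps a dimension with no spread from producing zero variance.
        max(sum((v[d] - mean[d]) ** 2 for v in vectors) / n, var_floor)
        for d in range(dim)
    ]
    return mean, var


def log_density(x, mean, var):
    """Log of the diagonal-covariance Gaussian density at feature vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * s) + (xi - m) ** 2 / s)
        for xi, m, s in zip(x, mean, var)
    )
```

A feature vector close to the estimated mean receives a higher log density than one far from it, which is the sense in which the distribution scores an observation against an acoustic unit.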
Because different speakers produce different speech signals, a single Gaussian distribution for an acoustic unit can sometimes produce a high error rate in speech recognition simply because the observed feature vectors were produced by a different speaker than the speaker used to train the system. To overcome this, the prior art introduced a mixture of Gaussian distributions for each acoustic unit. Within each mixture, a separate Gaussian is generated for each group of speakers. For example, there could be one Gaussian for the male speakers and one Gaussian for the female speakers.
With a mixture of Gaussians, each acoustic unit has multiple targets, one located at the mean of each Gaussian. Thus, for a particular acoustic unit, one target may be from a male training voice and another target may be from a female training voice.
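The mixture described above may be illustrated with the following sketch, which scores a feature vector against a mixture and reports which component, and therefore which target, explains it best. The representation of a component as a (weight, mean, variance) tuple, the diagonal covariance, and the function names are assumptions made for this example.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * s) + (xi - m) ** 2 / s)
        for xi, m, s in zip(x, mean, var)
    )


def mixture_log_likelihood(x, components):
    """Log of the weighted sum of component densities, via log-sum-exp."""
    logs = [math.log(w) + log_gauss(x, m, v) for w, m, v in components]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))


def closest_target(x, components):
    """Index of the component whose weighted density at x is highest."""
    scores = [math.log(w) + log_gauss(x, m, v) for w, m, v in components]
    return scores.index(max(scores))
```

For instance, with one component centered at 0.0 and another at 5.0, an observation of 4.8 is attributed to the second component, whose mean serves as the target for that observation.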
Since the probability associated with each acoustic unit is determined serially under the prior art, it is possible to use targets associated with two different groups of speakers when determining the probabilities of feature vectors for two neighboring acoustic units. Thus, in one acoustic unit, a target associated with a male trainer may be used to determine the probability of a set of feature vectors and in the next acoustic unit a target associated with a female speaker may be used to determine the probability of a set of feature vectors. Such a discontinuity in the targets between neighboring acoustic units is undesirable because it represents a trajectory in the speech signal that never occurs in the training data. Such a trajectory is known as a phantom trajectory in the art.
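The way serial, unit-by-unit scoring can give rise to such a discontinuity may be sketched as follows. The example uses scalar observations and squared distance to each target as a stand-in for Gaussian probabilities with equal variances. The unit names, group labels, and numeric values are hypothetical and chosen only to make the effect visible.

```python
def pick_targets_independently(observations, targets):
    """For each (unit, observation) pair, pick the nearest target on its own."""
    chosen = []
    for unit, obs in observations:
        group = min(
            targets[unit],
            key=lambda g: (obs - targets[unit][g]) ** 2,
        )
        chosen.append(group)
    return chosen


def has_target_discontinuity(chosen):
    """True if any two neighboring units used targets from different groups."""
    return any(a != b for a, b in zip(chosen, chosen[1:]))


# Hypothetical targets: one per speaker group for each of two neighboring units.
targets = {
    "A": {"male": 1.0, "female": 3.0},
    "B": {"male": 2.0, "female": 4.0},
}
observations = [("A", 1.4), ("B", 3.3)]
chosen = pick_targets_independently(observations, targets)
print(chosen, has_target_discontinuity(chosen))
```

Because each unit is scored without regard to its neighbor, the first unit selects the male target and the second selects the female target, a combination that no single training speaker produced.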
A speech recognition method and system utilize an acoustic model that is capable of providing probabilities for both a large acoustic unit and an acoustic sub-unit. Each of these probabilities describes the likelihood of a set of feature vectors from a series of feature vectors representing a speech signal. The large acoustic unit is formed from a plurality of acoustic sub-units. At least one sub-unit probability and at least one large unit probability from the acoustic model are used by a decoder to generate a score for a sequence of hypothesized words. When combined, the acoustic sub-units associated with all of the sub-unit probabilities used to determine the score span fewer than all of the feature vectors in the series of feature vectors.
In some embodiments of the invention, an overlapping decoding technique is used. In this decoding system, two acoustic probabilities are determined for two sets of feature vectors wherein the two sets of feature vectors are different from each other but include at least one common feature vector. A most likely sequence of hypothesized words is then identified using the two acoustic probabilities.
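Purely as an illustration of what is meant by two different sets of feature vectors that share at least one common feature vector, the following sketch generates overlapping runs of frame indices. The fixed segment length and fixed overlap are assumptions made for this example and are not requirements stated in this document.

```python
def overlapping_segments(num_frames, seg_len, overlap):
    """Return ranges of frame indices in which neighbors share `overlap` frames."""
    if not 0 < overlap < seg_len:
        raise ValueError("overlap must be positive and smaller than seg_len")
    step = seg_len - overlap
    return [
        range(start, start + seg_len)
        for start in range(0, num_frames - seg_len + 1, step)
    ]


def share_a_frame(a, b):
    """True if the two sets of frame indices have at least one frame in common."""
    return bool(set(a) & set(b))
```

With ten frames, a segment length of four, and an overlap of one, the first segment covers frames 0 through 3 and the second covers frames 3 through 6, so frame 3 contributes to both acoustic probabilities.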