The present invention relates to speech recognition. In particular, the present invention relates to the use of segment models to perform speech recognition.
In speech recognition systems, an input speech signal is converted into words that represent the verbal content of the speech signal. This conversion begins by converting the analog speech signal into a series of digital values. The digital values are then passed through a feature extraction unit, which computes a sequence of feature vectors based on the digital values. Each feature vector is typically multi-dimensional and represents a single frame of the speech signal.
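The framing and feature extraction steps described above can be illustrated with a minimal sketch. The source does not say which features are computed, so the two used here (log energy and zero-crossing rate), the function names, and the frame length and hop size are all assumptions chosen for illustration.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a list of digital samples into overlapping fixed-length frames."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames

def feature_vector(frame):
    """Return a toy 2-dimensional feature vector for one frame.

    Assumption: log energy and zero-crossing rate stand in for whatever
    features a real front end would compute.
    """
    energy = sum(s * s for s in frame) / len(frame)
    log_energy = math.log(energy + 1e-10)  # small offset avoids log(0)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    zero_crossing_rate = crossings / (len(frame) - 1)
    return [log_energy, zero_crossing_rate]

def extract_features(samples, frame_len=200, hop=80):
    """Produce one feature vector per frame of the digitized signal."""
    return [feature_vector(f) for f in frame_signal(samples, frame_len, hop)]

# Hypothetical input: 0.1 s of a 440 Hz tone sampled at 8000 Hz.
sample_rate = 8000
samples = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(800)]
features = extract_features(samples)
```

Each element of `features` is one multi-dimensional feature vector that represents a single frame, which is the form of input the models described below operate on.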
To identify a most likely sequence of words, the feature vectors are applied to one or more models that have been trained using a training text. Typically, this involves applying the feature vectors to a frame-based acoustic model in which a single frame state is associated with a single feature vector.
Recently, however, segment models have been introduced that associate multiple feature vectors with a single segment state. These segment models are thought to model large-scale transitions in human speech more accurately than frame-based models.
Although current segment models improve the modeling of large-scale transitions, they take longer than desired both to train and to use in recognition. More efficient segment models are therefore needed.
A method and apparatus determine the likelihood of a sequence of words based in part on a segment model. The segment model includes trajectory expressions formed as the product of a generation matrix and a parameter matrix. The likelihood of the sequence of words is based in part on a segment probability. The segment probability is derived in part by matching the trajectory expressions to a feature vector matrix that contains a sequence of feature vectors for a segment of speech.
Aspects of the method and apparatus also include training the segment model using such a segment probability.
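As a rough illustration of a trajectory expression formed as the product of a generation matrix and a parameter matrix, the sketch below builds a trajectory and scores it against a feature vector matrix. Several details are assumptions not stated in the source: the generation matrix is taken to be polynomial in normalized time, the match is scored with a Gaussian of diagonal variance, and all function and variable names are hypothetical.

```python
import math

def generation_matrix(num_frames, order):
    """Rows are [1, t, t^2, ...] for normalized time t in [0, 1].

    Assumption: a polynomial basis; the source does not specify the form.
    """
    rows = []
    for i in range(num_frames):
        t = i / (num_frames - 1) if num_frames > 1 else 0.0
        rows.append([t ** r for r in range(order + 1)])
    return rows

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def trajectory(num_frames, params):
    """Trajectory = generation matrix times parameter matrix."""
    order = len(params) - 1
    return matmul(generation_matrix(num_frames, order), params)

def segment_log_likelihood(feature_matrix, params, variances):
    """Log probability of a segment's feature vectors given the trajectory.

    Assumption: independent Gaussian residuals with one variance per
    feature dimension.
    """
    total = 0.0
    for y_row, m_row in zip(feature_matrix, trajectory(len(feature_matrix), params)):
        for y, m, var in zip(y_row, m_row, variances):
            total += -0.5 * (math.log(2 * math.pi * var) + (y - m) ** 2 / var)
    return total

# Hypothetical linear trajectory over two feature dimensions.
params = [[1.0, 0.0],    # constant term for each dimension
          [2.0, -1.0]]   # slope for each dimension
variances = [1.0, 1.0]

matching = [[1.0, 0.0], [2.0, -0.5], [3.0, -1.0]]   # lies on the trajectory
shifted = [[2.0, 1.0], [3.0, 0.5], [4.0, 0.0]]      # offset from it

ll_matching = segment_log_likelihood(matching, params, variances)
ll_shifted = segment_log_likelihood(shifted, params, variances)
```

In this sketch, a segment whose feature vectors lie on the trajectory scores higher than one that is offset from it, which is the sense in which the segment probability reflects how well the trajectory matches the feature vector matrix.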