The present invention relates to temporal pattern recognition. In particular, the present invention relates to the use of segment models to perform temporal pattern recognition.
Temporal pattern recognition refers to the identification of salient trends in time-varying signals such as speech, handwriting, and stock quotes. For instance, the patterns to be recognized in speech are syllables, words, phrases, or other phonologically significant linguistic units. For handwriting, the patterns are strokes of written letters, words, ideograms, or logograms. Though the following descriptions focus on speech recognition systems, the same principles apply to other temporal pattern recognition systems as well.
In speech recognition systems, an input speech signal is converted into words that represent the verbal content of the speech signal. This conversion begins by converting the analog speech signal into a series of digital values. The digital values are then passed through a feature extraction unit, which computes a sequence of feature vectors based on the digital values. Each feature vector represents a section of the speech signal.
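The framing and feature extraction step described above can be illustrated with a minimal sketch. The frame length, hop size, and the two toy features (log energy and zero-crossing count) are assumptions chosen for illustration only; they are not the features of any particular recognizer.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a list of digital samples into overlapping, fixed-length frames."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames

def feature_vector(frame):
    """Toy 2-D feature vector (illustrative): log energy and zero-crossing count."""
    energy = sum(x * x for x in frame)
    log_energy = math.log(energy + 1e-10)  # small floor avoids log(0)
    zero_crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
    )
    return [log_energy, float(zero_crossings)]

# A synthetic "digitized signal": 100 samples of a sine wave.
samples = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]

# One feature vector per frame; each vector represents a section of the signal.
features = [feature_vector(f) for f in frame_signal(samples, frame_len=20, hop=10)]
```

Each entry of `features` corresponds to one section (frame) of the signal, which is the form of input assumed by the models discussed below.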
The feature vectors are then used to identify the most likely sequence of words that would have generated the sequence of feature vectors. This usually involves applying the feature vectors to a frame-based acoustic model to determine the most likely sequences of sub-word units, typically senones, and then using a language model to determine which of these sequences of sub-word units is most likely to appear in the language. This most likely sequence of sub-word units is then identified as the recognized speech.
Typically, the frame-based acoustic model is a Hidden Markov Model that is constructed from a series of interconnected states. Each state includes a set of probability functions that are used to determine the likelihood that the state would generate a frame's feature vector. The model also includes probabilities for transitioning between states as well as a list of allowable state sequences for each possible sub-word unit.
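As a rough illustration of how such a model scores one candidate state sequence, the sketch below sums log transition probabilities and per-frame log output probabilities. It assumes scalar features and a single Gaussian output distribution per state; the names `frame_path_logprob`, `log_trans`, and `emit_params` are hypothetical and chosen only for this example.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def frame_path_logprob(observations, states, log_trans, emit_params):
    """Log probability of one state sequence in a frame-based HMM.

    Each frame contributes a transition term (from the previous state)
    and an output term (current state generating the current frame).
    """
    score = 0.0
    prev = None
    for x, s in zip(observations, states):
        if prev is not None:
            score += log_trans[(prev, s)]
        mean, var = emit_params[s]
        score += gaussian_logpdf(x, mean, var)
        prev = s
    return score

# Illustrative two-state model (all numbers are assumptions).
log_trans = {
    ("A", "A"): math.log(0.7), ("A", "B"): math.log(0.3),
    ("B", "B"): math.log(0.7), ("B", "A"): math.log(0.3),
}
emit_params = {"A": (0.0, 1.0), "B": (3.0, 1.0)}
observations = [0.1, 0.0, 2.9, 3.1]

score_aabb = frame_path_logprob(observations, ["A", "A", "B", "B"], log_trans, emit_params)
score_aaaa = frame_path_logprob(observations, ["A", "A", "A", "A"], log_trans, emit_params)
```

With these assumed parameters, the sequence that switches to state `B` when the feature values rise scores higher than the sequence that stays in state `A`.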
To make frame-based Hidden Markov Models computationally feasible, several assumptions must be made. First, the transition probability between two states is assumed to depend only on the two states. It does not depend on any earlier states or any previous feature vectors.
Second, the duration of a sub-word unit is assumed to be set by repeating one or more states. This second assumption means that sub-word units of longer duration are less favored by the model than shorter duration sub-word units because the repetition of a state will result in lower probability. In fact, under this second assumption, a sub-word unit has its highest duration probability when its duration is as short as possible. However, this does not match human speech where the most probable duration for a sub-word unit is usually longer than this minimum duration.
The third assumption used in frame-based Hidden Markov Models is that the probability of generating a feature vector is assumed to be dependent only on the current feature vector and the current state. It is not dependent on past states or past feature vectors. This is sometimes referred to as the conditional independence of observations and is the reason that Hidden Markov Models are known as quasi-stationary models. Because they are piecewise stationary, such models do not model large-scale transitions in speech well. Thus, these models usually overlook transitions between neighboring states as long as the quasi-stationary portions are modeled well by the HMM's output probability distributions.
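The duration behavior described under the second assumption can be checked with a short calculation. When duration is produced only by repeating a state with self-transition probability a, the probability of staying exactly d frames is a^(d-1) * (1 - a), which decreases with every added frame. The sketch below assumes a single self-looping state and an illustrative value of a = 0.8.

```python
def duration_prob(d, self_loop):
    """Probability that a single self-looping state is occupied for exactly d frames.

    The state repeats (d - 1) times with probability `self_loop` each,
    then exits once with probability (1 - self_loop).
    """
    if d < 1:
        raise ValueError("duration must be at least one frame")
    return self_loop ** (d - 1) * (1.0 - self_loop)

# Duration distribution for an assumed self-transition probability of 0.8.
probs = {d: duration_prob(d, 0.8) for d in range(1, 11)}

# The most probable duration is always the shortest one.
most_likely = max(probs, key=probs.get)
```

Here `most_likely` is 1 regardless of the self-transition probability chosen, which is the mismatch with human speech noted above.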
The failure of frame-based Hidden Markov Models to track long-range transitions well and the problem frame-based Hidden Markov Models have in modeling the correct duration for a sub-word unit are thought to be limiting factors in the performance of some speech recognition systems. To overcome these problems, some prior art systems have developed models for longer sections of speech. These longer models, known as segment models, take multiple past feature vectors into consideration when making a determination of the likelihood of a particular segment unit. In addition, these models are explicitly trained to detect the most likely duration for each segment unit.
Although segment models generally model long-range transitions better than frame-based Hidden Markov Models, they do not model steady states as well. Some attempts have been made to improve the resolution of segment models by shortening the segment size. However, these segment models are still not as accurate as frame-based Hidden Markov Models.
Some systems have attempted to combine frame-based Hidden Markov Models with segment models using a two-tier approach. In these systems, frame-based Hidden Markov Models are first used to identify the n-best possible sequences of sub-word units (or a lattice of sub-word units). Each of these n-best sequences is then provided to a segment model, which identifies the most likely sequence. One problem with such a two-tier approach is that the frame-based Hidden Markov Model may discard the best sequence of sub-word units before the segment model has an opportunity to determine the sequence's likelihood. In addition, under such systems, the frame-based Hidden Markov Model and the segment model are trained separately. As such, they may not be trained to work ideally with each other.
Because of these problems, a new recognition system is needed that models both long-range transitions and quasi-stationary portions of a temporal signal and that accurately determines the duration of sub-units in the temporal signal.
A method and apparatus are provided for identifying patterns from a series of feature vectors representing a temporal signal. The method and apparatus use both a frame-based model and a segment model in a unified framework. The frame-based model determines the probability of an individual feature vector given a frame state. The segment model determines the probability of sub-sequences of feature vectors given a single segment state. The probabilities from the frame-based model and the segment model are then combined to form a single path score that is indicative of the probability of a sequence of patterns. Under one embodiment of this framework, the boundaries of sub-units are synchronized between both models. The path with the highest joint probability is the best hypothesis generated by the unified models and thus provides the best recognition result.
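The combination of the two probabilities into a single path score can be sketched as follows. This is a simplified illustration under several assumptions not stated in the text: features are scalars, the frame-based model is a single Gaussian per unit with transition probabilities omitted, the segment model is a linear trajectory with an explicit duration table, and the two log probabilities are added with a hypothetical weight `segment_weight`. Sub-unit boundaries are shared by both models, as in the synchronized embodiment.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def frame_logprob(frames, unit, frame_params):
    """Frame-based term: each feature value is scored independently."""
    mean, var = frame_params[unit]
    return sum(gaussian_logpdf(x, mean, var) for x in frames)

def segment_logprob(frames, unit, segment_params):
    """Segment term: the whole sub-sequence is scored at once, using an
    explicit duration probability and a linear trajectory (assumed form)."""
    start_val, end_val, var, duration_probs = segment_params[unit]
    n = len(frames)
    score = math.log(duration_probs.get(n, 1e-6))  # floor for unseen durations
    for i, x in enumerate(frames):
        t = i / (n - 1) if n > 1 else 0.0
        target = start_val + t * (end_val - start_val)
        score += gaussian_logpdf(x, target, var)
    return score

def unified_path_score(observations, path, frame_params, segment_params,
                       segment_weight=1.0):
    """Single path score; `path` is a list of (unit, start, end) with
    boundaries shared by the frame-based and segment models."""
    total = 0.0
    for unit, start, end in path:
        frames = observations[start:end]
        total += frame_logprob(frames, unit, frame_params)
        total += segment_weight * segment_logprob(frames, unit, segment_params)
    return total

# Illustrative parameters (assumptions, not trained values).
durations = {2: 0.2, 3: 0.5, 4: 0.3}
frame_params = {"a": (0.5, 1.0), "b": (2.5, 1.0)}
segment_params = {"a": (0.0, 1.0, 1.0, durations), "b": (2.0, 3.0, 1.0, durations)}
observations = [0.0, 0.5, 1.0, 2.0, 2.5, 3.0]

candidates = [
    [("a", 0, 3), ("b", 3, 6)],
    [("a", 0, 2), ("b", 2, 6)],
]
best_path = max(
    candidates,
    key=lambda p: unified_path_score(observations, p, frame_params, segment_params),
)
```

In this toy setup the first candidate wins because it fits both the per-frame distributions and the per-segment trajectories and durations. A full decoder would search over all allowable paths rather than score a fixed candidate list.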
Another aspect of the invention is the use of a frame-based model and a segment model to segment feature vectors during model training. Under this aspect of the invention, the frame-based model and the segment model are used together to identify probabilities associated with different segmentations. The segmentation with the highest probability is then used to retrain the frame-based model and the segment model. The revised models are then used to identify a new segmentation. This iterative process continues until both the frame-based model and segment model converge.
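The iterative segment-and-retrain procedure can be sketched in heavily simplified form. The assumptions here go well beyond the text: each utterance contains exactly two units in a fixed order, features are scalars, the frame-based model is one mean per unit, the segment model scores only the segment mean and duration, and convergence is declared when the segmentation stops changing. All names (`joint_score`, `best_boundary`, `retrain`, `train`) are hypothetical.

```python
import math

def gaussian_logpdf(x, mean, var=1.0):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def joint_score(obs, boundary, model):
    """Joint log score for unit 0 = obs[:boundary], unit 1 = obs[boundary:]."""
    score = 0.0
    for unit, frames in enumerate((obs[:boundary], obs[boundary:])):
        # Frame-based term: every frame scored independently.
        score += sum(gaussian_logpdf(x, model["frame_mean"][unit]) for x in frames)
        # Segment term: whole segment scored through its mean and its duration.
        seg_mean = sum(frames) / len(frames)
        score += gaussian_logpdf(seg_mean, model["seg_mean"][unit])
        score += gaussian_logpdf(len(frames), model["seg_dur"][unit], var=4.0)
    return score

def best_boundary(obs, model):
    """Segmentation with the highest joint probability under both models."""
    return max(range(1, len(obs)), key=lambda b: joint_score(obs, b, model))

def retrain(utterances, boundaries):
    """Re-estimate both models from the current segmentation."""
    pooled, seg_means, durs = ([], []), ([], []), ([], [])
    for obs, b in zip(utterances, boundaries):
        for unit, frames in enumerate((obs[:b], obs[b:])):
            pooled[unit].extend(frames)
            seg_means[unit].append(sum(frames) / len(frames))
            durs[unit].append(len(frames))
    mean = lambda xs: sum(xs) / len(xs)
    return {
        "frame_mean": [mean(p) for p in pooled],
        "seg_mean": [mean(s) for s in seg_means],
        "seg_dur": [mean(d) for d in durs],
    }

def train(utterances, model, max_iters=20):
    """Alternate segmentation and retraining until the segmentation is stable."""
    boundaries = None
    for _ in range(max_iters):
        new_boundaries = [best_boundary(obs, model) for obs in utterances]
        if new_boundaries == boundaries:
            break
        boundaries = new_boundaries
        model = retrain(utterances, boundaries)
    return model, boundaries

# Illustrative training data and a deliberately poor initial model.
utterances = [[0.1, -0.2, 0.0, 3.1, 2.9, 3.0, 3.2], [0.2, 0.0, 2.8, 3.1, 3.0]]
initial = {"frame_mean": [1.0, 2.0], "seg_mean": [1.0, 2.0], "seg_dur": [3.0, 3.0]}
model, boundaries = train(utterances, initial)
```

Starting from the rough initial parameters, the loop settles on boundaries that separate the low-valued frames from the high-valued frames, and the re-estimated frame means move toward the values observed in each unit.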