The present invention relates to speech recognition. In particular, the present invention relates to the use of segment models to perform speech recognition.
In speech recognition systems, an input speech signal is converted into words that represent the verbal content of the speech signal. This conversion begins by converting the analog speech signal into a series of digital values. The digital values are then passed through a feature extraction unit, which computes a sequence of feature vectors based on the digital values. Each feature vector is typically multi-dimensional and represents a single frame of the speech signal.
To identify a most likely sequence of words, the feature vectors are applied to one or more models that have been trained using a training text. Typically, this involves applying the feature vectors to a frame-based acoustic model in which a single frame state is associated with a single feature vector.
Recently, however, segment models have been introduced that associate multiple feature vectors with a single segment state. The segment models are thought to provide a more accurate model of large-scale transitions in human speech.
Although current segment models provide improved modeling of large-scale transitions, their training time and recognition time are less than optimum. As such, more efficient segment models are needed.