The invention relates generally to speech recognition systems, and relates more specifically to a segmentation approach used in speech recognition systems.
Most speech recognition systems include a recognizer that processes utterance data and detects modeling units that typically correspond to linguistic phonemes. Recognizers typically generate several types of data, including measurement data that is provided to a model computation stage, which evaluates the measurement data to determine the likelihood that certain utterance data represents particular phonemes. As used herein, the term "utterance" refers to one or more sounds generated either by humans or by machines. Examples of an utterance include, but are not limited to, a single sound, any two or more sounds, a single word or two or more words. Utterance data is a data representation of an utterance.
Many recognizers are either frame-based or segment-based. Frame-based recognizers analyze portions of utterance data ("frames") and determine the likelihood that a particular frame of utterance data is part of a particular linguistic unit such as a phoneme. For example, a frame-based recognizer may analyze a 10 ms frame of utterance data and determine the likelihood that the 10 ms frame of utterance data is part of the phoneme "f". Frames that are determined to be part of the same phoneme are then grouped together.
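The grouping step described above can be illustrated with a minimal Python sketch. It assumes that each frame has already been assigned its single most likely phoneme label and that frames are a fixed 10 ms long, as in the example; the function name `group_frames` and the tuple layout are hypothetical and are not taken from any particular recognizer.

```python
from itertools import groupby

FRAME_MS = 10  # assumed frame length, taken from the 10 ms example in the text


def group_frames(frame_labels, frame_ms=FRAME_MS):
    """Merge runs of consecutive frames that share the same phoneme label.

    frame_labels: one (hypothetical) most-likely phoneme label per frame.
    Returns a list of (label, start_ms, end_ms) tuples.
    """
    segments = []
    start = 0  # index of the first frame in the current run
    for label, run in groupby(frame_labels):
        count = len(list(run))
        segments.append((label, start * frame_ms, (start + count) * frame_ms))
        start += count
    return segments


if __name__ == "__main__":
    print(group_frames(["f", "f", "f", "ao", "ao", "r"]))
```

A practical frame-based recognizer would typically work with per-frame likelihoods rather than hard labels; the hard labels here only keep the illustration short.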
In contrast to frame-based recognizers, segment-based recognizers, often referred to as "segmenters," analyze frames of utterance data to find logical segments that define linguistic units contained in the utterance data. Each segment is defined by two boundaries that define the beginning and end of a linguistic unit. Boundaries are typically characterized by a sharp rise or fall in utterance data values. Segmenters analyze frame data looking for segment boundaries. Once the boundaries (and segments) have been identified, segmenters determine the probability that each segment is a particular linguistic unit, e.g., an "f".
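As a rough illustration of boundary finding, the following Python sketch marks a boundary wherever a per-frame value (for example, frame energy) rises or falls sharply relative to the previous frame, and then pairs adjacent boundaries into segments. The fixed threshold, the use of a single value per frame, and the function names are assumptions made for illustration only; they are not the method of any specific segmenter.

```python
def find_boundaries(values, threshold):
    """Return frame indices where the value changes sharply.

    A boundary is placed at frame i when the absolute change from
    frame i-1 to frame i exceeds `threshold` (an assumed, fixed cutoff).
    """
    return [
        i
        for i in range(1, len(values))
        if abs(values[i] - values[i - 1]) > threshold
    ]


def boundaries_to_segments(boundaries):
    """Pair each boundary with the next one to form (start, end) segments."""
    return list(zip(boundaries, boundaries[1:]))


if __name__ == "__main__":
    energy = [0.1, 0.1, 0.9, 1.0, 0.9, 0.2, 0.1]
    bounds = find_boundaries(energy, threshold=0.5)
    print(bounds, boundaries_to_segments(bounds))
```

Each resulting segment would then be scored against the set of known linguistic units.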
Segmenters tend to provide a relatively higher level of accuracy than frame-based recognizers because they are attempting to match an entire linguistic unit to a set of known linguistic units instead of trying to match a piece of a linguistic unit to a set of known linguistic units. However, frame-based recognizers generally provide better error recovery than segmenters since segmentation occurs during recognition instead of before recognition. That is, it can be difficult to recover from a segmentation error in segmenters, e.g., missing the first linguistic unit in a word. Some segmenters generate a large number of segments and then select an optimum set of segments to improve accuracy. However, the amount of computational resources that are required to process the segments is directly related to the number of segments. As a result, segmenters that attempt to improve accuracy by processing large numbers of segments can require significantly more computational resources than their frame-based counterparts.
Finding the boundaries that correspond to linguistic units such as phonemes is notoriously difficult. Because speech is often imprecisely articulated, there are sometimes no clear acoustic cues for boundaries. As a result, boundaries may be missed, which increases the likelihood that specific phonemes will not be recognized and thus reduces accuracy. Another problem is that boundaries may be incorrectly found in the utterance data where no linguistic units are present. This problem is common in silence regions, where background noise is more easily misinterpreted as a linguistic unit. Finding too many boundaries (and segments) adversely affects the performance of speech recognition systems, since their speed is highly dependent upon the number of segments processed. Because processing each segment requires computational resources, it is very important to limit the number of incorrect segment detections.
Based on the foregoing, there is a need for a speech recognizer mechanism that avoids the limitations in the prior approaches. There is a particular need for a speech recognizer mechanism that provides fast response with a relatively high level of accuracy while requiring a reduced amount of computational resources.
The foregoing needs, and other needs that will become apparent from the following description, are met by the present invention, in which a body of received utterance data is processed to determine a set of candidate phonetic unit boundaries that defines a set of candidate phonetic units. The set of candidate phonetic unit boundaries is determined based upon changes in cepstral coefficient values, changes in utterance energy, changes in phonetic classification, broad category analysis (retroflex, back vowels, front vowels) and sonorant onset detection. The set of candidate phonetic unit boundaries is filtered by priority and proximity to other candidate phonetic units and by silence regions. The set of candidate phonetic units is filtered using no-cross region analysis to generate a set of filtered candidate phonetic units. No-cross region analysis generally involves discarding candidate phonetic units that completely span an energy up, energy down, dip or broad category type no-cross region. Finally, a set of phonetic units is selected from the set of filtered candidate phonetic units based upon differences in utterance energy.
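The no-cross region filtering step can be pictured with a short Python sketch. It assumes that candidate phonetic units and no-cross regions are both represented as (start, end) frame-index pairs, and that a unit "completely spans" a region when the region lies entirely within the unit, endpoints included. The representation, the inclusive comparison and the function names are illustrative assumptions, not details given in this summary.

```python
def completely_spans(unit, region):
    """True when `region` lies entirely inside `unit` (endpoints inclusive)."""
    unit_start, unit_end = unit
    region_start, region_end = region
    return unit_start <= region_start and unit_end >= region_end


def filter_no_cross(candidates, no_cross_regions):
    """Discard candidate units that completely span any no-cross region."""
    return [
        unit
        for unit in candidates
        if not any(completely_spans(unit, region) for region in no_cross_regions)
    ]


if __name__ == "__main__":
    candidates = [(0, 10), (10, 15), (10, 25), (0, 25), (25, 40)]
    regions = [(12, 18)]  # hypothetical no-cross region
    print(filter_no_cross(candidates, regions))
```

In this sketch a candidate that only partly overlaps a no-cross region, such as (10, 15), is kept; only candidates that cover the whole region are discarded.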