I. Field of the Invention
In general, the present invention relates to speech processing (such as speech recognition). In particular, the invention relates to apparatus and method for characterizing speech as a string of spectral vectors and/or labels representing predefined prototype vectors of speech.
II. Description of the Problem
In speech processing, speech is generally represented by an n-dimensional space in which each dimension corresponds to some prescribed acoustic feature. For example, each component may represent an amplitude of energy in a respective frequency band. For a given time interval of speech, each component will have a respective amplitude. Taken together, the n amplitudes for the given time interval represent an n-component vector in the n-dimensional space.
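By way of illustration only, the following sketch shows one way in which the n amplitudes for a single time interval might be computed. The function name `band_energies`, the use of a naive discrete Fourier transform, and the division of the spectrum into equal-width bands are assumptions made for this example and are not taken from the description above.

```python
import cmath
import math

def band_energies(frame, n_bands):
    """Return one energy value per frequency band for one time interval.

    Assumption: bands are contiguous and of equal width, and a naive DFT
    is adequate for the short frame used in this sketch.
    """
    n = len(frame)
    half = n // 2
    # Power (squared magnitude) of each non-negative frequency bin.
    power = []
    for k in range(half):
        s = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame))
        power.append(abs(s) ** 2)
    # Sum the bin powers inside each band to get the n components.
    width = half // n_bands
    return [sum(power[b * width:(b + 1) * width]) for b in range(n_bands)]

# A 64-sample frame holding a pure tone that falls in the lowest band.
frame = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
vector = band_energies(frame, n_bands=4)  # a 4-component vector
```

In this example, nearly all of the energy appears in the first component of `vector`, since the tone lies within the lowest of the four bands.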
Based on a known sample text uttered during a training period, the n-dimensional space is divided into a fixed number of regions by some clustering algorithm. Each region represents sounds of a common prescribed type: sounds having component values which are within regional bounds. For each region, a prototype vector is defined to represent the region.
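The description above does not name a particular clustering algorithm. As a minimal sketch, assuming k-means clustering with squared Euclidean distance and a naive seeding from the first k training vectors (all of which are assumptions made for this example), prototype vectors might be derived as follows. The names `kmeans_prototypes` and `squared_distance` are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_prototypes(vectors, k, iterations=10):
    """Divide the training vectors into k regions; return one prototype each."""
    # Assumption: seed the prototypes with the first k training vectors.
    prototypes = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        # Assign every training vector to the region of its nearest prototype.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k),
                          key=lambda i: squared_distance(v, prototypes[i]))
            clusters[nearest].append(v)
        # Redefine each prototype as the mean of the vectors in its region.
        for i, members in enumerate(clusters):
            if members:
                prototypes[i] = [sum(col) / len(members)
                                 for col in zip(*members)]
    return prototypes

# Toy two-component training vectors forming two well-separated groups.
training = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0],
            [5.2, 4.9], [0.1, 0.2], [4.9, 5.0]]
prototypes = kmeans_prototypes(training, k=2)
```

Here each prototype is the mean of the vectors in its region. Other representatives, such as a centroid under a different distance measure, would serve equally well for purposes of the illustration.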
The prototype vectors are defined and stored for later processing. When an unknown speech input is uttered, for each time interval, a value is measured or computed for each of the n components, where each component is referred to as a feature. The values of all of the features are consolidated to form an n-component feature vector for a time interval.
In some instances, the feature vectors are used in subsequent processing.
In other instances, each feature vector is associated with one of the predefined prototype vectors and the associated prototype vectors are used in subsequent processing.
In associating prototype vectors with feature vectors, the feature vector for each time interval is typically compared to each prototype vector. Based on a predefined closeness measure, the distance between the feature vector and each prototype vector is determined and the closest prototype vector is selected.
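The association just described may be sketched as follows. The closeness measure is assumed here to be Euclidean distance, which is only one possible choice, and the names `euclidean` and `label_sequence` are hypothetical.

```python
import math

def euclidean(a, b):
    """Assumed closeness measure: Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def label_sequence(feature_vectors, prototypes, distance=euclidean):
    """For each time interval, return the index of the closest prototype."""
    labels = []
    for fv in feature_vectors:
        # Compare the feature vector to every prototype and keep the closest.
        closest = min(range(len(prototypes)),
                      key=lambda i: distance(fv, prototypes[i]))
        labels.append(closest)
    return labels

prototypes = [[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]]
features = [[0.3, -0.2], [4.6, 5.1], [0.4, 4.4], [2.0, 1.0]]
labels = label_sequence(features, prototypes)  # [0, 1, 2, 0]
```

The resulting list of indices corresponds to a string of labels, one per time interval, each identifying a predefined prototype vector.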
A speech event, such as a word or a phoneme, is characterized by a sequence of feature vectors in the time period over which the speech event was produced. Some prior art accounts for temporal variations in the generation of feature vector sequences. These variations may result from differences in speech between speakers or from a single speaker speaking at different times. The temporal variations are addressed by a process referred to as time warping, in which time periods are stretched or shrunk so that the time period of a feature vector sequence conforms to the time period of a reference prototype vector sequence, called a template. Oftentimes, the resultant feature vector sequence is styled as a "time normalized" feature vector sequence.
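The description above does not specify how the time warping is carried out. One well-known form is dynamic time warping, which is assumed for the sketch below; the function names, the local distance `abs_diff`, and the toy one-component sequences are likewise assumptions made for illustration.

```python
def dtw_distance(sequence, template, distance):
    """Accumulated cost of the best time alignment of sequence to template."""
    n, m = len(sequence), len(template)
    inf = float("inf")
    # cost[i][j]: best cost of aligning the first i vectors of the sequence
    # with the first j vectors of the template.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = distance(sequence[i - 1], template[j - 1])
            # Stretch the template, stretch the sequence, or advance both.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]

def abs_diff(a, b):
    """Assumed local distance: sum of absolute component differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

template = [[0.0], [1.0], [2.0], [1.0]]
# The same event spoken more slowly: some vectors are repeated.
slow = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0], [1.0]]
# A different event of the same length as the template.
other = [[2.0], [2.0], [0.0], [0.0]]

slow_cost = dtw_distance(slow, template, abs_diff)
other_cost = dtw_distance(other, template, abs_diff)
```

In this example, the slowed sequence aligns to the template at zero cost despite its greater length, whereas the different event does not.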
Because the feature vectors, the prototype vectors (or representations thereof) associated with the feature vectors, or both are used in subsequent speech processing, the proper characterization of the feature vectors and the proper selection of the closest prototype vector for each feature vector are critical.
The relationship between a feature vector and the prototype vectors has, in the past, normally been static: there has been a fixed set of prototype vectors, and each feature vector has been based on the values of set features.
However, due to ambient noise, signal drift, changes in the speech production of the talker, differences between talkers, or a combination of these, signal traits may vary over time. That is, the acoustic traits of the training data from which the prototype vectors are derived may differ from the acoustic traits of the data from which the test or new feature vectors are derived. The prototype vectors normally fit the new data less well than they fit the original training data. This affects the relationship between the prototype vectors and later-generated feature vectors, which results in a degradation of performance in the speech processor.
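The effect of such a mismatch can be illustrated with a toy example. The sketch below assumes that signal drift appears as a constant offset added to every component, which is a simplification made only for this example; the prototype values, the feature vectors, and the name `nearest` are hypothetical.

```python
def nearest(fv, prototypes):
    """Index of the prototype closest to fv (squared Euclidean distance)."""
    return min(range(len(prototypes)),
               key=lambda i: sum((x - y) ** 2
                                 for x, y in zip(fv, prototypes[i])))

# Fixed prototypes, as derived from the original training data.
prototypes = [[1.0, 1.0], [4.0, 4.0]]

# Feature vectors with the same acoustic traits as the training data.
clean = [[1.2, 0.9], [0.8, 1.1], [3.9, 4.2]]

# Assumption: drift adds a constant offset to every component.
drift = [2.0, 2.0]
shifted = [[x + d for x, d in zip(fv, drift)] for fv in clean]

clean_labels = [nearest(fv, prototypes) for fv in clean]      # [0, 0, 1]
shifted_labels = [nearest(fv, prototypes) for fv in shifted]  # [1, 1, 1]
```

With the prototypes held fixed, the first two drifted vectors are associated with a different prototype than before, although the underlying sounds are unchanged.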