The present invention relates generally to speech recognition systems, and more particularly, the invention relates to a system for representing a given speech utterance as a digital word prototype. The prototype is based on segment-based phoneme similarity data, resulting in a highly compact, data-driven representation of speech.
Conventional speech processing technology starts with a digitized speech utterance and then analyzes the digitized data in blocks comprising a predetermined number of samples. Thus, the conventional system breaks the incoming speech signal into time segments so that each segment can be analyzed and coded separately. Common practice is to analyze speech at a fixed frame rate of 50 to 100 frames per second and to represent speech by a fixed number of feature parameters in each time segment, or frame. These feature parameters are usually the parameters of a parametric model of the short-term spectral shape, together with their derivatives.
Hoshimi et al., U.S. Pat. No. 5,345,536, proposed a speech representation that also employs a fixed number of feature parameters per analysis frame, in which the feature parameters are phoneme similarity values. In this representation, it is not necessary to store all phoneme similarity values for each frame; instead, only the phoneme similarity value and phoneme index of the M largest phoneme similarity values are stored (e.g., with M=6, 12 parameters per frame, or 1200 parameters per second, assuming 100 frames per second).
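The storage arithmetic above can be illustrated with a minimal sketch. The function names `top_m_per_frame` and `parameters_per_second` and the sample similarity values are hypothetical, chosen only for illustration; they are not taken from the cited patent.

```python
def top_m_per_frame(similarities, m=6):
    """Keep the m largest (phoneme_index, similarity) pairs for one frame."""
    ranked = sorted(enumerate(similarities), key=lambda pair: pair[1], reverse=True)
    return ranked[:m]


def parameters_per_second(m=6, frames_per_second=100):
    """Each retained entry costs two parameters: a similarity value and a phoneme index."""
    return 2 * m * frames_per_second


# Hypothetical similarity values for one frame over ten phoneme symbols.
frame = [0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.05, 0.6, 0.4, 0.5]
top = top_m_per_frame(frame, m=6)
print(top)
print(parameters_per_second(m=6, frames_per_second=100))
```

With M=6 and 100 frames per second, the second call reproduces the figure of 1200 parameters per second quoted above.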
These conventional high data rate systems then compare the feature parameters, element by element, between the reference prototype (derived from the training data) and the unknown speech data. The number of comparisons is thus proportional to the square of the number of parameters used. Hence high data rate systems produce a high computational overhead that may rule out slower, less expensive processors that would otherwise be desirable for use in low cost consumer products.
Morin et al., U.S. Pat. No. 5,684,925, proposed an alternative approach to the digitized speech coding problem. This approach replaces the frame-based prototype with a feature-based prototype. More specifically, its recognition strategy is based on "targets" which characterize reliably found regions of high phoneme similarity (i.e., the number of high similarity regions found in S segments of equal time duration). Unlike frames, targets do not occur at a fixed rate per second. Instead of devoting equal computational energy to each frame in the utterance (as other conventional systems do), this approach concentrates its computational energy on only those high similarity regions with features that rise above a predetermined similarity threshold. This results in a data-driven digital prototype that can be used to electronically represent speech with roughly a fivefold to tenfold reduction in data rate. Because of the square law relationship described above, the reduced data rate substantially reduces computational overhead.
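To make the square law concrete, the following minimal sketch works through the arithmetic. It assumes only the proportionality stated above (comparisons growing with the square of the number of parameters); the function name `comparison_reduction` is hypothetical.

```python
def comparison_reduction(data_rate_reduction):
    """Factor by which comparisons shrink if their count grows with the square of the parameter count."""
    return data_rate_reduction ** 2


for factor in (5, 10):
    print(f"{factor}x lower data rate -> {comparison_reduction(factor)}x fewer comparisons")
```

Under that assumption, a fivefold to tenfold reduction in data rate corresponds to roughly 25 to 100 times fewer comparisons.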
While the feature-based prototype performed well as a fast-match stage for finding the best word candidates, it is not accurate and robust enough to serve as a speech recognizer that selects the one best word candidate, especially in the case of noise and channel mismatch. Degradation was shown to come from (1) the use of thresholds in the detection of high similarity regions within the similarity time series and (2) the frame-based segmentation method, in which segments were identified by dividing the utterance into S segments of equal time duration.
Thus, the present invention replaces the frame-based prototype with a more robust segment-based prototype. Instead of a discrete approach, a continuous approach is implemented to more completely account for the information held within the high similarity regions. Rather than using a threshold detection method, the phoneme similarity values are summed together to construct a word model. This continuous approach results in greater resolution, and yet still provides a compact word model size.
Furthermore, the segment-based prototype of the present invention employs an improved segmentation method. The segmentation method divides the utterance into S segments based on the sum of the phoneme similarity values in each segment. This allows for a fast static alignment of the test utterance that is independent of time.
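One way to picture such a segmentation is the sketch below. It is an illustrative assumption, not the specific procedure of the invention: each frame is assigned to a segment according to where the midpoint of its share of the cumulative similarity sum falls, and the per-frame totals are assumed to be non-negative. The name `segment_by_equal_sum` and the sample values are hypothetical.

```python
def segment_by_equal_sum(frame_totals, num_segments):
    """Return a segment index (0 .. num_segments - 1) for each frame.

    frame_totals holds, per frame, the sum of the phoneme similarity values
    (assumed non-negative). Frames are grouped so that each segment holds
    roughly the same share of the overall sum.
    """
    total = sum(frame_totals)
    if total <= 0:
        raise ValueError("no similarity mass to segment")
    labels = []
    running = 0.0
    for value in frame_totals:
        # Place the frame by the cumulative sum at the middle of its own contribution.
        midpoint = running + value / 2.0
        labels.append(min(int(midpoint / total * num_segments), num_segments - 1))
        running += value
    return labels


# The same hypothetical utterance spoken at normal speed and at half speed.
normal = [4, 1, 1, 1, 1, 4]
slow = [2, 2, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 2, 2]
print(segment_by_equal_sum(normal, 3))
print(segment_by_equal_sum(slow, 3))
```

In this example both versions are cut at the same points in the similarity sum even though the slow version has twice as many frames, which is the sense in which the alignment does not depend on time.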
Generally, the present invention represents a given speech utterance as a digital word prototype according to the following method. The word prototype is built by providing at least one utterance of training speech for a particular word. In order to increase the reliability of the phoneme similarity data found in the training data, two or more training utterances may also be used to construct the digital word prototype. Training utterances may be obtained from a given single speaker ("Speaker-Dependent" training), a large number of highly diverse representative speakers ("Speaker-Independent" training) or some other distribution of speakers (e.g. "Cross-Speaker" training).
For each given spoken utterance, the corresponding input signal is processed to obtain phoneme similarity values for each phoneme symbol in each time frame. The spoken utterance is then segmented. To do so, the phoneme similarity data is first normalized by subtracting out the background noise level found within the non-speech part of the input signal. The normalized phoneme similarity data is then divided into segments, such that the sum of all normalized phoneme similarity values in a segment is equal for each segment. For each particular phoneme, a sum is determined from the normalized phoneme similarity values within each of the segments. Next, a word model is constructed from the normalized phoneme similarity data. In other words, the word model is represented by a vector of summation values that compactly correlate to the normalized phoneme similarity data. Lastly, the results of individually processed utterances for a given spoken word (i.e., the individual word models) are combined to produce a digital word prototype that electronically represents the given spoken word.
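The sequence of steps described above can be sketched as follows. This is a simplified illustration under several assumptions that the text does not specify: the background noise level is taken as a per-phoneme mean over the non-speech frames, normalized values are clamped at zero, frames are assigned to segments by cumulative sum, and word models are combined by simple averaging. All function names and data are hypothetical.

```python
def normalize(frames, noise_frames):
    """Subtract a per-phoneme background level estimated from non-speech frames."""
    num_phonemes = len(frames[0])
    noise = [sum(f[p] for f in noise_frames) / len(noise_frames)
             for p in range(num_phonemes)]
    # Clamping at zero is an assumption that keeps the segment sums non-negative.
    return [[max(f[p] - noise[p], 0.0) for p in range(num_phonemes)] for f in frames]


def segment_labels(frames, num_segments):
    """Assign frames to segments holding roughly equal shares of the total sum."""
    totals = [sum(f) for f in frames]
    total = sum(totals)
    if total <= 0:
        raise ValueError("no similarity mass to segment")
    labels, running = [], 0.0
    for value in totals:
        midpoint = running + value / 2.0
        labels.append(min(int(midpoint / total * num_segments), num_segments - 1))
        running += value
    return labels


def word_model(frames, noise_frames, num_segments):
    """Return a flat vector of per-segment, per-phoneme similarity sums."""
    normalized = normalize(frames, noise_frames)
    labels = segment_labels(normalized, num_segments)
    num_phonemes = len(frames[0])
    sums = [[0.0] * num_phonemes for _ in range(num_segments)]
    for label, frame in zip(labels, normalized):
        for p, value in enumerate(frame):
            sums[label][p] += value
    return [value for segment in sums for value in segment]


def word_prototype(models):
    """Combine word models element by element; averaging is an assumed rule."""
    return [sum(values) / len(values) for values in zip(*models)]


# Two hypothetical training utterances of one word, with two phoneme symbols.
noise = [[1.0, 1.0]]
first = [[3, 1], [3, 1], [1, 3], [1, 3]]
second = [[3, 1]] * 3 + [[1, 3]] * 3
models = [word_model(u, noise, num_segments=2) for u in (first, second)]
print(models)
print(word_prototype(models))
```

Each word model here has S times P entries (segments times phoneme symbols) regardless of how many frames the utterance contains, which illustrates why the representation stays compact.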
For a more complete understanding of the invention, its objects and advantages, refer to the following specification and to the accompanying drawings.