Network trials of speech recognition indicate that automated services such as call routing and automatic call type recognition (ACTR) can be successfully offered on the network using small vocabulary, speaker independent recognition systems. The recognition method employed in these systems relies on isolation of the target vocabulary item in the incoming signal. If the precise beginning and end of the vocabulary item are known, correct recognition is possible in more than 99 percent of the responses. However, precise location of the item can be assured only in a highly artificial way: a speech specialist examines a spectrogram of the recorded response, marks the beginning and end of the item, and verifies the position of the marks by listening to the speech between them. In practice, isolated word recognition systems rely on an automatic endpointing process to place these marks. This process is accomplished by examining the energy profile of the signal and identifying possible beginning and ending points based on a set of rules. Performance of the isolated word recognition method using automatically generated endpoints is still very good (better than 95 percent) when the customer response contains only the desired vocabulary item. However, actual customers using recognition systems often respond with unsolicited input spoken in conjunction with the vocabulary item, and under these conditions the isolated word method performs very poorly. In a trial of ACTR service, more than 20 percent of customers responded in such a manner.
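By way of illustration only, the energy-based endpointing process described above might be sketched as follows. The function names, the fixed threshold, and the minimum run length are assumptions made for this example; they are not the rule set of any particular recognizer, which would typically be more elaborate.

```python
def frame_energies(samples, frame_len):
    """Sum of squared samples for each non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len])
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def find_endpoints(energies, threshold, min_frames=2):
    """Return (start, end) frame indices of the first run of at least
    `min_frames` consecutive frames whose energy exceeds `threshold`,
    or None if no such run exists. Both rules are illustrative assumptions."""
    start = None
    for i, energy in enumerate(energies):
        if energy > threshold:
            if start is None:
                start = i  # candidate beginning point
        else:
            if start is not None and i - start >= min_frames:
                return (start, i - 1)  # run was long enough: accept it
            start = None  # run too short (e.g., a click): discard it
    if start is not None and len(energies) - start >= min_frames:
        return (start, len(energies) - 1)
    return None


# Toy signal: silence, a burst of activity, silence (made-up values).
signal = [0] * 8 + [1, -1] * 4 + [0] * 8
energies = frame_energies(signal, frame_len=4)
endpoints = find_endpoints(energies, threshold=1.0)
```

Because a rule of this kind simply accepts the first sufficiently long high-energy run, any unsolicited speech preceding the vocabulary item would be marked as the item itself, which is consistent with the degradation noted above.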
Ideally, network services that use speech recognition technology would not place many constraints on the customer. Customers are not likely to be motivated to change their behavior to meet requirements of isolated word recognizers. Customers new to a service or those that use it infrequently should not be expected to respond to a recorded announcement with a carefully articulated utterance of only the target vocabulary item.
One of the important techniques in speech recognition is a procedure referred to as time warping. In order to compare an input pattern, e.g., a spoken word, with a reference word, each word is divided into a sequence of time frames. In each time frame, parameters representative of acoustic features of the speech pattern are obtained. For each frame of the input word, a frame of the reference word is selected. Measures representative of the similarity or correspondence between each selected pair of frames are obtained responsive to the acoustic feature signals. The similarity measures for the sequence of input and reference word frame pairs are used to determine the global or overall similarity between the input word and the reference word template.
Since there are many different ways of pronouncing the same word, the displacement in time of the acoustic features comprising the word is variable. Different utterances of the same word, even by the same individual, may be widely out of time alignment. The selection of frame pairs is therefore not necessarily linear. Matching, for example, the fourth, fifth and sixth frames of the input utterance with the fourth, fifth and sixth frames respectively of the reference word may distort the similarity measure and produce unacceptable errors.
Dynamic time warping arrangements have been developed which align the frames of a test and reference pattern in an efficient or optimal manner. The alignment is optimal in that the global similarity measure assumes an extremum. It may be, for example, that the fifth frame of the test word should be paired with the sixth frame of the reference word to obtain the best similarity measure.
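A minimal sketch of such an alignment, using the standard dynamic programming recurrence, is given below for illustration. It assumes that each frame is represented by a single number and that the frame-to-frame measure is an absolute difference, so that the extremum sought is a minimum; practical systems use feature vectors and often constrain the permitted path. The names `dtw_distance`, `linear_distance`, and `abs_diff` are hypothetical.

```python
def dtw_distance(test, ref, frame_dist):
    """Minimum accumulated frame distance over all monotonic alignments
    of `test` frames to `ref` frames (unconstrained step pattern)."""
    n, m = len(test), len(ref)
    inf = float("inf")
    # cost[i][j]: best accumulated distance aligning the first i test
    # frames with the first j reference frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(test[i - 1], ref[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # test frame advances, reference frame repeats
                cost[i][j - 1],      # reference frame advances, test frame repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def linear_distance(test, ref, frame_dist):
    """Frame-by-frame comparison with no warping (equal lengths assumed)."""
    return sum(frame_dist(a, b) for a, b in zip(test, ref))


def abs_diff(a, b):
    return abs(a - b)


# Two made-up utterances of the same "word" with different timing.
test_frames = [1, 2, 3, 3, 4]
ref_frames = [1, 2, 2, 3, 4]
warped = dtw_distance(test_frames, ref_frames, abs_diff)
unwarped = linear_distance(test_frames, ref_frames, abs_diff)
```

In this toy case the linear pairing penalizes the timing difference (accumulated distance 1), whereas the warped alignment pairs the frames so that the accumulated distance is 0.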
In the speech recognition arrangement disclosed in U.S. Pat. No. 4,519,094 issued to M. K. Brown et al. on May 21, 1985, a speech pattern is recognized as one of a plurality of reference patterns for which acoustic feature signal templates are stored. Each template includes a time frame (e.g., 10 millisecond) sequence of spectral parameters, e.g., LPC, and nonspectral parameters, e.g., acoustic energy (E) normalized to the peak energy over an utterance interval. LPC and normalized energy feature signal sequences are produced for an unknown speech pattern. For each time frame, the correspondence between the LPC features of the speech pattern and each reference is measured, as well as the correspondence between the energy (E) features. Including the energy features in the comparison reduces errors when background noise and other non-speech events, such as a door slam, have spectral features very similar to the spectral features of one of the reference patterns. In comparing the unknown speech features to those of the reference templates, the dynamic time warp distance $D_T = D_{LPC} + \alpha D_E$ is used, where $\alpha$ is a weighting factor selected to minimize the probability of erroneous recognition. Although the Brown arrangement represents an advance in the art, its performance is highly dependent on the weighting factor used during the time warp pattern matching procedure. Determining the correct weighting factor in a particular application is difficult because it is based on error probability distributions that vary depending on the characteristics of both the reference patterns and the input speech. In addition, since the Brown arrangement relies on an endpointing procedure, performance is substantially degraded in typical applications where responses are unconstrained.
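The sensitivity to the weighting factor may be illustrated by a short sketch of the weighted sum of the spectral and energy distances. The distance values below are invented for the example and the function names are hypothetical; the sketch does not reproduce the Brown arrangement itself, only the form of the combined distance.

```python
def combined_distance(d_lpc, d_energy, alpha):
    """Weighted sum of a spectral (LPC) distance and an energy distance."""
    return d_lpc + alpha * d_energy


def best_match(candidates, alpha):
    """Return the reference name with the smallest combined distance.
    `candidates` maps a reference name to a (d_lpc, d_energy) pair."""
    return min(
        candidates,
        key=lambda name: combined_distance(*candidates[name], alpha),
    )


# Invented distances from one input to two reference templates.
# "no" is slightly closer spectrally but much farther in energy contour.
candidates = {
    "yes": (0.35, 0.05),
    "no": (0.30, 0.40),
}

spectral_only = best_match(candidates, alpha=0.0)
with_energy = best_match(candidates, alpha=1.0)
```

With these made-up values the recognized word changes from "no" to "yes" once the energy term is weighted in (the crossover lies near a weighting factor of 1/7), which suggests why a weighting factor suited to one set of patterns may be poorly suited to another.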
In view of the foregoing, a need exists in the art for an improved speech recognition arrangement that substantially reduces the likelihood of recognizing background noise or other non-speech events as a reference speech pattern, particularly in applications where endpointing is unacceptable.