The present invention relates to speech recognition generally and to speaker dependent recognition in the presence of noise, in particular.
Speech recognition in noisy environments is a well studied, yet difficult task. One such task is characterized by the following parameters:
1. The recognition is speaker dependent, where the reference templates are created from speech utterances, spoken by the user in a designated "training session";
2. It is desired to keep the number of training utterances small (1-3), a regime in which it is known in the art that a dynamic time warping (DTW) matching algorithm works better than a hidden Markov model (HMM) algorithm;
3. The phrases to be recognized are isolated words;
4. The training phase is relatively noise-free, whereas the recognition needs to cope with additive environmental noise;
5. The environmental noise is unknown to the system prior to the instant the user pushes a push-to-talk (PTT) button and starts speaking;
6. The environmental noise has both stationary and non-stationary components; and
7. The system has limited fast-access memory, so that it is impossible to run DTW matching against all reference templates in real time and in a word-spotting manner. Therefore, two-stage processing is required, in which the first stage is a voice activity detector (VAD) and the second stage is a DTW matcher.
Two difficulties imposed by the noise in the recognition phase are:
1. Mismatch in the acoustics between the training and recognition phases; and
2. Inaccurate VAD estimates of the word endpoints in the recognition phase.
These two problems lead to recognition errors.
There are many techniques known in the art to deal with the acoustic mismatch problem. A good review can be found in Jean-Claude Junqua and Jean-Paul Haton, Robustness in Automatic Speech Recognition, Kluwer Academic Publishers, 1996. One technique is described in U.S. Pat. No. 5,778,342 to Erell et al.
The problem of inaccurate endpoints has been less covered in the art. One solution was given in the form of relaxed-endpoint DTW and is described in the following: Lawrence Rabiner and Biing-Hwang Juang, Fundamentals of Speech Recognition, Prentice Hall, 1993; Ilan D. Shallom, Raziel Haimi-Cohen and Tal Golan, "Dynamic Time Warping with Boundaries Constraint Relaxation", IEEE Conference in Israel, 1989, pages 1-4; and U.S. Pat. No. 5,732,394 to Nakadai et al.
In normal DTW, a sequence of spectral parameters from the speech start to end point is stored as an input speech pattern. The DTW operation matches the unknown speech pattern with the content of each reference template and calculates a distance measure between them. This is performed using the graph of FIG. 1A, to which reference is now briefly made. The frames of the input speech pattern are placed on the X axis and those of the current reference pattern are placed on the Y axis. A path is made through the graph, starting at the lower left corner and ending at the upper right corner, where the corners are defined as the endpoints of the test and reference utterances.
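The fixed-endpoint matching described above can be sketched as follows. This is a minimal illustration only, not the implementation of any cited system: the Euclidean frame distance, the three-way step pattern, and the names `frame_distance` and `dtw_distance` are all assumptions made for the example.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors (an assumed local distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(test, ref):
    """Accumulated distance of the best path from the lower left corner
    (first test frame, first reference frame) to the upper right corner
    (last test frame, last reference frame)."""
    n, m = len(test), len(ref)
    inf = float("inf")
    acc = [[inf] * m for _ in range(n)]
    acc[0][0] = frame_distance(test[0], ref[0])
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            # Assumed step pattern: horizontal, vertical, or diagonal move.
            best_prev = min(
                acc[i - 1][j] if i > 0 else inf,
                acc[i][j - 1] if j > 0 else inf,
                acc[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            acc[i][j] = frame_distance(test[i], ref[j]) + best_prev
    return acc[n - 1][m - 1]
```

Because the path is forced through both corners, any error in the detected endpoints places non-speech frames on the grid and adds directly to the accumulated distance.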
However, in the relaxed-endpoint solution, shown in FIG. 1B to which reference is now made, the DTW paths are not constrained to start or end at the exact endpoints of the test and reference utterances. Instead, paths can start or end within a given range (delta and Qmax_delta) of the corners. This method indeed eliminates some of the errors due to inaccurate endpoints.
However, the relaxed-endpoint solutions have several disadvantages. One disadvantage is illustrated in FIG. 2, to which reference is now briefly made: when one vocabulary word is similar to a part of another, longer vocabulary word (this is shown by the section marked "match"), the recognition system might incorrectly indicate that an utterance of the longer word matches the reference template of the shorter word.
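The relaxed-endpoint search and the partial-match failure described above can be reproduced with a small sketch. It is hypothetical throughout: the single range parameter `delta`, the unnormalized score, the step pattern, and the function names are assumptions for illustration and do not reproduce the formulation of any cited reference.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors (an assumed local distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def relaxed_dtw_distance(test, ref, delta):
    """DTW in which a path may start within `delta` frames of the lower left
    corner and end within `delta` frames of the upper right corner.
    With delta == 0 this reduces to fixed-endpoint DTW."""
    n, m = len(test), len(ref)
    inf = float("inf")
    acc = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            best_prev = min(
                acc[i - 1][j] if i > 0 else inf,
                acc[i][j - 1] if j > 0 else inf,
                acc[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            # Relaxed start: a path may begin on the bottom or left edge,
            # skipping up to `delta` leading frames of either pattern.
            if (i == 0 and j <= delta) or (j == 0 and i <= delta):
                best_prev = min(best_prev, 0.0)
            acc[i][j] = frame_distance(test[i], ref[j]) + best_prev
    # Relaxed end: a path may stop on the top or right edge,
    # skipping up to `delta` trailing frames of either pattern.
    ends = [acc[i][m - 1] for i in range(max(0, n - 1 - delta), n)]
    ends += [acc[n - 1][j] for j in range(max(0, m - 1 - delta), m)]
    return min(ends)


# Toy one-dimensional "features": the short word equals the start of the long word.
long_word = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
short_word = [[1.0], [2.0], [3.0]]

fixed_score = relaxed_dtw_distance(long_word, short_word, delta=0)
relaxed_score = relaxed_dtw_distance(long_word, short_word, delta=3)
```

In this toy case the fixed-endpoint score is nonzero, while the relaxed score drops to zero because the path is allowed to stop before the unmatched tail of the longer utterance. That is the false match the paragraph above warns about.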
Other disadvantages of the relaxed-endpoint methods are specific to each method. For example, in the article by Shallom et al., it is necessary to normalize, for each point on the DTW grid, the DTW accumulated score by the path length, since the relaxation of the beginning point now allows for multiple paths of different lengths. The length normalization introduces an extra computation load that does not exist in standard DTW. Also, because of the normalization, the standard DTW solution for the best matching path is in fact not optimal. As another example, in U.S. Pat. No. 5,732,394, there is a higher computation load since several DTW matches are performed for each pair of test and reference patterns, instead of one.
Another solution to the problem of inaccurate endpoints is given in the following articles: Tom Claes and Dirk Van Compernolle, "SNR-Normalization for Robust Speech Recognition", ICASSP 96, 1996, pages 331-334; Vijay Raman and Vidhya Ramanujam, "Robustness Issues and Solutions in Speech Recognition Based Telephony Services", ICASSP 97, 1997, pages 1523-1526; and Olli Viikki and Kari Laurila, "Noise Robust HMM-Based Speech Recognition Using Segmental Cepstral Feature Vector Normalization", ESCA-NATO Workshop on Robust Speech Recognition for Unknown Communication Channels, 1997, pages 107-110.
The approach in these publications is that of a single-stage HMM-based system, running in real time on the input speech, without a VAD. To deal with the noise segments, the HMM model of the word is concatenated on both ends with an HMM model of the noise, to form a composite model of the whole utterance.
The above solution has two disadvantages: (a) it cannot be applied to tasks that are constrained by items (2) and (7) above; and (b) the one-pass solutions lose some of their efficiency in dealing with the acoustic mismatch (problem 1), since in one-pass algorithms there is no accurate information about the noise level. This occurs because the word endpoints are not determined prior to the recognition and, therefore, the noise cannot be estimated from speech-free segments. This inaccurate noise estimate leads to recognition errors.
Another prior art method that also uses concatenated noise-speech-noise models for a DTW-based system is proposed in the article by B. Patrick Landell, Robert E. Wohlford and Lawrence G. Bahler entitled "Improved Speech Recognition in Noise", ICASSP 86, Tokyo, 1986, pages 749-751. Again, the idea is to avoid the use of endpoints in the DTW matching by using noise templates that are appended to the speech templates and matching the whole utterance to the concatenated templates. Also, to efficiently combat the acoustic mismatch problem, it is assumed that, prior to the beginning of the utterance, the system has knowledge of the noise, so that the reference templates can be adapted to the noise prior to the beginning of the matching process.
No details are given in the Landell et al. article as to how the noise templates are constructed or how to implement the DTW matching against the concatenated noise-speech-noise templates. Unlike with HMM, where the method is straightforward, this is a non-trivial problem in DTW: the DTW alignment constraints are tight, yet there is no accurate knowledge of the noise template duration, since it is not known when the speaker utters the word after pushing the PTT button.
Also, the Landell et al. article assumes that the noise acoustic features can be estimated from past observations of the noise, from before the speaker pushed the PTT button. For Landell et al.'s system, which was designed for an air force cockpit where the noise is fairly constant, this might be sufficient. However, with variable noise such as encountered during, for example, regular use of mobile phones, this past estimate can be inaccurate and can lead to recognition errors.
In all speech recognition applications, e.g., in voice-dialing by name, it is very important to reject utterances that are either not in the vocabulary, or are so badly pronounced that they yield erroneous recognition. This is usually done by setting a threshold to the recognition score (e.g., the DTW or HMM score), i.e., the recognition result is accepted only if the score is significant enough relative to the threshold.
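The threshold test described above can be sketched in a few lines. The sketch assumes a distance-like score in which lower means a better match (as with a DTW distance); the function name `recognize_with_rejection` and the word labels in the usage note are hypothetical.

```python
def recognize_with_rejection(scores, threshold):
    """Return the best-matching vocabulary word, or None if it is rejected.

    `scores` maps each vocabulary word to its matching distance
    (assumed convention: lower is better). The best candidate is
    accepted only if its distance is below `threshold`.
    """
    if not scores:
        return None
    best_word = min(scores, key=scores.get)
    if scores[best_word] < threshold:
        return best_word
    return None
```

For example, `recognize_with_rejection({"anna": 0.4, "bob": 0.9}, 0.5)` accepts "anna", while the same scores with a threshold of 0.3 are rejected. A single fixed threshold of this kind is what the following paragraphs identify as problematic when the score distribution shifts.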
It is generally difficult to achieve efficient rejection of out-of-vocabulary or mispronounced utterances without also sacrificing some in-vocabulary, well-pronounced utterances to rejection. The problem is difficult because of the high variability in the values of the best-match scores. Methods that are known in the art for improving the rejection capability of HMM systems include mostly the usage of a "general speech" template (these are discussed in the previously mentioned article by Raman, in U.S. Pat. No. 5,732,394 and in the article by Richard C. Rose and Douglas B. Paul, "A Hidden Markov Model Based Keyword Recognition System", ICASSP '90, 1990, page 129). Alternatively, as discussed in the article by Herve Bourlard, Bart D'hoore, and Jean-Marc Boite, "Optimizing Recognition and Rejection Performance in Wordspotting Systems", ICASSP '94, 1994, page 1-373, the rejection capability can be improved by using other competing candidate patterns as the threshold.
Even when such score-normalization methods are effective to the extent that the variability due to the specific utterance is minimized, there still remains a problem due to the variability in the environment. The matching between the test utterance and the templates is always worse in noisy conditions than in quiet conditions. This creates a problem for the rejection mechanism. Suppose that the rejection threshold on the normalized score is set to an optimal compromise, for quiet conditions, between rejection of out-of-vocabulary words and misdetection of in-vocabulary words. Then it might happen that in noisy conditions this compromise is not optimal. For example, the number of misdetections of in-vocabulary words will significantly increase. It may be desired in this case to relax the threshold, thereby reducing the number of misdetections of in-vocabulary words, even at the expense of less rejection of out-of-vocabulary words.
A solution to the problem is to adapt the threshold to the acoustic conditions, e.g., make the threshold a function of the signal to noise ratio (SNR), as in U.S. Pat. No. 5,778,342. This solution requires the estimation of the noise from speech-free waveform segments, which, in turn, requires knowledge of the speech endpoints, which are not known with sufficient precision. For example, if the interfering noise is a short burst that partially overlaps the speech, the burst may be erroneously identified by the VAD as part of the speech. Then, the signal beyond the endpoints will not contain the noise burst, and the SNR estimator will overestimate the SNR, leading to a badly-adapted rejection threshold.
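The dependence of such a threshold on the endpoint estimates can be made concrete with a short sketch. Everything in it is an assumption for illustration and is not taken from U.S. Pat. No. 5,778,342: the per-frame energy input, the linear interpolation between a quiet and a noisy threshold, the default values, and the names `estimate_snr_db` and `adapted_threshold`. The score is again assumed to be a distance, so a relaxed threshold is a larger one.

```python
import math


def estimate_snr_db(frame_energies, start, end):
    """Estimate the SNR in dB from per-frame energies.

    Frames in [start, end) are treated as speech (strictly, speech plus
    noise); all remaining frames are treated as speech-free noise.
    """
    speech = frame_energies[start:end]
    noise = frame_energies[:start] + frame_energies[end:]
    if not speech or not noise:
        raise ValueError("need at least one speech frame and one noise frame")
    speech_power = sum(speech) / len(speech)
    noise_power = sum(noise) / len(noise)
    if speech_power <= 0.0 or noise_power <= 0.0:
        raise ValueError("frame energies must be positive")
    return 10.0 * math.log10(speech_power / noise_power)


def adapted_threshold(snr_db, quiet_threshold=1.0, noisy_threshold=2.0,
                      low_snr_db=0.0, high_snr_db=30.0):
    """Interpolate linearly between a relaxed threshold at low SNR and a
    strict threshold at high SNR, clamping outside the given SNR range."""
    frac = (snr_db - low_snr_db) / (high_snr_db - low_snr_db)
    frac = min(1.0, max(0.0, frac))
    return noisy_threshold + frac * (quiet_threshold - noisy_threshold)
```

If a noise burst falls inside `[start, end)` because the VAD mistook it for speech, its energy is counted as speech rather than noise, the estimated SNR rises, and `adapted_threshold` returns a stricter value than the actual conditions warrant.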
Another source of score variability occurs in speaker dependent systems which allow the user to register either one word or two connected words. For example, in Voice Activated Dialing by name, a user may register either a first name, a last name, or a full name. In the first two cases, the utterance contains one word, whereas in the third case it contains two words. It is typically the case that two-word utterances have more variability in their pronunciation (e.g., the duration of the pause in between may vary significantly), so that the DTW or HMM matching scores typically differ from the ones encountered with one-word utterances. For example, with a standard DTW system, the score is typically higher for two-word utterances. (This statement is true even though the DTW scoring normalizes the accumulated score by the DTW path length, which is longer for two-word utterances than for one-word utterances.) This creates a problem for the rejection mechanism, since two-word utterances are rejected more than one-word utterances. This over-rejection is not "justified" from the performance point of view, since out-of-vocabulary two-word utterances are less likely to be accepted than one-word utterances.