I. Field of the Invention
The present invention relates to speech processing. More particularly, the present invention relates to a system and method for the automatic recognition of spoken words or phrases.
II. Description of the Related Art
Digital processing of speech signals has found widespread use, particularly in cellular telephone and PCS applications. One digital speech processing technique is speech recognition. Speech recognition is gaining importance for safety reasons. For example, it may be used to replace the manual task of pushing buttons on a cellular phone keypad, which is especially important when a user is initiating a telephone call while driving a car. When using a phone without speech recognition, the driver must remove one hand from the steering wheel and look at the phone keypad while pushing the buttons to dial the call. These acts increase the likelihood of a car accident. Speech recognition allows the driver to place telephone calls while continuously watching the road and keeping both hands on the steering wheel. Handsfree carkits containing speech recognition will likely be a legislated requirement in future systems.
Speaker-dependent speech recognition, the most common type in use today, operates in two phases: a training phase and a recognition phase. In the training phase, the speech recognition system prompts the user to speak each of the words in the vocabulary once or twice so it can learn the characteristics of the user's speech for these particular words or phrases. The recognition vocabulary sizes are typically small (less than 50 words), and the speech recognition system will only achieve high recognition accuracy for the user who trained it. An example of a vocabulary for a handsfree carkit system would include the digits on the keypad; the keywords “call”, “send”, “dial”, “cancel”, “clear”, “add”, “delete”, “history”, “program”, “yes”, and “no”; as well as 20 names of commonly-called coworkers, friends, or family members. Once training is complete, the user can initiate calls in the recognition phase by speaking the trained keywords. For example, if the name “John” were one of the trained names, the user could initiate a call to John by saying the phrase “Call John.” The speech recognition system recognizes the words “Call” and “John”, and dials the number that the user had previously entered as John's telephone number.
A block diagram of a training unit 6 of a speaker-dependent speech recognition system is shown in FIG. 1. Training unit 6 receives as input s(n), a set of digitized speech samples for the word or phrase to be trained. The speech signal s(n) is passed through parameter determination block 7, which produces a template of N parameters {p(n), n=1 . . . N} capturing the characteristics of the user's pronunciation of the particular word or phrase. Parameter determination block 7 may implement any of a number of speech parameter determination techniques, many of which are well known in the art. An exemplary embodiment of a parameter determination technique is the vocoder encoder described in U.S. Pat. No. 5,414,796, entitled “VARIABLE RATE VOCODER,” which is assigned to the assignee of the present invention and incorporated by reference herein. An alternative embodiment of a parameter determination technique is a fast Fourier transform (FFT), where the N parameters are the N FFT coefficients. Other embodiments derive parameters based on the FFT coefficients. Each spoken word or phrase produces one template of N parameters that is stored in template database 8. After training is completed over M vocabulary words, template database 8 contains M templates, each containing N parameters. Template database 8 is stored in non-volatile memory so that the templates remain resident when the power is turned off.
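As an illustration of the FFT-based embodiment, the following sketch derives an N-parameter template from a digitized utterance and builds a small template database. The frame length, the parameter count N=64, and the stand-in speech data are assumptions for demonstration only, not values taken from the patent.

```python
import numpy as np

def parameter_determination(s, n_params=64):
    """Produce a template {p(n), n=1 . . . N} from digitized speech s(n).

    Illustrative FFT-based embodiment: the template is the magnitudes of
    the first N FFT coefficients. n_params=64 is an assumed value.
    """
    spectrum = np.fft.rfft(np.asarray(s, dtype=float))
    return np.abs(spectrum[:n_params])

# Stand-in digitized utterances; in practice these are recorded prompts.
rng = np.random.default_rng(0)
training_speech = {
    "call": rng.standard_normal(1600),
    "send": rng.standard_normal(1600),
}

# Training over M vocabulary words yields M templates of N parameters each.
template_database = {
    word: parameter_determination(samples)
    for word, samples in training_speech.items()
}
```

In a deployed system the resulting `template_database` would be written to non-volatile memory, as described above.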
FIG. 2 is a block diagram of speech recognition unit 10, which operates during the recognition phase of a speaker-dependent speech recognition system. Speech recognition unit 10 comprises template database 14, which in general will be template database 8 from training unit 6. The input to speech recognition unit 10 is digitized input speech x(n), which is the speech to be recognized. The input speech x(n) is passed into parameter determination block 12, which performs the same parameter determination technique as parameter determination block 7 of training unit 6. Parameter determination block 12 produces a recognition template of N parameters {t(n), n=1 . . . N} that models the characteristics of input speech x(n). Recognition template t(n) is then passed to pattern comparison block 16, which performs a pattern comparison between template t(n) and all the templates stored in template database 14. The distances between template t(n) and each of the templates in template database 14 are forwarded to decision block 18, which selects from template database 14 the template that most closely matches recognition template t(n). The output of decision block 18 is the decision as to which word in the vocabulary was spoken.
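The pattern comparison and decision steps can be sketched as a nearest-template search. Euclidean distance is an assumed metric here (the text does not fix one at this point), and the two-element templates are toy values chosen only to make the example readable.

```python
import numpy as np

def recognize(t, template_database):
    """Blocks 16 and 18: compare recognition template t(n) against every
    stored template and decide on the closest match."""
    distances = {
        word: np.linalg.norm(t - template)  # assumed Euclidean distance
        for word, template in template_database.items()
    }
    return min(distances, key=distances.get)

# Toy template database for demonstration.
database = {
    "yes": np.array([1.0, 0.0]),
    "no":  np.array([0.0, 1.0]),
}
decision = recognize(np.array([0.9, 0.1]), database)
```

Here `decision` is `"yes"`, the vocabulary word whose stored template lies nearest the recognition template.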
Recognition accuracy is a measure of how well a recognition system correctly recognizes spoken words or phrases in the vocabulary. For example, a recognition accuracy of 95% indicates that the recognition unit correctly recognizes words in the vocabulary 95 times out of 100. In a traditional speech recognition system, the recognition accuracy is severely degraded in the presence of noise. The main reason for the loss of accuracy is that the training phase typically occurs in a quiet environment but the recognition typically occurs in a noisy environment. For example, a handsfree carkit speech recognition system is usually trained while the car is sitting in a garage or parked in the driveway, so the engine and air conditioning are not running and the windows are usually rolled up. However, recognition is normally used while the car is moving, so the engine is running, there is road and wind noise present, the windows may be down, etc. As a result of the disparity in noise level between the training and recognition phases, the recognition template does not form a good match with any of the templates obtained during training. This increases the likelihood of a recognition error or failure.
FIG. 3 illustrates a speech recognition unit 20 which must perform speech recognition in the presence of noise. As shown in FIG. 3, summer 22 adds speech signal x(n) to noise signal w(n) to produce noise-corrupted speech signal r(n). It should be understood that summer 22 is not a physical element of the system, but an artifact of a noisy environment. The noise-corrupted speech signal r(n) is input to parameter determination block 24, which produces noise-corrupted template t1(n). Pattern comparison block 28 compares template t1(n) with all the templates in template database 26, which was constructed in a quiet environment. Since noise-corrupted template t1(n) does not exactly match any of the training templates, there is a high probability that decision block 30 will produce a recognition error or failure.
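The mismatch FIG. 3 describes can be seen numerically: a template computed from the noise-corrupted signal r(n) = x(n) + w(n) drifts away from the template computed from the clean x(n) stored at training time. The sinusoidal "speech" signal, the Gaussian noise, and the FFT-based template are synthetic stand-ins used only to demonstrate the effect.

```python
import numpy as np

def fft_template(signal, n_params=64):
    # Illustrative FFT-based parameter determination, as in FIG. 1.
    return np.abs(np.fft.rfft(np.asarray(signal, dtype=float))[:n_params])

rng = np.random.default_rng(1)
x = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 800))  # clean "speech" x(n)
w = 0.5 * rng.standard_normal(800)                  # ambient noise w(n)
r = x + w                                           # summer 22: r(n) = x(n) + w(n)

clean_template = fft_template(x)  # what training stored (quiet environment)
noisy_template = fft_template(r)  # recognition template t1(n)
mismatch = np.linalg.norm(noisy_template - clean_template)
```

The clean template matches itself exactly, while `mismatch` is strictly positive; the larger it grows relative to the spacing between vocabulary templates, the more likely decision block 30 is to err.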
The present invention is a system and method for the automatic recognition of spoken words or phrases in the presence of noise. Speaker-dependent speech recognition systems operate in two phases: a training phase and a recognition phase. In the training phase of a traditional speech recognition system, a user is prompted to speak all the words or phrases in a specified vocabulary. The digitized speech samples for each word or phrase are processed to produce a template of parameters characterizing the spoken words. The output of the training phase is a library of such templates. In the recognition phase, the user speaks a particular word or phrase to initiate a desired action. The spoken word or phrase is digitized and processed to produce a template, which is compared with all the templates produced during training. The closest match determines the action that will be performed. The main impairment limiting the accuracy of speech recognition systems is the presence of noise. The addition of noise during recognition severely degrades recognition accuracy, because this noise was not present during training when the template database was produced. The invention recognizes the need to account for the particular noise conditions that are present at the time of recognition to improve recognition accuracy.
Instead of storing templates of parameters, the improved speech processing system and method stores the digitized speech samples for each spoken word or phrase in the training phase. The training phase output is therefore a digitized speech database. In the recognition phase, the noise characteristics in the audio environment are continually monitored. When the user speaks a word or phrase to initiate recognition, a noise-compensated template database is constructed by adding a noise signal to each of the signals in the speech database and performing parameter determination on each of the speech plus noise signals. One embodiment of this added noise signal is an artificially-synthesized noise signal with characteristics similar to that of the actual noise. An alternative embodiment is a recording of the time window of noise that occurred just before the user spoke the word or phrase to initiate recognition. Since the template database is constructed using the same type of noise that is present in the spoken word or phrase to be recognized, the speech recognition unit can find a good match between templates, improving the recognition accuracy.
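The recognition-phase compensation described above can be sketched end to end: raw digitized speech is stored at training time, the monitored ambient noise is added to each stored signal when recognition begins, and matching is done against the resulting noise-compensated templates. The FFT-based templates, the Euclidean metric, the signal lengths, and the synthetic speech and noise are all illustrative assumptions, not details fixed by the text.

```python
import numpy as np

def fft_template(signal, n_params=64):
    # Illustrative FFT-based parameter determination, as in FIG. 1.
    return np.abs(np.fft.rfft(np.asarray(signal, dtype=float))[:n_params])

rng = np.random.default_rng(2)

# Training output: raw digitized speech, not templates (stand-in data).
speech_database = {
    "call": np.sin(2 * np.pi * 4 * np.linspace(0, 1, 800)),
    "send": np.sin(2 * np.pi * 9 * np.linspace(0, 1, 800)),
}

# Noise taken from the window just before the utterance (one embodiment).
ambient_noise = 0.3 * rng.standard_normal(800)

# Recognition phase: build the noise-compensated template database by
# adding the monitored noise to each stored speech signal.
compensated_db = {
    word: fft_template(s + ambient_noise)
    for word, s in speech_database.items()
}

# The noisy utterance is compared against noise-compensated templates,
# so both sides of the comparison contain the same type of noise.
utterance = speech_database["call"] + 0.3 * rng.standard_normal(800)
t = fft_template(utterance)
best = min(compensated_db, key=lambda w: np.linalg.norm(t - compensated_db[w]))
```

Because the stored templates now carry noise of the same character as the utterance, `best` resolves to `"call"` despite the corruption, illustrating the improved match the invention aims for.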