1. Field of the Invention
This invention relates to speech recognition systems. Specifically, this invention relates to a novel system and method that enhances the accuracy of speech recognition systems by dynamically switching between reference patterns corresponding to training information produced under different ambient noise levels.
2. Description of Related Art and General Background
Speech recognition systems afford users the capability of performing various tasks on recognition-enabled apparatuses via verbal commands. FIG. 1A (Prior Art) is a high-level functional block diagram depicting a conventional speech recognition system 100. As indicated in FIG. 1A, system 100 comprises apparatus 105 and a sound capturing device 115 (e.g., microphone). Apparatus 105 includes a speech recognition processing mechanism 110 for analyzing and processing sounds captured by device 115 and for generating an identified utterance signal ui. Apparatus 105 also includes a statistical speech model 120 comprising a set of reference patterns, and related applications 125 for performing predetermined tasks ti. It is to be noted that apparatus 105 may take the form of a computer, telephone, or any device capable of recognizing and processing verbal commands and executing tasks based on those commands.
FIG. 1B is a high-level flow diagram depicting the general operation of system 100, denoted as process 150. As indicated in FIG. 1B, the sounds or utterances captured by device 115 are received by speech recognition processing mechanism 110 in analog form in block B155. In block B160, mechanism 110 samples and digitizes the analog utterances and assembles the digitized utterances into frames. In block B165, mechanism 110 then extracts acoustical information from the utterance frames by employing any of a number of well-known techniques, including Linear Predictive Coding (LPC) and Filter Bank Analyses (FBA).
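By way of illustration only, the framing and feature-extraction steps of blocks B160 and B165 may be sketched as follows. The sketch assumes an already digitized signal and uses per-frame log energy as a deliberately simple stand-in for the acoustical information; it is not an implementation of LPC or FBA. The function names, the 8 kHz sample rate, and the frame and hop sizes are hypothetical choices made for the example.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Assemble digitized samples into overlapping fixed-length frames."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames

def log_energy(frame):
    """Log of the mean squared amplitude; the small floor avoids log(0)."""
    power = sum(s * s for s in frame) / len(frame)
    return math.log(power + 1e-12)

# Hypothetical input: 100 ms of a 440 Hz tone sampled at 8 kHz.
SAMPLE_RATE = 8000
signal = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(800)]

# 25 ms frames (200 samples) advanced in 12.5 ms steps (100 samples).
frames = frame_signal(signal, frame_len=200, hop=100)
features = [log_energy(f) for f in frames]
```

Each entry of `features` corresponds to one utterance frame. A practical system would typically extract a vector of several coefficients per frame rather than a single value.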
In block B170, process 150 endeavors to "recognize" the speech captured by device 115 by having mechanism 110 compare the extracted acoustical information to a set of reference patterns stored in speech model 120. The reference patterns comprise a plurality of utterances to be recognized. As such, mechanism 110 determines the best match between the extracted acoustical information and reference patterns in order to identify the utterance received by mechanism 110. In performing the comparisons, mechanism 110 may employ a host of well-known statistical pattern matching techniques, including Hidden Markov Models, Neural Networks, Dynamic Time Warped models, Templates, or any other suitable word representation model. It is to be noted that the plurality of utterances comprising the reference patterns are based, at least in part, on speech training information produced during a training mode. Typically, in training mode, users recite a variety of selected verses into device 115 in order to acclimate mechanism 110 to the user's voice, prior to using system 100. To this end, the selected verses are designed to make the user articulate a wide range of sounds (e.g., diphones, phonemes, allophones, etc.).
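As one hedged illustration of the comparison in block B170, the following sketch uses dynamic time warping, one of the techniques named above, to find the stored reference pattern closest to a sequence of extracted features. It assumes one scalar feature per frame and a dictionary of labeled templates; the names `dtw_distance`, `best_match`, and `reference_patterns`, as well as the numeric values, are hypothetical and chosen only for the example.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two scalar feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances alone
                                 cost[i][j - 1],      # b advances alone
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

def best_match(features, reference_patterns):
    """Return the label of the reference pattern with the smallest distance."""
    return min(reference_patterns,
               key=lambda label: dtw_distance(features, reference_patterns[label]))

# Hypothetical templates produced during a training mode.
reference_patterns = {"yes": [1.0, 3.0, 2.0], "no": [5.0, 5.0, 1.0]}

# The same contour as "yes", spoken more slowly (first frame repeated).
observed = [1.0, 1.0, 3.0, 2.0]
identified = best_match(observed, reference_patterns)
```

Here `identified` plays the role of the identified utterance signal. Because the warping absorbs differences in speaking rate, the slower utterance still aligns with the "yes" template at zero cost.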
Based on the results of the comparison, mechanism 110, in block B175, generates an identified utterance signal ui, indicating the best match between the utterance received by mechanism 110 and the stored reference patterns. Mechanism 110 then supplies signal ui to applications 125 to perform the predetermined tasks ti.
As noted in FIG. 1A, speech training is performed in the presence of ambient noise level n, and thus, the utterances comprising the stored reference patterns are affected by ambient noise. Given the reliance of the reference patterns on the speech training information, system 100 is particularly susceptible to the contextual nature of speech training. For example, suppose apparatus 105 is a portable computer equipped with applications 125, configured to convert speech into text for word-processing tasks, and with mechanism 110, trained within a relatively serene environment (e.g., an office). Once moved from the serene environment into a noisier environment, such as, for example, an airplane, mechanism 110 may suffer a significant decrease in accuracy and fidelity. The reasons for such a decrease in performance may be two-fold. One reason may be that the ambient noise level n is so high that the sounds captured by sound capturing device 115 include a blend of speech and background noise, thus making it difficult to distinguish between the two.
Another reason, perhaps more common, is the fact that individuals have a tendency to manipulate their voices so as to ensure that the speech produced is understandable in the presence of substantial ambient noise. In doing so, individuals may, unwittingly, pronounce words with different phonological characteristics (e.g., level, inflections, stress, pitch, and rhythm) than normally produced during quieter conditions. As such, the performance of speech recognition processing mechanism 110, trained and acclimated to a user's pronunciations under certain conditions, may be adversely affected when mechanism 110 operates under different conditions.
Therefore, what is needed is a system and method that dynamically switches between reference patterns based on training information produced under different ambient noise levels to enhance speech recognition accuracy.
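Purely as an illustrative sketch of the stated need, and not as a description of any particular embodiment, switching between reference patterns might be pictured as estimating the current ambient noise level and selecting the pattern set whose training noise level is nearest. Every name and value below (`noise_level_db`, `select_reference_set`, the two decibel levels, and the sample data) is a hypothetical assumption introduced for the example.

```python
import math

def noise_level_db(noise_samples):
    """Estimate ambient noise power in dB from samples containing no speech."""
    power = sum(s * s for s in noise_samples) / len(noise_samples)
    return 10 * math.log10(power + 1e-12)  # small floor avoids log10(0)

def select_reference_set(level_db, reference_sets):
    """Pick the pattern set trained at the noise level closest to level_db."""
    nearest = min(reference_sets, key=lambda trained: abs(trained - level_db))
    return reference_sets[nearest]

# Hypothetical pattern sets keyed by the noise level (dB) present in training.
reference_sets = {
    -40.0: "quiet-office patterns",
    -10.0: "airplane-cabin patterns",
}

quiet_room = [0.01, -0.01] * 50  # low-amplitude background, about -40 dB
loud_cabin = [0.3, -0.3] * 50    # high-amplitude background, about -10 dB

active_quiet = select_reference_set(noise_level_db(quiet_room), reference_sets)
active_loud = select_reference_set(noise_level_db(loud_cabin), reference_sets)
```

A nearest-level lookup is only one possible selection rule; thresholds with hysteresis, for instance, could avoid rapid toggling when the measured level hovers between two training conditions.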