Human speech is increasingly used as the input form for data, instructions, commands, and other information entered into communication and data processing systems. Speech input can be used, for example, to conduct and record transactions electronically, to request and relay information electronically, and to provide command and control for various types of electronic communication and/or data processing systems. The use of human speech as the input provides considerable mobility and flexibility in the use of all types of electronic communication and data processing systems, especially systems where the use of peripheral devices such as a keyboard is awkward or inconvenient.
Direct input of speech into electronic systems requires that human speech signals be converted into a machine-readable form. Such conversion can be done with conventional speech recognition systems that typically convert voice-induced signals into a sequence of phonetically based recognition features using spectral analysis of speech segments, or into a sequence of feature vectors based on linear prediction characteristics of the speech. Such features reflect the various characteristics of the human voice such as pitch, volume, length, tremor, etc.
These speech-derived features provide an acoustic signal of the word to be recognized. The acoustic signal can be compared against an acoustic description or model of phonemes stored electronically in a database to obtain a statistically significant match. For example, each phoneme in the database whose pitch closely matches that of the particular segment of the inputted utterance can be found. Then, to narrow the search for a match, the tremor of each phoneme can be compared to the segment of the inputted utterance. The process can continue until a match having a desired confidence level is obtained.
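The successive narrowing described above can be illustrated with a minimal sketch. The `PhonemeModel` record, the tolerance parameters, and the numeric feature values below are assumptions made purely for illustration; they are not taken from any particular speech recognition system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhonemeModel:
    """Hypothetical stored description of one phoneme (illustrative units)."""
    label: str
    pitch: float
    tremor: float


def narrow_candidates(segment_pitch, segment_tremor, models, pitch_tol, tremor_tol):
    """Keep phonemes whose pitch is close to the segment, then filter by tremor."""
    # First pass: every phoneme whose pitch closely matches the segment.
    by_pitch = [m for m in models if abs(m.pitch - segment_pitch) <= pitch_tol]
    # Second pass: narrow the remaining candidates by comparing tremor.
    return [m for m in by_pitch if abs(m.tremor - segment_tremor) <= tremor_tol]


# Illustrative database entries (values are made up for the example).
MODELS = [
    PhonemeModel("AA", 120.0, 0.30),
    PhonemeModel("AE", 125.0, 0.10),
    PhonemeModel("IY", 210.0, 0.12),
]

matches = narrow_candidates(123.0, 0.12, MODELS, pitch_tol=10.0, tremor_tol=0.05)
print([m.label for m in matches])
```

In this sketch two candidates survive the pitch comparison and one survives the tremor comparison. A practical system would continue with further features, and would use statistical scores rather than fixed tolerances, until the desired confidence level is reached.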
In many speech recognition systems, for example, the acoustic signal is converted by an A/D converter into a digital representation of the successive amplitudes of the audio signal created by the underlying speech and then converted into a frequency domain signal consisting of a sequence of frames, each of which provides the amplitude of the speech signal in each of a plurality of frequency bands. The sequence of frames produced by the speech to be recognized is compared with a sequence of nodes, or frame models, corresponding to the acoustic model.
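The conversion of digitized amplitudes into a sequence of frames can be sketched as follows. This is a simplified illustration using a plain discrete Fourier transform over non-overlapping frames; the function name `band_amplitudes`, the frame size, and the number of bands are assumptions, and practical systems typically add windowing, frame overlap, and perceptually spaced bands.

```python
import cmath
import math


def band_amplitudes(samples, frame_size, num_bands):
    """Split samples into frames and return per-band amplitudes for each frame.

    Assumes frame_size // 2 is a positive multiple of num_bands.
    """
    half = frame_size // 2
    bins_per_band = half // num_bands
    frames = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        # Magnitude of each DFT bin below the Nyquist frequency.
        mags = [
            abs(sum(x * cmath.exp(-2j * math.pi * k * n / frame_size)
                    for n, x in enumerate(frame)))
            for k in range(half)
        ]
        # Average adjacent bins into the requested number of frequency bands.
        frames.append([
            sum(mags[b * bins_per_band:(b + 1) * bins_per_band]) / bins_per_band
            for b in range(num_bands)
        ])
    return frames


# Stand-in for A/D converter output: a pure tone centered on DFT bin 2.
samples = [math.sin(2 * math.pi * 2 * n / 16) for n in range(32)]
frames = band_amplitudes(samples, frame_size=16, num_bands=4)
for frame in frames:
    print([round(a, 3) for a in frame])
```

Each inner list is one frame, and each entry is the amplitude of the signal in one frequency band. Such a sequence is what would then be compared against the frame models of the acoustic model.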
Accordingly, a sequence of phonemes based on the underlying speech input is obtained. This sequence is then compared to phoneme groupings corresponding to a speech segment comprising one or more sentences, a phrase, or an individual word.
A language model can also be used to reduce the computational demands and increase the likelihood of a correct match. The particular language model typically predicts the relative likelihood of the occurrence of each word in the speech recognition system vocabulary given other words that have been identified in connection with the specific speech utterance. These predictions are based on the fact that the likelihood of a given word having been spoken is a function of its context as expressed by the other words in a sentence or segment of speech. The likelihoods can be determined, for example, by analyzing a large body of text and determining from that text the number of times that each word in the vocabulary is preceded by each other word in the vocabulary. Bigram language models, for example, give the likelihood of the occurrence of a word based on the immediately preceding word. Trigram language models, similarly, base likelihood on the occurrence of the two immediately preceding words.
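The counting approach described above can be illustrated for the bigram case with a short sketch. The function names and the three-sentence corpus are assumptions chosen for illustration; a practical model would be estimated from a far larger body of text and would typically smooth the estimates for word pairs never observed.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count how often each word follows each preceding word."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, word in zip(words, words[1:]):
            pair_counts[prev][word] += 1
    return pair_counts


def bigram_probability(pair_counts, prev, word):
    """Relative likelihood of `word` given the immediately preceding word."""
    following = pair_counts.get(prev, Counter())
    total = sum(following.values())
    return following[word] / total if total else 0.0


# Tiny illustrative corpus (not real training data).
corpus = ["recognize speech", "recognize speech input", "recognize words"]
counts = train_bigram(corpus)

p_speech = bigram_probability(counts, "recognize", "speech")
p_words = bigram_probability(counts, "recognize", "words")
print(p_speech, p_words)
```

Here "speech" follows "recognize" in two of the three observed cases, so it is ranked as the more likely continuation. A trigram model would extend the same counting to keys formed from the two preceding words.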
When the speech recognition system cannot identify a match, the speaker can be requested by the system to choose the correct word from a list of candidate words. If the speech recognition system makes an incorrect match, the speaker can be provided an opportunity to correct the choice. Such selections and corrections can be fed back into the speech recognition system and stored in a reference table to improve the accuracy of the system.
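One way such a reference table might be organized is sketched below. The passage does not specify how the stored corrections are used, so the `CorrectionTable` class and its policy of preferring the most frequently chosen correction are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


class CorrectionTable:
    """Hypothetical reference table of speaker corrections."""

    def __init__(self):
        # Maps a recognized word to counts of the words the speaker chose instead.
        self._corrections = defaultdict(Counter)

    def record(self, recognized, corrected):
        """Store one correction supplied by the speaker."""
        self._corrections[recognized][corrected] += 1

    def apply(self, recognized):
        """Return the most frequent stored correction, or the word unchanged."""
        choices = self._corrections.get(recognized)
        if not choices:
            return recognized
        return choices.most_common(1)[0][0]


table = CorrectionTable()
table.record("wreck", "recognize")
table.record("wreck", "recognize")
table.record("wreck", "wreak")

print(table.apply("wreck"))
print(table.apply("speech"))
```

In this sketch a word with no stored correction passes through unchanged, so the table only alters results for which the speaker has previously supplied feedback.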
In conventional speech recognition systems, the acoustic and language models used are typically specific to the language of the system. As already described, however, the recognition of a word depends on the translation of aural patterns into discrete, recognizable features representative of physical phenomena such as pitch, volume, and tremor. Accordingly, the accuracy of the speech recognition system depends critically on how well the speaker articulates the words he or she speaks. This, in turn, depends significantly on whether and to what extent the speaker speaks with an accent. It is this factor that frequently makes distinguishing and accurately recognizing the speech of non-native speakers of the language of the system highly problematic.
Proposed solutions to this problem include installing speech recognition engines for different languages and requesting that the user specify which language he or she would like to use. This solution is not viable, however, in every instance in which a non-native speaker is using a speech recognition system. For example, the system user may wish to use a particular language despite speaking it with the accent of another language. Even more problematic are those situations in which the user interface of the speech recognition system is only available in a particular language and the user has no recourse but to use that language. In such situations, a heavy accent may reduce the recognition accuracy so far as to render the user interface inoperable for that user.
Another proposed solution is to independently create an entirely new system incorporating pronunciation variants to attempt to improve the accuracy of the speech recognition system. Such a solution, however, may have only a limited ameliorative effect on accuracy and, in any event, is certain to increase the costliness of providing such a speech recognition system. Moreover, a system using a vast number of accent combinations is likely to be infeasible owing to the considerable portion of a processor's memory that would likely need to be allocated in order to store the acoustic and language models associated with such a system.