Speech recognition is the process that allows humans to interact with machines by using speech. Scientists have worked for years to develop the capability for machines to understand human speech. The applications of this capability are obvious. People can interface with machines through speech, as opposed to the cryptic command inputs that are the norm with today's personal computers, telephony devices, embedded devices and other programmable machinery. For example, a person who wants to access information by telephone may need to listen to multiple prompts and navigate through a complex phone system by pressing keys on a keypad or matching predefined keywords before the desired information is retrieved. This time-consuming process frustrates, and sometimes even discourages, the user, and increases the cost for the information provider.
The most common approach to speech recognition involves sound analysis of a digitized audio sample, and the matching of that sound sample to stored acoustic profiles representative of pre-defined words or utterances. Techniques for such matching include the Hidden Markov Model (HMM) and the Backus-Naur Form (BNF) techniques, both well known in the art. Typically, current techniques analyze audio streams and identify one single most probable phoneme per time-slice, while introducing a probabilistic bias for the following time-slice to recognize a single most probable phoneme. A successful “match” of an audio sample to an acoustic profile results in a predefined operation being executed. Such techniques typically force users to adapt their behavior by limiting their vocabulary, forcing them to learn commands that are recognized by the system or having them react to prompts, which takes significant time before the information of interest to them is communicated.
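The per-time-slice selection described above can be illustrated with a minimal sketch. It assumes each time-slice has already been reduced to a table of acoustic probabilities per phoneme, and that the bias for the following time-slice is a simple multiplicative weight keyed on the previously chosen phoneme. The function name greedy_phonemes, the data layout and the toy numbers are hypothetical and are not taken from any of the cited systems.

```python
def greedy_phonemes(frame_scores, transition_bias):
    """Keep a single most probable phoneme per time-slice.

    frame_scores: list of dicts mapping phoneme -> acoustic probability,
                  one dict per time-slice (assumed input format).
    transition_bias: dict mapping (previous_phoneme, phoneme) -> weight;
                     pairs that are not listed get a neutral weight of 1.0.
    """
    chosen = []
    previous = None
    for scores in frame_scores:
        best_phoneme, best_score = None, float("-inf")
        for phoneme, probability in scores.items():
            # Bias the current slice using the phoneme kept for the prior slice.
            bias = 1.0
            if previous is not None:
                bias = transition_bias.get((previous, phoneme), 1.0)
            score = probability * bias
            if score > best_score:
                best_phoneme, best_score = phoneme, score
        # Only the winner survives; every alternative is discarded here.
        chosen.append(best_phoneme)
        previous = best_phoneme
    return chosen


# Toy data (illustrative only): three time-slices for the word "cat".
frames = [
    {"k": 0.6, "g": 0.4},
    {"ae": 0.5, "eh": 0.5},
    {"t": 0.45, "d": 0.55},
]
bias = {("k", "ae"): 1.2, ("ae", "t"): 1.5}
result = greedy_phonemes(frames, bias)
print(result)  # ['k', 'ae', 't']
```

Because only one phoneme is retained per time-slice, an early wrong choice cannot be revisited later, which is one reason such systems are sensitive to pronunciation differences.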
One of the greatest obstacles to overcome in continuous speech recognition is the difficulty of recognizing words uttered by persons having different accents and/or voice intonations. For example, many speech recognition applications cannot recognize spoken words that do not match the stored acoustic information because of the speaker's particular pronunciation of those words. Often users of speech recognition programs must “train” their own speech recognition system by reading sentences or other materials to permit the machine to recognize that user's pronunciation of words. Such an approach cannot be used, however, for the casual user of a speech recognition system, since spending time to train the system would not be acceptable.
Several approaches involve the use of acoustical models of various words to identify words in digitized audio data. For example, U.S. Pat. No. 5,033,087 issued to Bahl et al. and titled “Method and Apparatus for the Automatic Determination of Phonological Rules as For a Continuous Speech Recognition System,” the disclosure of which is hereby incorporated by reference in a manner consistent with this disclosure, discloses the use of acoustical models of separate words in isolation in a vocabulary. The system also employs phonological rules which model the effects of coarticulation to adequately modify the pronunciations of words based on previous words uttered.
Similarly, U.S. Pat. No. 5,799,276 issued to Komissarchik et al. and titled “Knowledge-Based Speech Recognition System and Methods Having Frame Length Computed Based Upon Estimated Pitch Period of Vocalic Intervals,” the disclosure of which is hereby incorporated by reference in a manner consistent with this disclosure, discloses an apparatus and method for translating an input speech signal to text. The apparatus segments an input speech signal based on the detection of pitch period and generates a series of hypothetical acoustic feature vectors that characterize the signal in terms of primary acoustic events, detectable vowel sounds and other acoustic features. The apparatus and method employ a largely speaker-independent dictionary based upon the application of phonological and phonetic/acoustic rules to generate acoustic event transcriptions. Word choices are selected by comparing the generated acoustic event transcriptions to the series of hypothesized acoustic feature vectors.
Another approach is disclosed in U.S. Pat. No. 5,329,608 issued to Bocchieri et al. and titled “Automatic Speech Recognizer,” the disclosure of which is hereby incorporated by reference in a manner consistent with this disclosure. Bocchieri discloses an apparatus and method for generating phonetic transcription strings from data entered into the system and recording those strings in the system. A model is constructed of sub-words characteristic of spoken data and compared to the stored phonetic transcription strings to recognize the spoken data.
Yet another approach is to select candidate words by using spotting to slice a speech section into word units while simultaneously matching at the level of phoneme units, as disclosed in U.S. Pat. No. 6,236,964 issued to Tamura et al. and titled “Speech Recognition Apparatus and Method for Matching Inputted Speech and a Word Generated From Stored Reference Phoneme Data,” the disclosure of which is hereby incorporated by reference in a manner consistent with this disclosure.
As previously noted, several approaches use Hidden Markov Model techniques to identify a likely sequence of words that could have produced a given speech signal. For example, U.S. Pat. No. 5,752,227 issued to Lyberg and titled “Method and Arrangement for Speech to Text Conversion,” the disclosure of which is hereby incorporated by reference in a manner consistent with this disclosure, discloses identification of a string of phonemes from a given input speech by the use of Hidden Markov Model techniques. The phonemes are identified and joined together to form words and phrases/sentences, which are checked syntactically.
Typically, in prior art approaches, too much emphasis is put on straight sound recognition instead of recognizing speech as a whole, where syntax is used exclusively to build a concept and the concept itself is used in order to produce an adequate response.