The present invention relates to automatic speech recognition, and more particularly to an efficient system and method for continuous speech recognition of vocabulary words. A source for examples of the prior art and prior art mathematical techniques is Delaney, D. W., “Voice User Interface for Wireless Internetworking,” Qualifying Examination Report, Georgia Institute of Technology, School of Electrical and Computer Engineering, Atlanta, Ga., Jan. 30, 2001.
Automatic speech recognition is an important element of wireless connectivity. Pocket-sized devices having small screens and no keyboard will be enabled by speech technology to allow users to interact with systems in a natural manner. Similarly, automatic speech recognition is necessary for an autoattendant system providing hands-free telephone calling in which a user requests that a telephone number be dialed for a person whose name is spoken by the user. While this application is but one of many for the present invention, it raises many of the issues to be addressed in automatic speech recognition. The automatic speech recognition unit must include a vocabulary. In the present example, the vocabulary comprises names of people to be called. Known techniques for automatic speech recognition create stochastic models of word sequencing using training data. Then P(O|W) is estimated. This is the probability that a particular set of acoustic observations, O, corresponds to a model of a word W.
An important technique for correlating particular spoken sounds with models is the Hidden Markov Model. The Hidden Markov Model operates on outputs from audio circuitry which captures a sample of N frames for a given sound. A language is resolved into phonemes, which are the abstract units of the phonetic system of a language, each corresponding to a set of similar speech sounds that are perceived as a single distinctive sound in the language. The apparatus detects phones from the samples of N frames. A phone is the acoustic manifestation of one or more linguistically based phonemes or phoneme-like items. Each known word includes one or more phones.
Qualitatively, the decoder may be viewed as comparing one or more recognition models to features associated with an unknown utterance. The unknown utterance is recognized as the known word or words associated with the recognition model that the test pattern, i.e., the features of the unknown utterance, most closely matches. Recognition model parameters are estimated from static training data stored during an initial training period.
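By way of illustration only, the matching step described above can be sketched as a nearest-template search. This is a simplified stand-in and not the decoder of any particular system: each recognition model is reduced to a single hypothetical feature vector, closeness is assumed to be squared Euclidean distance, and the names `recognize`, `templates`, "alice" and "bob" are invented for the example. A practical decoder would instead score the utterance against statistical models such as HMMs.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def recognize(features, templates):
    """Return (word, distance) for the template closest to the features.

    `templates` maps each known word to one stored feature vector
    (an assumption made for this sketch; real models are richer).
    """
    best_word = min(
        templates, key=lambda word: squared_distance(features, templates[word])
    )
    return best_word, squared_distance(features, templates[best_word])


# Hypothetical two-word vocabulary with two-dimensional features.
templates = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
word, distance = recognize([0.9, 0.2], templates)
print(word, distance)
```

The returned distance is kept alongside the word because a system may wish to inspect how close the best match actually was before acting on it.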
The Hidden Markov Model (HMM) can best be described as a probabilistic state machine for the study of time series. In speech recognition, the time series is given by an observation vector O=(O1, O2, . . . , OT), where each Oi is an acoustically meaningful vector of speech data for the ith frame. HMMs are Markov chains whose state sequence is hidden by the output probabilities of each state. The states of an HMM with N states are indexed as {s1, s2, . . . , sN}. A state sk contains an output probability distribution B, which describes the probability that a particular observation is produced by that state. B can be either discrete or continuous. The HMM has an initial state distribution, π, which describes the probability of starting in any one of the N states. For convenience in notation, the entire HMM can be written as λ=(A, B, π), where A denotes the state transition probabilities. Speech recognition is primarily interested in the probability P(O|λ), the probability of the observation sequence O given the model λ.
The results of such decoding are not certain. The result could be a response to out-of-vocabulary words (OOVs) or another misrecognition. Such a misrecognition will generate the wrong telephone call. Practical systems must try to detect a speech recognition error and reject a speech recognition result when the result is not reliable. Prior systems have derived rejection information from acoustic model level data, language model level data and parser level data. Deriving and processing such data requires a good deal of processing power, which increases the expense of practical implementations and makes real-time operation more difficult to achieve.
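For illustration, the probability P(O|λ) discussed above can be computed for a discrete-output HMM with the standard forward algorithm. The following is a minimal sketch under stated assumptions rather than the method of any referenced system: the function name `forward_probability`, the two-state model, and all numeric values are hypothetical, observations are assumed to be integer symbol indices, and A is taken to be the state transition matrix.

```python
def forward_probability(obs, A, B, pi):
    """Return P(O | lambda) for a discrete-output HMM, lambda = (A, B, pi).

    obs : list of observation symbol indices O1..OT
    A   : A[i][j] = probability of moving from state i to state j
    B   : B[j][k] = probability that state j emits symbol k
    pi  : pi[j]   = probability of starting in state j
    """
    if not obs:
        raise ValueError("observation sequence must not be empty")
    n_states = len(pi)
    # Initialisation: alpha_1(j) = pi_j * b_j(O1)
    alpha = [pi[j] * B[j][obs[0]] for j in range(n_states)]
    # Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_ij) * b_j(O_{t+1})
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][symbol]
            for j in range(n_states)
        ]
    # Termination: sum over all states at the final frame
    return sum(alpha)


# Hypothetical two-state, two-symbol model (values chosen for the example).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
print(forward_probability([0, 1], A, B, pi))
```

For long utterances the products above underflow, so practical implementations typically work with scaled or logarithmic probabilities; that refinement is omitted here for brevity.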