This invention relates to a method and system that uses models of triphones, diphones, and phonemes to enable a machine to recognize spoken words, phrases, or sentences.
Speech recognition is an emerging technology with important applications in human-machine interfaces. Much recent work in this field has been directed toward statistical techniques based on stochastic modeling, using hidden Markov models.
A hidden Markov model (HMM) comprises a finite number of states with specified transition probabilities among the states. Each state is assumed to produce an observable value, referred to herein as a label, from a finite set of labels, which have different probability distributions for different states. In the context of speech recognition the labels represent classes of spectral features of audio waveforms. The model is hidden in the sense that while the labels are observable, the states that produce them are not.
An HMM speech recognizer has a dictionary of hidden Markov models corresponding to the words, phrases, or sentences to be recognized (referred to below as the target vocabulary). Given an utterance, the recognizer reduces the utterance to a label sequence, calculates the probability of that label sequence in each of the models in its dictionary, selects the model giving the highest probability, and thus recognizes the corresponding word, phrase, or sentence.
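By way of illustration only, the recognition step described above might be sketched as follows. The sketch assumes that labels are integer indices and that each model is stored as plain probability tables; the names DiscreteHMM and recognize are hypothetical and do not appear elsewhere in this description. The probability of a label sequence is computed here with the standard forward algorithm, which is one common choice and is not necessarily the calculation used in any particular recognizer.

```python
from typing import Dict, List, Sequence


class DiscreteHMM:
    """Hypothetical discrete-label HMM; each state emits one label per step."""

    def __init__(self,
                 initial: List[float],
                 transition: List[List[float]],
                 emission: List[List[float]]) -> None:
        self.initial = initial        # initial[i] = P(first state is i)
        self.transition = transition  # transition[i][j] = P(next state j | state i)
        self.emission = emission      # emission[i][k] = P(label k | state i)

    def sequence_probability(self, labels: Sequence[int]) -> float:
        """Probability of the label sequence, summed over all state paths
        (forward algorithm)."""
        if not labels:
            raise ValueError("label sequence must not be empty")
        n = len(self.initial)
        # alpha[i] = P(labels so far, current state is i)
        alpha = [self.initial[i] * self.emission[i][labels[0]] for i in range(n)]
        for label in labels[1:]:
            alpha = [
                sum(alpha[i] * self.transition[i][j] for i in range(n))
                * self.emission[j][label]
                for j in range(n)
            ]
        return sum(alpha)


def recognize(dictionary: Dict[str, DiscreteHMM], labels: Sequence[int]) -> str:
    """Return the dictionary entry whose model gives the highest probability."""
    return max(dictionary,
               key=lambda item: dictionary[item].sequence_probability(labels))
```

With a dictionary mapping each item of the target vocabulary to its model, a call such as recognize(dictionary, labels) returns the recognized item. For long utterances a practical implementation would normally work with log probabilities or scaling to avoid numerical underflow; that refinement is omitted here for brevity.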
The problem is to construct the dictionary. There exist algorithms for using sample data to train an HMM. One approach is to obtain spoken samples of the entire target vocabulary and train an HMM for each item in the target vocabulary. A drawback of this approach is that it becomes impractical if the target vocabulary is very large. A second drawback is that when the target vocabulary is expanded, the entire training procedure must be repeated for every new target word, phrase, or sentence.
Another approach is to start by making a dictionary of phonemes, containing one HMM for each phoneme in the target language. These phoneme HMMs can be concatenated to form HMMs of arbitrary words, phrases, or sentences, and in this way a dictionary of the entire target vocabulary can be constructed. This approach suffers, however, from poor accuracy, because when phonemes are concatenated in speech, co-articulation of consecutive phonemes tends to distort their spectral features, so that they are recognized incorrectly.
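The concatenation of phoneme HMMs referred to above may be pictured with the following minimal sketch. It assumes, purely for illustration, that each phoneme model is entered at its first state and left from its last state with an explicit exit probability; the names UnitHMM and concatenate are hypothetical. Concatenation then places the transition tables of the units on a block diagonal and routes the exit probability of each unit into the first state of the next.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UnitHMM:
    """Hypothetical left-to-right model of one phoneme (or of a longer unit)."""
    transition: List[List[float]]  # transition[i][j] within the unit
    emission: List[List[float]]    # emission[i][k] = P(label k | state i)
    exit_prob: float               # probability of leaving from the last state


def concatenate(units: List[UnitHMM]) -> UnitHMM:
    """Chain unit models in order into a single model."""
    if not units:
        raise ValueError("at least one unit is required")
    total = sum(len(u.transition) for u in units)
    transition = [[0.0] * total for _ in range(total)]
    emission: List[List[float]] = []
    offset = 0
    for index, unit in enumerate(units):
        n = len(unit.transition)
        # Copy the unit's own transitions onto the block diagonal.
        for i in range(n):
            for j in range(n):
                transition[offset + i][offset + j] = unit.transition[i][j]
        emission.extend(list(row) for row in unit.emission)
        # Link the last state of this unit to the first state of the next.
        if index + 1 < len(units):
            transition[offset + n - 1][offset + n] = unit.exit_prob
        offset += n
    return UnitHMM(transition, emission, units[-1].exit_prob)
```

Given a phoneme dictionary held as a mapping from phoneme symbols to UnitHMM objects, a word model would be obtained by a call such as concatenate([phoneme_models[p] for p in pronunciation]). Note that nothing in this construction adjusts the emission probabilities at the joins, which is consistent with the accuracy problem noted above: the co-articulation between neighboring phonemes is simply not modeled.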
Yet another approach is therefore to start from larger units such as diphones (two consecutive phonemes) or triphones (three consecutive phonemes), obtain HMMs of these, then assemble them into HMMs of words, phrases, or sentences. One problem in this approach is, again, the large amount of training data required. Even the Japanese language, with its comparatively simple phonetic structure, has more than six thousand triphones; obtaining and processing training data for the full set of triphones would be a prohibitively time-consuming task. Another problem is discontinuities that occur when, for example, triphone HMMs are assembled to create word HMMs.
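To make the notion of diphone and triphone units concrete, the following sketch lists the diphones or triphones contained in a phoneme sequence. It assumes that the units are taken as overlapping windows of consecutive phonemes, which is one possible reading of the definitions above; the function name phoneme_units and the example phoneme symbols are hypothetical.

```python
from typing import List, Sequence, Tuple


def phoneme_units(phonemes: Sequence[str], size: int) -> List[Tuple[str, ...]]:
    """Return every run of `size` consecutive phonemes, in order of occurrence.

    size=2 yields diphones and size=3 yields triphones.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    return [tuple(phonemes[i:i + size])
            for i in range(len(phonemes) - size + 1)]


# Example: the triphones of a six-phoneme word.
word = ["s", "a", "k", "u", "r", "a"]
triphones = phoneme_units(word, 3)
# triphones == [("s", "a", "k"), ("a", "k", "u"), ("k", "u", "r"), ("u", "r", "a")]
```

Collecting the distinct units over all pronunciations in a target vocabulary, for example with a set, shows how many unit HMMs would have to be trained for that vocabulary, and hence how quickly the training burden grows when the units are triphones rather than phonemes.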