I. Field of the Invention
The present invention relates to speech processing, such as speech recognition, in which each of a plurality of vocabulary words is to be represented and stored in a computer memory as a word baseform constructed of a sequence of Markov models.
II. Description of the Problem
In speech recognition, the use of Markov models has been suggested. In performing Markov model speech recognition, one essential step is characterizing each word in a vocabulary as a respective sequence of Markov models.
In the prior art, each Markov model normally represents a phoneme, or phonetic element. A human phonetician, based on his/her expertise and senses, defines each vocabulary word as a respective sequence of phonetic elements. The Markov models associated with the sequential phonetic elements are concatenated to form a phonetic word baseform. In FIG. 1, a phonetic word baseform 100 is shown for the word "THE" to include a sequence of three phonetic Markov models: a first for the phonetic element DH, a second for the phonetic element UH1, and a third for the phonetic element XX. An International Phonetic Alphabet lists standard phonetic elements.
Each of the three phonetic Markov models is shown having an initial state and a final state and a plurality of states in between, and a plurality of arcs each of which extends from a state to a state. During a training stage, a probability is determined for each arc and, for non-null arcs (represented with solid lines), label output probabilities are determined. Each label output probability corresponds to the likelihood of a label being produced at a given arc when the arc is followed. In earlier Markov model speech recognizer systems, such as that described in the co-pending, allowed patent application entitled "Speech Recognition System" by Bahl et al., Ser. No. 845,155 filed Mar. 27, 1986 now U.S. Pat. No. 4,718,094, issued Jan. 5, 1988--which is commonly owned with the present application and is incorporated by reference--each word in the vocabulary is represented by a sequence of phonetic Markov models like those illustrated in FIG. 1. During recognition, an acoustic processor generates a string of labels in response to a speech utterance. Based on the various paths the string of labels can take through the sequence of phonetic Markov models for each word and the probabilities of following arcs and producing labels thereat, the likelihood of the Markov model sequence for each word producing the string of labels is determined.
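The likelihood computation described above can be illustrated with a minimal sketch of a forward pass over a model with null and non-null arcs. The two-state topology, the arc probabilities, the labels "A" and "B", and the function name `sequence_likelihood` are all assumptions chosen for illustration; they are not taken from FIG. 1 or from the referenced application.

```python
def sequence_likelihood(arcs, n_states, labels):
    """Sum, over all paths from state 0 to the last state, the probability
    of following the arcs and producing the given label string.

    arcs: list of (src, dst, arc_prob, outputs), where outputs is a dict
    mapping label -> output probability for a non-null arc, or None for a
    null arc. Null arcs are assumed to satisfy src < dst.
    """
    null_arcs = sorted((a for a in arcs if a[3] is None), key=lambda a: a[0])

    def follow_null_arcs(alpha):
        # Null arcs produce no label, so they are followed without
        # consuming a label; increasing src order keeps alpha[src] final.
        for src, dst, prob, _ in null_arcs:
            alpha[dst] += alpha[src] * prob

    alpha = [1.0] + [0.0] * (n_states - 1)  # all probability starts in state 0
    follow_null_arcs(alpha)
    for label in labels:
        nxt = [0.0] * n_states
        for src, dst, prob, outputs in arcs:
            if outputs is not None:
                # A non-null arc is followed and produces the current label.
                nxt[dst] += alpha[src] * prob * outputs.get(label, 0.0)
        follow_null_arcs(nxt)
        alpha = nxt
    return alpha[-1]


# Hypothetical two-state model: a self-loop, a forward arc, and a null arc.
OUTPUTS = {"A": 0.8, "B": 0.2}
TOY_ARCS = [
    (0, 0, 0.3, OUTPUTS),  # non-null self-loop
    (0, 1, 0.6, OUTPUTS),  # non-null forward arc
    (0, 1, 0.1, None),     # null arc (produces no label)
]
likelihood = sequence_likelihood(TOY_ARCS, 2, ["A"])
```

In a recognizer of the kind described, a computation of this form would be repeated for the Markov model sequence of each candidate word, and the resulting likelihoods compared.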
There are a number of problems with the phonetic Markov model approach. First, the sequence of phonetic Markov models for each word is greatly dependent on the expertise and senses of the phonetician. From one phonetician to another, the sequence of Markov models associated with a given word may vary. Second, the Markov model associated with a phonetic element is relatively complex. Computations required in recognizing speech based on the phonetic Markov models can be considerable. And third, the accuracy of recognizing uttered words based solely on phonetic Markov models is not optimal.
A partial solution to the above-noted problems includes performing an approximate acoustic match to all words in order to produce a short list of candidate words. Each of the candidate words is then processed in a detailed acoustic match. By reducing the number of words that must be processed in detail, computational savings are achieved. This approach has been discussed in the aforementioned patent application entitled "Speech Recognition System".
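The two-stage approach may be sketched as follows. The function `recognize`, its parameters, and the two scoring callables are hypothetical placeholders; the actual approximate and detailed acoustic matches are those of the referenced application and are not reproduced here.

```python
def recognize(labels, vocabulary, approx_score, detailed_score, shortlist_size=3):
    """Run a cheap match on every word, then a costly match on a short list.

    approx_score and detailed_score are assumed to be callables taking
    (word, labels) and returning a score where larger means more likely.
    """
    # Stage 1: approximate match against all words in the vocabulary.
    ranked = sorted(vocabulary, key=lambda w: approx_score(w, labels), reverse=True)
    shortlist = ranked[:shortlist_size]
    # Stage 2: detailed match only against the short list of candidates.
    return max(shortlist, key=lambda w: detailed_score(w, labels))


# Toy scores standing in for real acoustic matches (illustrative values only).
APPROX = {"the": 0.9, "then": 0.8, "cat": 0.1, "dog": 0.05}
DETAILED = {"the": 0.7, "then": 0.75, "cat": 0.99, "dog": 0.0}
detailed_calls = []

def toy_detailed(word, labels):
    detailed_calls.append(word)  # record which words get the costly match
    return DETAILED[word]

best = recognize([], list(APPROX), lambda w, l: APPROX[w], toy_detailed, shortlist_size=2)
```

Only the words on the short list reach the detailed match, which is where the computational savings arise; the cost is that a word ranked poorly by the approximate match is never considered in detail.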
To enhance accuracy and to address the phonetician-dependence problem, recognition of speech based on a different type of Markov model has been suggested. To illustrate the different type of Markov model, it is observed that a Markov model speech recognition system typically includes an acoustic processor which converts an acoustic waveform (speech input) into a string of labels. The labels in the string are selected from an alphabet of labels, wherein each label corresponds to a defined cluster of vectors in an r-dimensional space which defines all speech. For each interval of time, the acoustic processor examines r--on the order of twenty--characteristics of speech (e.g., energy amplitudes at twenty respective frequency bands). Based on the values of the r characteristics, an r-component "feature vector" is defined. A selection is made as to which of plural exclusive clusters (for example 200 clusters) the feature vector belongs in. Each cluster is identified with a respective label. For each interval of time, a feature vector is generated by the acoustic processor; the cluster into which the feature vector belongs is determined; and the label for that cluster is associated with the time interval. The acoustic processor thus produces as output a string of labels.
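The labeling step performed by the acoustic processor can be illustrated with a short sketch. It assumes that each cluster is represented by a single prototype vector and that membership is decided by smallest Euclidean distance; the text above states only that each feature vector is assigned to one of plural exclusive clusters, so both assumptions, along with the toy dimension r = 2 and the names used, are illustrative.

```python
import math


def nearest_label(feature_vector, prototypes):
    """Return the label of the cluster whose prototype is closest to the vector."""
    best_label, best_dist = None, math.inf
    for label, prototype in prototypes.items():
        dist = math.dist(feature_vector, prototype)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def label_string(feature_vectors, prototypes):
    """Produce one label per time interval, in order."""
    return [nearest_label(v, prototypes) for v in feature_vectors]


# Toy alphabet of three labels in a 2-dimensional space (the text above
# mentions r on the order of twenty and, for example, 200 clusters).
PROTOTYPES = {"L1": (0.0, 0.0), "L2": (1.0, 0.0), "L3": (0.0, 1.0)}
frames = [(0.1, 0.0), (0.9, 0.1), (0.2, 0.8)]
labels = label_string(frames, PROTOTYPES)
```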
The aforementioned different type of Markov model relates to labels rather than phonetic elements. That is, for each label there is a Markov model. Where the term "feneme" suggests "label-related", there is a fenemic Markov model corresponding to each label.
In speech recognition based on fenemic Markov models, each word is represented by a sequence of fenemic Markov models in the form of a word baseform. For a string of labels generated by an acoustic processor in response to a speech utterance, the fenemic Markov model sequence for each word is matched thereagainst to determine word likelihood.
Because labels are not readily discernible as are phonetic elements, constructing a word baseform of fenemic Markov models is not readily performed by a human. Instead, fenemic word baseforms are constructed automatically by computer. A simple approach is for a speaker to utter each word once and for the acoustic processor to generate a string of labels in response. For successive labels in the string for a given word, the respective fenemic Markov models corresponding thereto are appended in sequence to form a fenemic Markov model baseform for the given word. Hence, if labels L1-L5-L10 - - - L50 are generated, the fenemic Markov models F.sub.1 F.sub.5 F.sub.10 - - - F.sub.50 form the fenemic Markov model word baseform. This type of baseform is referred to as a "singleton baseform." The singleton baseform is not particularly accurate because it is based on only a single utterance of the subject word. A poor pronunciation of the word or a word which is subject to varying pronunciations renders the singleton baseform especially unsatisfactory.
To improve on the singleton baseform, a word baseform constructed from multiple utterances of a subject word has been proposed. Apparatus and methodology therefor are described in the co-pending parent application. In that application, word baseforms are constructed which are not only more accurate because they are based on multiple utterances, but are also constructed automatically without human intervention.
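The singleton construction amounts to a one-to-one mapping from labels to fenemic Markov models, applied in order. A minimal sketch follows; the string naming scheme ("L5" maps to "F5") and the function names are assumptions made for illustration.

```python
def fenemic_model_for(label):
    """Return the name of the fenemic Markov model for a label, e.g. "L5" -> "F5"."""
    return "F" + label[1:]


def singleton_baseform(labels):
    """Append, in sequence, the fenemic model for each label of one utterance."""
    return [fenemic_model_for(label) for label in labels]


# Labels from a single utterance of a word (illustrative values).
baseform = singleton_baseform(["L1", "L5", "L10", "L50"])
```

Because the baseform simply mirrors one label string, any mispronunciation or labeling error in that single utterance is carried directly into the stored model sequence.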
The parent application mentions that baseforms for word segments, as well as whole words per se, may be derived from multiple utterances according to that invention.