I. Field of the Invention
The present invention relates in general to the recognition of speech probabilistically by representing words as respective sequences of Markov models in which labels--of a defined alphabet of labels--represent Markov model outputs.
II. Description of the Problem
In language processing, it is common for a human phonetician to segment words into sequences of phonetic elements--the phonetic elements being selected from the International Phonetic Alphabet. Typically, the phonetician listens to a word and, based on his/her expertise, matches successive portions of the word with respective phonetic elements to determine a phonetic spelling of the word.
Such phonetic sequences have been provided in standard dictionaries. Phonetic sequences have also been applied to speech recognition in general and to Markov model speech recognition in particular.
In the case of Markov model speech recognition, the various phonetic elements are represented by respective Markov models. Each word then corresponds to a sequence of phonetic Markov models.
FIG. 1 is a diagram depicting a sample Markov model which can represent a phonetic element. It is observed that the sample phonetic element Markov model includes seven states S1 through S7 and thirteen arcs (or transitions), each of which extends from a state to a state. Some arcs are simply loops extending from a state back to itself while the other arcs extend from one state to another. During a training session, a known word sequence is uttered and a probability for each arc in each Markov model is determined and stored.
Some arcs--referred to as "null arcs"--are depicted with dashed lines. Non-null arcs are shown with solid lines. For each non-null arc, there are a plurality of label output probabilities associated therewith. A label output probability is the probability of a given label being produced at a given non-null arc in a given Markov model. These probabilities are also determined during the training session.
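By way of illustration only, the arcs, arc probabilities, and label output probabilities described above might be held in a structure such as the following Python sketch. The names `Arc`, `check_model`, and `TOY_ARCS`, the two-state toy model, and all probability values are hypothetical and are not taken from FIG. 1; the sketch assumes that a null arc is simply an arc with no label output probabilities attached.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Arc:
    """One transition of a Markov model (hypothetical representation)."""
    source: int                     # state the arc leaves
    target: int                     # state the arc enters (same as source for a loop)
    prob: float                     # arc probability, determined during training
    # Label output probabilities; left empty for a null arc (assumption).
    label_probs: Dict[str, float] = field(default_factory=dict)

    @property
    def is_null(self) -> bool:
        # A null arc produces no label, so it carries no output distribution.
        return not self.label_probs


def check_model(arcs: List[Arc], tol: float = 1e-9) -> bool:
    """Return True if the arc probabilities leaving each state sum to 1 and
    the label output probabilities on each non-null arc sum to 1."""
    leaving: Dict[int, float] = {}
    for arc in arcs:
        leaving[arc.source] = leaving.get(arc.source, 0.0) + arc.prob
        if not arc.is_null and abs(sum(arc.label_probs.values()) - 1.0) > tol:
            return False
    return all(abs(total - 1.0) <= tol for total in leaving.values())


# Toy two-state model with made-up values: a loop, a non-null arc, a null arc.
TOY_ARCS = [
    Arc(0, 0, 0.3, {"A": 0.8, "B": 0.2}),  # loop from state 0 back to itself
    Arc(0, 1, 0.5, {"A": 0.4, "B": 0.6}),  # non-null arc (solid line)
    Arc(0, 1, 0.2),                        # null arc (dashed line)
]
```

The consistency check reflects only the usual requirement that probabilities form distributions; how the values are estimated during the training session is outside the scope of this sketch.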
In the recognition process, the Markov models are employed in conjunction with an acoustic processor. The acoustic processor, in brief, receives a speech input and processes successive intervals of speech based on pre-defined parameters. Sample parameters have, in the past, included energy amplitudes at various frequency bands. Treating each parameter characteristic (e.g. the amplitude at each frequency) as a vector component, the collection of amplitudes represents a vector in speech space. The acoustic processor stores a plurality of predefined prototype vectors having prescribed vector component--or parameter--values and assigns a label to each prototype vector. For each of successive intervals, a vector (referred to as a "feature vector") is generated by the acoustic processor 202 in response to an uttered input. Each component of a feature vector corresponds to the amplitude of a respective one of the parameters for a given interval. For each time interval, the label for the prototype vector which is "closest" to the feature vector is selected. For each interval, then, a label is generated by the acoustic processor.
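The labelling step described above may be sketched as follows. The text does not specify how "closest" is measured; the sketch assumes Euclidean distance, and the function name `label_intervals`, the label names, and the prototype values are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def label_intervals(
    feature_vectors: Sequence[Sequence[float]],
    prototypes: Dict[str, Sequence[float]],
) -> List[str]:
    """Assign to each interval the label of the nearest prototype vector.

    Assumption: "closest" means smallest Euclidean distance.
    """
    labels = []
    for fv in feature_vectors:
        # Pick the label whose prototype vector lies nearest this feature vector.
        nearest = min(prototypes, key=lambda name: math.dist(fv, prototypes[name]))
        labels.append(nearest)
    return labels


# Made-up two-component prototypes (e.g. amplitudes in two frequency bands).
prototypes = {"L1": (0.0, 0.0), "L2": (1.0, 1.0)}
features = [(0.1, 0.2), (0.9, 0.8), (0.4, 0.4)]
print(label_intervals(features, prototypes))
```

One label is produced per interval, so an utterance spanning many intervals yields a string of labels of the same length.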
The labels generated by the acoustic processor are the same labels which can be produced as label outputs along the arcs of the Markov models. After arc probabilities and label output probabilities are assigned during the training session, a procedure may be followed to determine the likelihood of a certain Markov model or sequence of Markov models--which corresponds to a "word baseform"--given a particular string of labels generated by the acoustic processor. That is, given that labels f_1 f_2 f_3 ... have been generated by the acoustic processor for successive intervals of speech, the likelihood of proceeding through each path of a Markov model (or sequence of models) and producing the string of generated labels can be determined. Performing this calculation for a word in a vocabulary provides a measure of that word's likelihood.
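One conventional way to sum the likelihood over every path is a forward pass, sketched below. The text does not name the procedure used, so this is an illustrative assumption rather than the method of the invention. The sketch further assumes that every null arc leads from a lower-numbered state to a higher-numbered one (so that null arcs form no cycles), that the model starts in state 0, and that it ends in the highest-numbered state. The function name, the arc tuple layout, and the toy values are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# (source state, target state, arc probability, label output probs or None for a null arc)
ArcT = Tuple[int, int, float, Optional[Dict[str, float]]]


def label_string_likelihood(
    labels: Sequence[str],
    arcs: Sequence[ArcT],
    n_states: int,
    start: int = 0,
    final: Optional[int] = None,
) -> float:
    """Probability of producing `labels` while passing from `start` to `final`,
    summed over all paths (forward pass).

    Assumption: each null arc goes from a lower- to a higher-numbered state.
    """
    if final is None:
        final = n_states - 1
    null_arcs = sorted((a for a in arcs if a[3] is None), key=lambda a: a[0])
    emitting = [a for a in arcs if a[3] is not None]

    def follow_null(alpha: List[float]) -> None:
        # Null arcs consume no label, so they spread probability within one time step.
        for src, dst, prob, _ in null_arcs:
            alpha[dst] += alpha[src] * prob

    alpha = [0.0] * n_states
    alpha[start] = 1.0
    follow_null(alpha)
    for label in labels:
        nxt = [0.0] * n_states
        for src, dst, prob, outputs in emitting:
            # Take the arc and produce the current label on it.
            nxt[dst] += alpha[src] * prob * outputs.get(label, 0.0)
        follow_null(nxt)
        alpha = nxt
    return alpha[final]


# Toy two-state model with made-up probabilities.
toy_arcs: List[ArcT] = [
    (0, 0, 0.3, {"A": 0.8, "B": 0.2}),  # loop on state 0
    (0, 1, 0.5, {"A": 0.4, "B": 0.6}),  # non-null arc
    (0, 1, 0.2, None),                  # null arc
]
print(label_string_likelihood(["A"], toy_arcs, n_states=2))
```

For the single label "A", two paths reach the final state: the non-null arc directly (0.5 x 0.4) and the loop followed by the null arc (0.3 x 0.8 x 0.2), for a total of approximately 0.248. Repeating the calculation with the model sequence of each vocabulary word gives the per-word measure referred to above.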
The accuracy of a Markov model speech recognizer is greatly dependent on (a) a proper selection of the parameter values for the labels in the alphabet of labels and (b) a proper selection of the Markov model structure and the statistics applied to the arcs. It has been found that reliance on human phoneticians to define the constituent parts of a word results in an arbitrariness in modelling, a lack of uniformity in the word baseforms (i.e., sequences of constituent parts which form a word), and accuracy levels which are not satisfactory.
In addition, the use of an alphabet of labels with fixed parameter values, which depends on some predefined clustering algorithm, has resulted in less than optimal recognition.