Phonologists have attempted to find the smallest set of sound units, called phonemes, sufficient to distinguish among different utterances. Each phoneme is represented by a symbol, called a phone. For instance, /p/ and /b/ are distinct phonemes of English, because they distinguish such words as pin and bin from each other. However, it should not be thought that acoustic intervals labeled by the same phoneme necessarily sound alike. The acoustic variants of a given phoneme are called allophones. Different sounds may be allophones of the same phoneme if at least one of two conditions is met that prevents them from distinguishing utterances: either the two allophones never occur in the same sound environment (such as the aspirated word-initial p of pot and the unaspirated final p of top), or, if they do, the substitution of one for the other does not produce a different word, but merely a different pronunciation of the same word.
A phonetic alphabet must be enlarged to lend itself to convenient pattern recognition by an acoustic processor. The region of the recognition space that corresponds to the acoustic variants of a given phoneme will have to be partitioned into a minimal number of compact, convex subregions, each subregion to be labelled by a different symbol, referred to as a sub-phone. The union of all of these subregions forms a convex region, which is represented by a symbol called a phone. Since the design of the structure of the speaker production model and of the acoustic processor should be guided by phonetic experience, it will be desirable to keep the partitioning system such that the obtained subregions are made up of more or less traditional perception units. For a more detailed discussion, see "Continuous Speech Recognition by Statistical Methods," Frederick Jelinek, Proc. of the IEEE, Vol. 64, No. 4, pp. 532-556 (April 1976), herein incorporated by reference.
Traditional speech recognition systems have used Hidden Markov Models (HMMs) to represent a phoneme or to represent a label. See U.S. Pat. No. 4,819,271 to Bahl et al., herein incorporated by reference. Context-dependent speech modeling systems typically utilize Hidden Markov Models. Hidden Markov Models are well known in the field of speech recognition. In general, a Hidden Markov Model is a sequence of probability distributions, states and arcs. Arcs are also called transitions. An observation vector is an output of the Hidden Markov Model. Associated with every arc is a probability distribution, e.g., a Gaussian density. The probability distributions are distributions on a series of observation vectors that are produced by an acoustic processor. When performing Hidden Markov Model speech recognition, one essential step is characterizing each word in a vocabulary as a sequence of Hidden Markov Models. Depending upon the model, each Hidden Markov Model represents either an entire word or a phoneme.
A Markov Model speech recognition system typically includes an acoustic processor which converts a speech input into a string of labels. The labels in the string are assigned to the output of the acoustic processor by utilizing a set of predefined prototypes, where each prototype corresponds to a cluster of vectors in the n-dimensional space that defines all speech. Based upon the values of n characteristics of the speech input, an n-component acoustic parameter vector is defined. An acoustic parameter vector is one type of feature vector. As discussed supra, speech is categorized into convex regions. Each convex region has a prototype vector associated with it. A prototype vector is a representative vector for the convex region. A selection is made as to which convex region a given acoustic parameter vector belongs in. In general, when determining which convex region an acoustic parameter vector should be associated with, the acoustic parameter vector is compared to the prototype vector associated with each label. As stated earlier, each convex region is identified with a respective label. For each interval of time, typically a centisecond, the acoustic processor generates a signal representing an acoustic parameter vector; the convex region to which the acoustic parameter vector belongs is then determined; and the label for that convex region is associated with the time interval. The acoustic processor thus produces a string of labels as its output. Context-independent speech recognition systems model a given phoneme individually. Context-dependent speech recognition systems model a given phoneme utilizing the preceding and/or following phoneme. In order to adequately model context-dependent speech, significantly more convex regions are necessary than when context-independent speech is modelled.
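The labelling step described above, in which each acoustic parameter vector is compared to the prototype vector of every label, can be illustrated with a minimal sketch. The sketch assumes that "compared" means choosing the prototype at the smallest Euclidean distance; the text does not specify the comparison measure. The function name `label_frames`, the label names, and the two-dimensional vectors are hypothetical and chosen only for illustration.

```python
import math

def label_frames(frames, prototypes):
    """Assign each frame the label of its closest prototype vector.

    frames:     list of n-component acoustic parameter vectors
    prototypes: dict mapping a label to its prototype vector
    Euclidean distance is an assumption made for this sketch.
    """
    labels = []
    for frame in frames:
        best_label = min(
            prototypes,
            key=lambda label: math.dist(frame, prototypes[label]),
        )
        labels.append(best_label)
    return labels

# Hypothetical prototypes, one per convex region (label).
prototypes = {"AA1": (0.0, 0.0), "AA2": (1.0, 1.0), "P1": (4.0, 0.0)}

# One acoustic parameter vector per time interval (e.g. per centisecond).
frames = [(0.1, -0.2), (0.9, 1.2), (3.5, 0.4), (0.6, 0.7)]

labels = label_frames(frames, prototypes)
print(labels)  # ['AA1', 'AA2', 'P1', 'AA2']
```

Under this assumption, the region assigned to each prototype is the set of vectors closer to it than to any other prototype, which is convex, consistent with the convex regions discussed above.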
Context-independent label prototype vectors are determined using an individual phoneme. The preceding and following phonemes are not considered during the development of these context-independent label prototype vectors. However, when words are spoken, a particular phoneme actually varies depending upon the previous phoneme and/or the following phoneme. The articulation of a sound may vary substantially when articulated in context compared to the articulation of the sound in isolation. Thus, depending on what sounds precede and follow a phoneme, the pattern of energy concentrations of a phoneme will change. Therefore, creating label prototype vectors which account for neighboring phonemes improves modelling of continuous speech. Accounting for the phonetic context results in context-dependent label prototype vector signals. Each phoneme has many variations depending upon the neighboring phonemes. The combination of a target phoneme and neighboring phonemes is called the phonetic context of the target phoneme.
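As an illustration of the phonetic context defined above, the following sketch pairs each target phoneme in a word with its immediate left and right neighbors. Limiting the context to one phoneme on each side, the boundary symbol "#", and the function name `phonetic_contexts` are assumptions made for this example only.

```python
def phonetic_contexts(phonemes, boundary="#"):
    """Return a (previous, target, next) triple for every phoneme.

    The boundary symbol marks the start and end of the sequence; it is
    a hypothetical convention, not one taken from the text.
    """
    padded = [boundary] + list(phonemes) + [boundary]
    return [
        (padded[i - 1], padded[i], padded[i + 1])
        for i in range(1, len(padded) - 1)
    ]

# The word "pin" as a phoneme sequence.
contexts = phonetic_contexts(["P", "IH", "N"])
print(contexts)
# [('#', 'P', 'IH'), ('P', 'IH', 'N'), ('IH', 'N', '#')]
```

The number of distinct triples grows much faster than the number of phonemes, which is one way to see why context-dependent modelling needs many more prototypes than context-independent modelling.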
Another type of Hidden Markov Model based speech recognition system relies on arc ranks as derived from context-dependent arc prototypes. In general, this type of system uses an acoustic processor to reduce an input speech signal to signals representing a sequence of continuous-valued acoustic parameter vectors. Each arc then assigns a conditional probability to each acoustic parameter vector. A rank processor sorts these conditional probabilities and outputs the rank of each acoustic parameter vector based upon these conditional probabilities. Such a system is said to be based upon arc ranks.
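The ranking step can be sketched as follows, under one reading of the description: for a single acoustic parameter vector, every arc scores the vector with its own probability distribution, and the arcs are then ordered from most to least likely. Diagonal-covariance Gaussian arc distributions, the arc names, and the functions `log_gaussian` and `arc_ranks` are assumptions for illustration and are not taken from the text.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def arc_ranks(vector, arcs):
    """Rank arcs for one acoustic parameter vector (1 = most likely).

    arcs maps an arc name to a (mean, variance) pair describing the
    distribution attached to that arc.
    """
    scores = {
        name: log_gaussian(vector, mean, var)
        for name, (mean, var) in arcs.items()
    }
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: rank for rank, name in enumerate(ordered, start=1)}

# Hypothetical arc distributions with unit variance in each dimension.
arcs = {
    "arc_a": ((0.0, 0.0), (1.0, 1.0)),
    "arc_b": ((2.0, 2.0), (1.0, 1.0)),
    "arc_c": ((5.0, 5.0), (1.0, 1.0)),
}

ranks = arc_ranks((1.8, 2.1), arcs)
print(ranks)  # {'arc_b': 1, 'arc_a': 2, 'arc_c': 3}
```

Working with ranks rather than raw probability values keeps only the ordering of the arcs, so training and recognition in such a system do not depend on acoustic labels.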
In a Hidden Markov Model arc rank speech recognition system, training and recognition are performed in terms of ranks and not in terms of acoustic labels. In order to model speech with greater accuracy than other systems, context-dependent arc prototypes are used. Using arc ranks obviates the need for a labeller when training or recognizing a speaker. However, developing a recognizer requires the use of acoustic labels for some purposes, such as the automatic creation of Hidden Markov Model word models. Therefore, a labeller that can operate on context-dependent prototypes is a necessity.
In order to label a frame of speech using context-dependent label prototype vector signals, it is desirable to know the exact phonetic context of the frame. In practice, this can never be known exactly, but it can be estimated from a Viterbi alignment. In general, the Viterbi alignment aligns each label with its corresponding phone. Since training data is used, the phonetic context of each phone is known. Therefore, the Viterbi alignment results in the alignment of each label with its phonetic context. The Viterbi alignment is further defined below. It is often impractical to Viterbi align large quantities of speech on the basis of ranks. A more efficient method of alignment uses acoustic labels. The problem is that acoustic labels are needed in order to compute a Viterbi alignment, and an alignment is needed in order to determine the labels, which are based upon context-dependent label prototype vector signals. What is needed is a method to resolve this mutual dependency.
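To make the role of the Viterbi alignment concrete, the following minimal sketch aligns a string of acoustic labels with a known phone sequence. It assumes a simple left-to-right model in which each frame either stays in the current phone or advances to the next one, and in which each phone has a discrete probability distribution over labels. The function name `viterbi_align`, the label names, and all probability values are hypothetical and serve only as an illustration.

```python
import math

def viterbi_align(labels, phones, emit, p_stay=0.5):
    """Return, for each label, the index of the phone it aligns to.

    emit[phone][label] is the probability that the phone produces the
    label. The path starts in the first phone and ends in the last.
    """
    n_frames, n_phones = len(labels), len(phones)
    if n_frames < n_phones:
        raise ValueError("need at least one label per phone")
    neg_inf = float("-inf")
    log_stay, log_move = math.log(p_stay), math.log(1.0 - p_stay)

    def log_emit(phone, label):
        # Small floor so that unseen labels do not give log(0).
        return math.log(emit[phone].get(label, 1e-6))

    score = [[neg_inf] * n_phones for _ in range(n_frames)]
    back = [[0] * n_phones for _ in range(n_frames)]
    score[0][0] = log_emit(phones[0], labels[0])

    for t in range(1, n_frames):
        for j in range(n_phones):
            stay = score[t - 1][j] + log_stay
            move = score[t - 1][j - 1] + log_move if j > 0 else neg_inf
            best, prev = (stay, j) if stay >= move else (move, j - 1)
            if best > neg_inf:
                score[t][j] = best + log_emit(phones[j], labels[t])
                back[t][j] = prev

    # Trace back from the last phone at the last frame.
    path = [n_phones - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

phones = ["P", "IH", "N"]  # known from the training script
emit = {
    "P": {"p1": 0.8, "i1": 0.1, "n1": 0.1},
    "IH": {"p1": 0.1, "i1": 0.8, "n1": 0.1},
    "N": {"p1": 0.1, "i1": 0.1, "n1": 0.8},
}
labels = ["p1", "p1", "i1", "i1", "i1", "n1"]

alignment = viterbi_align(labels, phones, emit)
print([(lab, phones[i]) for lab, i in zip(labels, alignment)])
```

In this example the first two labels align with P, the next three with IH, and the last with N. Because the phone sequence is known, each label inherits the phonetic context of the phone it is aligned with. The sketch also shows the dependency described above: the alignment cannot be computed until the labels are available.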