The invention relates to automatically grouping observed events into similar categories and correlating the groups of observed events with different contexts for use in automatic pattern recognition. More specifically, the invention relates to automatically grouping utterances of a speech unit (for example, utterances of a phoneme) into categories having similar sounds (that is, categories having similar acoustic features), and correlating the groups of utterances with different contexts, for use in automatic speech recognition.
In continuous speech, a co-articulation effect is known to exist. Co-articulation is the utterance of two or more speech units (for example, two words or two phonemes) with little or no pause between the speech units. The co-articulation effect is the variation in the pronunciation of a speech unit depending on the other speech units with which it is co-articulated. Prior attempts to manually enumerate a set of co-articulation rules have been prone to large errors.
In automatic speech recognition, the acoustic features of an utterance of an unknown sequence of one or more speech units are "fit" to a number of hypothesis models of hypothesis sequences of speech units. A sequence of speech units may be, for example, a string of one or more words. The hypothesis model of the sequence of speech units is constructed by concatenating the models of the words in the sequence.
While it would simplify the process of automatic speech recognition to provide only a single model for each word in the vocabulary, recognition accuracy can be improved by providing a different model of each word for each alternative pronunciation of the word.
An automatic method of generating co-articulation rules and generating word or phoneme models according to these rules is described in U.S. patent application Ser. No. 323,479, filed on Mar. 14, 1989, assigned to the assignee of the present application. In this method, multiple utterances of a phoneme in different contexts are processed to produce, for each utterance, (i) a string of labels representing the acoustic features of the uttered phoneme, and (ii) a representation of the context of the uttered phoneme (i.e. a representation of the context phonemes which precede and which follow the uttered phoneme). The label strings for the uttered phoneme are clustered into acoustically similar categories, so that each utterance of the phoneme is associated with a cluster number and a context.
From the multiple utterances, a number of candidate contexts are identified. For each candidate context and its complement, the conditional probabilities of each cluster are estimated. From the conditional probabilities, conditional entropies are calculated for each candidate context and its complement. The conditional entropies represent the average information obtained about the cluster number from the context of an uttered phoneme. The candidate context and its complement associated with the optimum entropies are selected as the basis for the best split of the multiple utterances. Each set of utterances resulting from this split is either further split in the manner described above, or is used as the basis of forming a context-dependent model of the phoneme.
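The splitting step described above can be illustrated with a minimal sketch. It is not taken from the referenced application. It assumes that each utterance is stored as a pair of a (preceding phoneme, following phoneme) context and a cluster number, that each candidate context is expressed as a predicate over that pair, and that the "optimum" split is the one with the lowest weighted conditional entropy. The names `entropy`, `split_entropy`, `best_split`, and the sample data are hypothetical.

```python
import math
from collections import Counter

def entropy(cluster_ids):
    """Shannon entropy, in bits, of the empirical cluster distribution."""
    total = len(cluster_ids)
    if total == 0:
        return 0.0
    counts = Counter(cluster_ids)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def split_entropy(utterances, in_context):
    """Conditional entropy of the cluster number given whether an utterance
    falls in the candidate context or in its complement."""
    inside = [cluster for ctx, cluster in utterances if in_context(ctx)]
    outside = [cluster for ctx, cluster in utterances if not in_context(ctx)]
    total = len(utterances)
    return (len(inside) / total) * entropy(inside) \
        + (len(outside) / total) * entropy(outside)

def best_split(utterances, candidates):
    """Name of the candidate context with the lowest conditional entropy
    (assumed here to be the meaning of 'optimum')."""
    return min(candidates, key=lambda name: split_entropy(utterances, candidates[name]))

# Hypothetical data: ((preceding phoneme, following phoneme), cluster number).
utterances = [
    (("s", "t"), 0),
    (("s", "k"), 0),
    (("m", "t"), 1),
    (("m", "k"), 1),
]

# Hypothetical candidate contexts, each a predicate over the context pair.
candidates = {
    "preceded_by_s": lambda ctx: ctx[0] == "s",
    "followed_by_t": lambda ctx: ctx[1] == "t",
}

chosen = best_split(utterances, candidates)
```

In this sample, splitting on the preceding phoneme separates the two clusters completely, while splitting on the following phoneme leaves both subsets evenly mixed, so `chosen` is `"preceded_by_s"`. Each resulting subset could then be split again with the same function or used to form a context-dependent model.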
The known method of enumerating co-articulation rules generates a number of different context-dependent models for each phoneme. The use of these context-dependent models according to the co-articulation rules increases recognition accuracy as compared to the use of a single model for each phoneme. However, the clustering step used in the known method requires a large amount of computation time. Moreover, because the entropy calculations (the splitting criteria) are based only on the cluster numbers, information about how different two clusters are from each other is lost. Two or more very dissimilar clusters may be combined into a single subset of the split so long as the total entropy of the split is reduced.
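The loss of information can be seen in a small sketch. It assumes three hypothetical clusters with one-dimensional centroids, in which clusters A and B are acoustically close and cluster C is far from both. The centroid values, the `spread` helper, and the two example splits are illustrative assumptions and do not come from the known method.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of the empirical label distribution."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def weighted_entropy(subsets):
    """Entropy of a split: subset entropies weighted by subset size."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * entropy(s) for s in subsets)

# Hypothetical 1-D centroids: A and B are close together, C is far away.
centroids = {"A": 0.0, "B": 0.1, "C": 9.0}

def spread(subset):
    """Largest centroid distance between any two clusters in a subset."""
    values = [centroids[label] for label in subset]
    return max(values) - min(values)

# One split groups the similar clusters A and B; the other groups A with C.
split_similar = [["A", "A", "B", "B"], ["C", "C"]]
split_dissimilar = [["A", "A", "C", "C"], ["B", "B"]]

h_similar = weighted_entropy(split_similar)
h_dissimilar = weighted_entropy(split_dissimilar)
max_spread_similar = max(spread(s) for s in split_similar)
max_spread_dissimilar = max(spread(s) for s in split_dissimilar)
```

Both splits receive the same entropy because the calculation sees only the cluster labels and their counts. The acoustic spread within the merged subset differs greatly between the two splits, and a criterion based on cluster numbers alone cannot tell them apart.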