Many speech recognition systems operate by comparing acoustic models of known words, or parts of words, against acoustic descriptions of speech to be recognized. They recognize the speech by identifying with it those words, or parts of words, whose models compare most closely with its description. In many such systems, digital signal processing converts the analog signal of an utterance to be recognized into a sequence of frames. Each such frame is a group of parameters associated with the analog signal during a brief period of time. Commonly the parameters represent the amplitude of the speech at each of a plurality of frequency bands. Such speech recognition systems commonly compare the sequence of frames produced by the utterance against a sequence of acoustic models, each of which has a model of the parameter values of frames associated with a given speech sound.
An example of such a speech recognition system is given in U.S. patent application Ser. No. 797,249, filed by Baker et al. on November 12th, 1985, for "Speech Recognition Apparatus and Method" (hereinafter referred to as "application Ser. No. 797,249"). Application Ser. No. 797,249 is assigned to the assignee of the present application and is incorporated herein by reference. It discloses a system in which each vocabulary word is represented by a sequence of statistical node models. Each such node model is a multidimensional probability distribution, each dimension of which represents the probability distribution for the values of a given frame parameter if its associated frame belongs to the class of sounds represented by the node model. Each dimension of the probability distribution is represented by two statistics, an estimated expected value, or mu, and an estimated absolute deviation, or sigma.
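The node model just described can be pictured with a minimal Python sketch. The class name `NodeModel`, the function `frame_score`, and the choice of a Laplacian density for each dimension are assumptions made only for illustration; the text above states that each dimension is summarized by a mu and a sigma, not which density or scoring rule is used.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class NodeModel:
    """Hypothetical container: one mu and one sigma per frame parameter."""
    mus: List[float]
    sigmas: List[float]


def frame_score(model: NodeModel, frame: List[float]) -> float:
    """Negative log likelihood of a frame under the model.

    Assumption: each dimension is an independent Laplacian density
    whose scale equals the stored sigma. Lower scores mean a closer match.
    """
    if not (len(frame) == len(model.mus) == len(model.sigmas)):
        raise ValueError("frame and model must have the same dimensions")
    score = 0.0
    for x, mu, sigma in zip(frame, model.mus, model.sigmas):
        score += abs(x - mu) / sigma + math.log(2.0 * sigma)
    return score


model = NodeModel(mus=[1.0, 2.0], sigmas=[0.5, 0.5])
near = frame_score(model, [1.0, 2.0])   # frame sitting exactly on the mus
far = frame_score(model, [3.0, 0.0])    # frame far from the mus
```

Under this assumed scoring rule, a frame closer to the model's mus receives a lower score than one farther away, which is the sense in which a model "compares closely" with a frame.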
A method for deriving statistical models of the basic type discussed in application Ser. No. 797,249 is disclosed in U.S. patent application Ser. No. 862,275, filed by Gillick et al. on May 12th, 1986, for "A Method For Representing Word Models For Use In Speech Recognition" (hereinafter referred to as "application Ser. No. 862,275"). Application Ser. No. 862,275 is assigned to the assignee of the present application, and is incorporated herein by reference. It discloses how to divide multiple utterances of the same word into corresponding groups of frames, called nodes, which represent corresponding sounds in the different utterances of the word, and how to derive a statistical model of the type described above for each such node. In addition, application Ser. No. 862,275 discloses how to divide the nodes from many words into groups, or clusters, of nodes with similar statistical acoustic models, and how to calculate a statistical acoustic model for each such cluster. The model for a given cluster is then used in place of the individual node models from different words which have been grouped into that cluster, greatly reducing the number of models which have to be stored. The use of such clusters also greatly reduces the number of words a new user has to speak in order to train a large vocabulary speech recognition system, since the user is only required to speak enough words to train a model for each cluster, rather than being required to separately speak, and train a model for, each word in the recognition system's vocabulary.
Although the methods of deriving and using statistical node and cluster models described in application Ser. Nos. 797,249 and 862,275 work well, it is still desirable to improve their performance. One problem with such statistical models is that the mu's and sigma's used for their parameters are only estimates. The estimated mu and estimated sigma for a given parameter of such a model are derived from a finite number of samples of the frames associated with the node or cluster which the models represent. The mu of a given parameter is the mean of that parameter over all frames used to calculate the model, and the sigma is the absolute deviation of that parameter over all such frames.
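The estimation of a mu and a sigma for each parameter, as described above, can be sketched as follows. The function name `estimate_node_model` and the sample frames are hypothetical, and "absolute deviation" is read here as the mean absolute deviation from the mean, which is an assumption about the exact definition.

```python
from statistics import fmean


def estimate_node_model(frames):
    """Return (mus, sigmas), one entry per frame parameter.

    frames: list of equal-length lists of parameter values.
    mu is the mean of a parameter over all frames; sigma is taken
    (by assumption) as the mean absolute deviation from that mean.
    """
    if not frames:
        raise ValueError("need at least one frame")
    n_params = len(frames[0])
    mus, sigmas = [], []
    for p in range(n_params):
        values = [frame[p] for frame in frames]
        mu = fmean(values)
        sigma = fmean([abs(v - mu) for v in values])
        mus.append(mu)
        sigmas.append(sigma)
    return mus, sigmas


# Three made-up frames with two parameters each.
frames = [[10.0, 4.0], [12.0, 4.0], [14.0, 7.0]]
mus, sigmas = estimate_node_model(frames)
```

With only three frames, as in this toy input, the resulting statistics are exactly the kind of rough estimates the passage is concerned with.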
In order for such estimates to be accurate, they should be based on a very large sampling of acoustic data. Unfortunately, it is often undesirably expensive or inconvenient to obtain a large enough sampling of the speech sounds associated with the model to make its mu's and sigma's as accurate as desired. It would normally take many thousands of utterances of each node or cluster to derive truly accurate statistics for its model. But the requirement of so many training utterances would make the initial use of speech recognition systems much more difficult. For this reason such systems normally operate with models derived from insufficient data, causing the performance of such systems to suffer.
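The dependence of estimation accuracy on sample size can be illustrated with a small simulation. This is only a sketch: the function name `mean_estimation_error`, the Gaussian source of synthetic parameter values, and the particular sample sizes are assumptions chosen for illustration and are not taken from the systems discussed above.

```python
import random
from statistics import fmean


def mean_estimation_error(sample_size, trials=200, true_mu=0.0,
                          spread=1.0, seed=0):
    """Average absolute error of the estimated mu over repeated trials.

    Each trial draws `sample_size` synthetic parameter values
    (Gaussian, by assumption) and compares their mean to `true_mu`.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    errors = []
    for _ in range(trials):
        sample = [rng.gauss(true_mu, spread) for _ in range(sample_size)]
        errors.append(abs(fmean(sample) - true_mu))
    return fmean(errors)


small = mean_estimation_error(10)     # few frames per model
large = mean_estimation_error(1000)   # many frames per model
```

In this sketch the average error of the estimated mu is markedly larger for the small sample than for the large one, since the error of a sample mean shrinks roughly with the square root of the number of samples.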
The statistical inaccuracy of acoustic models has other undesirable effects. For example, when acoustic models of nodes, that is, successive parts of individual words, are divided into clusters of nodes with similar acoustic models, as described in application Ser. No. 862,275, inaccuracies in the statistics of node models due to insufficient sampling data increase the chance that individual nodes will be put into the wrong clusters, or will be put into separate clusters all by themselves when they should not be. This makes the clustering process less accurate and less efficient.
Such statistical inaccuracies can also cause problems when cluster models of the type described in application Ser. No. 862,275, which are derived by having an end user speak each cluster in a small number of words, are used to recognize the sound represented by that cluster in a much larger number of words. This results both from the small number of frames from which each cluster model is derived and from the fact that speech sounds often vary with the context in which they are said. As a result, the statistics of a cluster model tend to be less accurate when they are derived from utterances of a small number of words than when they are derived from utterances of all the words in which the cluster is used.