The present invention relates to a method of estimating the parameters of a novel Hidden Markov Model (HMM) applicable to pattern recognition, such as recognition of speech signals and other time series signals, and to a method and an apparatus for pattern recognition employing the HMM. For convenience of explanation, an example of speech recognition is described below.
FIG. 3 is a block diagram of speech recognition using HMM. Block 301 is a speech analysis part, which converts an input speech signal into a feature vector at each fixed time interval (called a frame) by a known method such as a filter bank, Fourier transform, or linear prediction analysis. The input speech signal is thus converted into a feature vector series Y=(y(1), y(2), . . . , y(T)), where y(t) is the vector at time t and T is the number of frames. Block 302 is a so-called codebook, which holds the representative vector corresponding to each code in a form retrievable by the code (label). Block 303 is a vector quantizing part, which encodes (replaces) each vector in the vector series Y with the code corresponding to the closest representative vector registered in the codebook. Block 304 is an HMM creating part, which creates one HMM for each word in the vocabulary against which input speech is to be identified. That is, to make an HMM corresponding to word w, the structure of the HMM (the number of states and the transitions permitted between states) is first properly determined. Then, from the code series obtained from multiple utterances of the word w, the state transition probabilities and the occurrence probabilities of the codes emitted on state transitions are estimated so that the occurrence probability of those code series is as high as possible. Block 305 is an HMM memory part, which stores the obtained HMM for each word. Block 306 is a likelihood calculating part, which calculates the likelihood of each model stored in the HMM memory part 305 for the code series of unknown input speech. Block 307 is a decision part, which outputs, as the recognition result, the word corresponding to the model giving the maximum likelihood.
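The vector quantizing step of block 303 can be illustrated with a minimal sketch. It assumes that "closest" means smallest squared Euclidean distance, which the text does not specify; the function name `quantize` and the sample values are hypothetical.

```python
import numpy as np

def quantize(Y, codebook):
    """Replace each feature vector in Y (T x d) with the index of the
    closest representative vector in the codebook (M x d)."""
    # Squared Euclidean distance between every frame and every code vector
    # (the distance measure is an assumption of this sketch).
    dists = ((Y[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

# Hypothetical codebook with M = 3 codes and a series of T = 4 frames.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]])
Y = np.array([[0.1, -0.2], [0.9, 1.2], [3.5, 0.3], [0.8, 0.7]])
O = quantize(Y, codebook)   # code series o(1), ..., o(T), 0-based codes
```

Here the feature vector series Y is turned into a code series O of the same length, which is the form of input the discrete HMM operates on.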
Recognition by HMM is performed in practice as follows. Suppose the code series obtained for the unknown input is O=(o(1), o(2), . . . , o(T)), the model corresponding to the word w is λ^w, and an arbitrary state series of length T generated by model λ^w is X=(x(1), x(2), . . . , x(T)). The likelihood of model λ^w for the code series O is as follows. ##EQU1##
Or, using the logarithm, it is defined as follows: ##EQU2## where P(O, X | λ^w) refers to the joint (simultaneous) probability of O and X in model λ^w.
Therefore, when equation (1) is used, for example, and ##EQU3## is assumed, w is the recognition result. The same applies when equation (2) or (3) is used.
The joint probability P(O, X | λ) of O and X in model λ is determined as follows.
Suppose that, for every state q_i (i=1 to I) of HMM λ, the occurrence probability b_i,o(t) of code o(t) and the transition probability a_ij from state q_i (i=1 to I) to state q_j (j=1 to I+1) are given. The joint probability of X and O occurring from HMM λ is then defined as follows: ##EQU4## where π_x(1) is the initial probability of state x(1). Incidentally, x(T+1)=I+1 is the final state, in which it is assumed that no code occurs (in practice, a final state is added in most cases, as it is here).
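The joint probability just described can be sketched in code. Because the equation itself is not reproduced in this text, the sketch assumes the standard form, namely the initial probability of x(1) multiplied, for each frame t, by the occurrence probability of o(t) in state x(t) and the transition probability from x(t) to x(t+1), with x(T+1) the non-emitting final state. The function name and all numeric values are hypothetical, and indices are 0-based.

```python
import numpy as np

def joint_probability(pi, A, B, X, O):
    """P(O, X | lambda) for a state series X and a code series O.

    pi : (I,)      initial state probabilities
    A  : (I, I+1)  transition probabilities; column I is the final state
    B  : (I, M)    code occurrence probabilities per state
    Assumes the standard product form (not quoted from the source).
    """
    I = A.shape[0]
    path = list(X) + [I]            # append the final state x(T+1)
    p = pi[X[0]]
    for t, code in enumerate(O):
        p *= B[path[t], code] * A[path[t], path[t + 1]]
    return p

# Hypothetical two-state left-to-right model with two codes.
pi = np.array([1.0, 0.0])
A = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5]])
B = np.array([[0.75, 0.25],
              [0.25, 0.75]])
p = joint_probability(pi, A, B, X=[0, 0, 1], O=[0, 0, 1])
```

Each frame contributes one emission factor and one transition factor, so the final state enters only through the last transition and emits no code.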
In this example the input feature vector y(t) is converted into a code o(t). In other methods, the feature vector y(t) may be used directly, with a probability density function of y(t) given for each state in place of the code occurrence probabilities. In such a case, in equation (5), the probability density b_i(y(t)) of feature vector y(t) is used instead of the occurrence probability b_i,o(t) of the code o(t) in state q_i. Equations (1), (2), and (3) may then be rewritten as follows. ##EQU5##
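As an illustration of a state-dependent density b_i(y(t)), the following sketch evaluates the logarithm of a Gaussian density with a diagonal covariance. The text does not say which density family is used, so the Gaussian form, the function name, and the parameter values are all assumptions.

```python
import numpy as np

def log_gaussian_density(y, mean, var):
    """Log of a diagonal-covariance Gaussian density evaluated at y.

    The Gaussian family is an assumption of this sketch; the source
    only states that a probability density is given for each state.
    """
    y, mean, var = (np.asarray(v, dtype=float) for v in (y, mean, var))
    return float(-0.5 * np.sum(np.log(2.0 * np.pi * var) + (y - mean) ** 2 / var))

# Hypothetical state with zero mean and unit variance in two dimensions.
log_b = log_gaussian_density([0.0, 0.0], mean=[0.0, 0.0], var=[1.0, 1.0])
```

Unlike a table lookup of a code probability, this value must be computed from the feature vector for every state and every frame.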
In either method, the final recognition result is the word w whose model λ^w gives the maximum likelihood for the input speech signal Y, provided that an HMM λ^w has been prepared for each word w in the range w=1 to W.
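The likelihood calculation and the decision step can be sketched for the discrete case as follows. The sketch assumes that one likelihood sums the joint probability over all state series (the forward algorithm) and another takes the maximum over state series (the Viterbi algorithm); since the equations are not reproduced in this text, this correspondence is an assumption. The models, word labels, and function names are hypothetical.

```python
import numpy as np

def forward_likelihood(pi, A, B, O):
    """Sum of P(O, X | lambda) over all state series X (forward algorithm)."""
    I = A.shape[0]
    alpha = pi * B[:, O[0]]
    for code in O[1:]:
        alpha = (alpha @ A[:, :I]) * B[:, code]
    return float(alpha @ A[:, I])          # transition into the final state

def viterbi_likelihood(pi, A, B, O):
    """Maximum of P(O, X | lambda) over all state series X (Viterbi)."""
    I = A.shape[0]
    delta = pi * B[:, O[0]]
    for code in O[1:]:
        delta = (delta[:, None] * A[:, :I]).max(axis=0) * B[:, code]
    return float((delta * A[:, I]).max())

def recognize(models, O):
    """Return the word whose model gives the maximum likelihood for O."""
    scores = {w: forward_likelihood(pi, A, B, O) for w, (pi, A, B) in models.items()}
    return max(scores, key=scores.get), scores

# Two hypothetical word models sharing structure but not code probabilities.
pi = np.array([1.0, 0.0])
A = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5]])
models = {
    "a": (pi, A, np.array([[0.75, 0.25], [0.25, 0.75]])),
    "b": (pi, A, np.array([[0.25, 0.75], [0.75, 0.25]])),
}
O = [0, 0, 1]
word, scores = recognize(models, O)
best_path_score = viterbi_likelihood(*models["a"], O)
```

For the continuous case, the table lookup `B[:, code]` would be replaced by the density of the feature vector in each state.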
In this prior art, a model that converts the input feature vector into a code is called a discrete probability distribution type HMM, or discrete HMM for short. A model that uses the input feature vector directly is called a continuous probability distribution type HMM, or continuous HMM for short.
FIG. 4 is a conceptual diagram of a discrete HMM. Block 401 denotes an ordinary Markov chain, in which the transition probability a_ij from state q_i to state q_j is defined. Block 402 is a signal source group consisting of D_1, D_2, . . . , D_M, each generating vectors according to a certain probability distribution; the signal source D_m is designed to generate vectors that are coded into code m. Block 403 denotes a signal source changeover switch, which selects the output of signal source D_m according to the occurrence probability b_im of code m in state q_i, and the output vector of the selected signal source is observed as the output of the model. In the discrete HMM, the observed vector y(t) is converted into the code corresponding to the closest centroid, and the signal source is selected according to the resulting code series. The occurrence probability of the observation vector series Y from this HMM is then calculated by regarding it as the probability of the selected source series.
FIG. 5 is a conceptual diagram of a continuous HMM. Block 501 is an ordinary Markov chain of the same type as Block 401, in which the transition probability from state q_i to state q_j is defined. Block 502 is a signal source group for generating vectors according to a certain probability distribution corresponding to each state of the HMM, and signal source i is assumed to generate a vector in state q_i of the HMM. Block 503 is a signal source changeover switch, which selects the output of signal source i in state q_i, and the output vector of the selected signal source is observed as the output of the model.
In the continuous HMM, the degree of occurrence of an observation vector in each state is given by the probability density function defined for that state, which requires an enormous number of computations, although the recognition accuracy is high. In the discrete HMM, by contrast, calculating the likelihood of a model for an observation code series requires only a few computations, because the occurrence probability b_im of code m (m=1, . . . , M) in each state can be read out from a memory device in which it has been stored in advance for each code. On the other hand, recognition accuracy is impaired by the error associated with quantization. To avoid this, the number of codes M (corresponding to the number of clusters) must be increased. However, the greater the number of clusters, the greater the number of training patterns needed to estimate b_im accurately. If the number of training patterns is insufficient, the estimated value of b_im often becomes 0, and correct estimation may not be achieved.
The estimation error in the latter case arises, for example, as follows. A codebook is created by converting the voices of multiple speakers into feature vector series for all words to be recognized, clustering the feature vectors, and assigning a code to each cluster. Each cluster has a representative vector called the centroid. The centroid is usually the expected value (mean) of the vectors classified into that cluster. In a codebook, such centroids are stored in a form retrievable by the code.
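Codebook creation by clustering can be sketched as follows. The text does not name a clustering algorithm, so the sketch assumes simple k-means style iterations with squared Euclidean distance and caller-supplied initial centroids; the function name and data are hypothetical.

```python
import numpy as np

def build_codebook(vectors, init_centroids, n_iter=10):
    """Cluster feature vectors and return (centroids, cluster labels).

    Uses k-means style updates, which is an assumption of this sketch;
    each centroid is the mean of the vectors assigned to its cluster.
    """
    centroids = np.array(init_centroids, dtype=float)
    for _ in range(n_iter):
        d = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for m in range(len(centroids)):
            members = vectors[labels == m]
            if len(members) > 0:           # keep the old centroid if a cluster is empty
                centroids[m] = members.mean(axis=0)
    return centroids, labels

# Hypothetical training vectors forming two well-separated clusters.
vectors = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 14.0]])
codebook, labels = build_codebook(vectors, init_centroids=[[0.0, 0.0], [10.0, 10.0]])
```

The row index of each centroid serves as the code, so the array itself is a codebook retrievable by code.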
Suppose the recognition vocabulary contains the word "Osaka," and consider creating a model corresponding to that word. Speech samples of "Osaka" uttered by many speakers are converted into feature vector series. Each feature vector is compared with the centroids. The closest centroid is the quantized version of that feature vector, and the corresponding code is the coded output for that feature vector. In this way, each speech sample corresponding to "Osaka" is converted into a code series. By estimating the HMM parameters so that the likelihood for these code series is maximized, the discrete HMM for the word "Osaka" is established. For this estimation, the known Baum-Welch method or other similar methods may be employed.
In this case, among the codes in the codebook, there may be codes that do not appear in the training code series corresponding to the word "Osaka." The occurrence probability of such codes is estimated to be "0" when the model corresponding to "Osaka" is trained. Even when the codes differ, however, the input feature vector before conversion into a code may be very close to the speech samples used in training the model, close enough to be recognized as "Osaka" at the vector stage. Thus, although the same word is spoken, a slight difference can cause the input to be converted into a completely different code despite the similarity at the vector stage, and it is easy to see that this adversely affects recognition accuracy. The greater the number of clusters M, and the smaller the amount of training data, the more likely such problems are to occur.
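The zero-probability problem can be seen in a deliberately simplified sketch. It estimates code occurrence probabilities for a single state by relative frequency, which stands in for the Baum-Welch estimation mentioned above and is not that method; the function name and the code series are hypothetical.

```python
import numpy as np

def estimate_emission(code_series, M):
    """Relative-frequency estimate of code probabilities for one state.

    A single-state simplification used only for illustration.
    """
    counts = np.bincount(np.concatenate(code_series), minlength=M)
    return counts / counts.sum()

# Hypothetical training code series; code 3 never appears in them.
training = [[0, 1, 1, 2], [0, 1, 2, 2]]
b = estimate_emission(training, M=4)

# An input whose second-to-last frame falls just across a cluster
# boundary and is therefore coded as 3.
test_series = [0, 1, 3, 2]
likelihood = float(np.prod(b[test_series]))
```

A single unseen code drives the whole product to zero, however well the remaining frames match the model.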
To eliminate this defect, the codes that do not appear in the training set must be handled by smoothing, complementing, or the like. Various methods have been proposed, including a method of decreasing the number of parameters to be estimated by so-called "tying," a method of replacing the "0" probability with a small minimum value, and a method of blurring the cluster boundaries, such as fuzzy vector quantization. Among them, the HMM based on fuzzy vector quantization has few heuristic elements, is theoretically clear, and can be realized algorithmically; the conventional proposal, however, was only approximate in a mathematical sense.
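Of the methods listed, replacing the "0" probability with a minimum value is the simplest to illustrate. The following sketch floors each probability and renormalizes; the floor value `eps` and the function name are hypothetical choices, not values taken from the text.

```python
import numpy as np

def floor_probabilities(b, eps=1e-4):
    """Replace probabilities below eps with eps, then renormalize to sum to 1."""
    b = np.maximum(np.asarray(b, dtype=float), eps)
    return b / b.sum()

# Hypothetical estimate in which code 3 was never observed in training.
b_floored = floor_probabilities([0.25, 0.375, 0.375, 0.0])
```

The choice of `eps` is heuristic, which is one reason the text contrasts this approach with fuzzy vector quantization.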
Speech recognition for unspecified speakers involves a further problem. Model parameters a_ij, b_im, etc. are estimated as "mean values" from multiple training patterns of multiple speakers. Therefore, owing to the variance caused by individual differences, the spectral distribution of each phoneme broadens, and the spectra of mutually different phonemes overlap, which can make it difficult to separate the categories. For example, the utterance pattern of the word "Wakayama" by speaker A is clearly distinguished from the utterance pattern of "Okayama" by the same speaker A, but may be hardly distinguishable from the utterance pattern of "Okayama" spoken by another speaker B. This phenomenon is one of the causes that make it difficult to recognize the speech of unknown speakers.