1. Field of the Invention
The present invention relates to an HMM generator, HMM memory device, likelihood calculating device and recognizing device for a novel HMM (Hidden Markov Model) that is applicable to such pattern recognitions as speech recognition.
2. Related Art of the Invention
Although an HMM is generally applicable to the time-series signal processing field, for convenience of explanation it is described below in relation to speech recognition. A speech recognizing device using HMM will be described first. FIG. 1 is a block diagram of a speech recognizing device using HMM. A speech analyzing part 201 converts input sound signals to feature vectors at a constant time interval (called a frame) of, for example, 10 msec, by means of a conventional method such as a filter bank, Fourier transformation or LPC analysis. Thus, the input signals are converted to a feature vector series Y = (y(1), y(2), ..., y(T)), where T is the number of frames. A codebook 202 holds labeled representative vectors. A vector quantizing part 203 substitutes, for each vector of the series Y, the label of the closest representative vector registered in the codebook 202. An HMM generating part 204 generates, from training data, an HMM corresponding to each word in the recognition vocabulary. That is, to generate an HMM corresponding to a word v, an HMM structure (the number of states and the transition rules permitted between them) is first appropriately designated, and the state transition probabilities of the model and the incidence probabilities of labels occurring in accordance with the state transitions are estimated from label series obtained from a multiplicity of vocalizations of the word v, such that the incidence probability of those label series is maximized. An HMM memory part 205 stores the HMMs thus obtained for the respective words. A likelihood calculating part 206 calculates the likelihood of each model stored in the HMM memory part 205 to the label series of an unknown input. A comparison and determination part 207 determines, as the recognition result, the word corresponding to the model that gives the highest likelihood.
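The flow of FIG. 1 can be outlined in a short sketch. This is illustrative only: the function names (quantize, recognize) are hypothetical, and a caller-supplied likelihood function stands in for parts 204 through 206.

```python
import numpy as np

def quantize(features, codebook):
    """Part 203: replace each feature vector by the label (index) of the
    closest representative vector held in the codebook (part 202)."""
    return [int(np.argmin([np.sum((f - c) ** 2) for c in codebook]))
            for f in features]

def recognize(features, codebook, models, likelihood):
    """Parts 206-207: score every word model against the label series and
    return the word whose model gives the highest likelihood."""
    labels = quantize(features, codebook)
    return max(models, key=lambda v: likelihood(models[v], labels))
```

The `likelihood` argument would be one of the formulas developed below; here it is left abstract.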
More specifically, recognition by HMM is performed in the following manner. Let O = (o(1), o(2), ..., o(T)) be the label series obtained for an unknown input, let λv be the model corresponding to a word v, and let X = (x(1), x(2), ..., x(T)) be a state series of length T generated by the model λv. The likelihood of λv to the label series O is then defined as:
[Exact Solution]

  L1(v) = Σ_X P(O, X | λv)    [formula 1]

[Approximate Solution]

  L2(v) = max_X [ P(O, X | λv) ]    [formula 2]

or logarithmically as:

  L3(v) = max_X [ log P(O, X | λv) ]    [formula 3]
where P(O, X | λv) is the joint occurrence probability of O and X in model λv.
Therefore, in the following expression using formula 1, for example:

  v^ = argmax_v [ L1(v) ]    [formula 4]

v^ is the recognition result. Formulae 2 and 3 can be used in the same manner.
P(O, X | λ) can be obtained in the following manner.
When the incidence probability bi(o) of a label o in state qi and the transition probability aij from state qi (i = 1~I) to state qj (j = 1~I+1) are given for each state qi of the HMM, the joint probability that the HMM λ generates the state series X = (x(1), x(2), ..., x(T)) and the label series O = (o(1), o(2), ..., o(T)) is defined as:

  P(O, X | λ) = π_x(1) Π_{t=1..T} a_{x(t)x(t+1)} Π_{t=1..T} b_{x(t)}(o(t))    [formula 5]
where π_x(1) is the initial probability of state x(1). Incidentally, x(T+1) = I+1 is the final state, and it is assumed that no label is generated there.
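On a small model, formulas 1, 2 and 5 can be checked directly by enumerating every state series. This is a sketch only: it omits the explicit final state x(T+1) = I+1 for brevity, and practical systems use the forward and Viterbi recursions rather than enumeration.

```python
import itertools

def joint_prob(O, X, pi, a, b):
    """P(O, X | lambda) in the spirit of formula 5: the initial probability,
    then alternating label-incidence and transition terms (final state
    omitted in this sketch)."""
    p = pi[X[0]]
    for t in range(len(O)):
        p *= b[X[t]][O[t]]
        if t + 1 < len(X):
            p *= a[X[t]][X[t + 1]]
    return p

def exact_likelihood(O, pi, a, b):
    """Formula 1: sum of P(O, X | lambda) over all state series X."""
    states = range(len(pi))
    return sum(joint_prob(O, X, pi, a, b)
               for X in itertools.product(states, repeat=len(O)))

def approx_likelihood(O, pi, a, b):
    """Formula 2: the probability of the single best state series."""
    states = range(len(pi))
    return max(joint_prob(O, X, pi, a, b)
               for X in itertools.product(states, repeat=len(O)))
```

By construction the exact solution is always at least as large as the approximate one, since the best path's probability is one term of the sum.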
In the above example, the input feature vectors y are converted to labels; alternatively, the feature vector y can be used directly, in which case an incidence probability (density) of the vector y is given for each state. In this case, a probability density bi(y) of the feature vector y is used in place of the incidence probability bi(o) of the label o in state qi. Hereinafter, when z is a label, bi(z) denotes the probability that z is generated in state i, and when z is a vector, bi(z) denotes the probability density of z in state i. The formulae 1, 2 and 3 are then expressed as:
[Exact Solution]

  L1′(v) = Σ_X P(Y, X | λv)    [formula 6]

[Approximate Solution]

  L2′(v) = max_X [ P(Y, X | λv) ]    [formula 7]

or logarithmically as:

  L3′(v) = max_X [ log P(Y, X | λv) ]    [formula 8]
Thus, in any of these methods, when an HMM λv is prepared for each word v (v = 1~V), the final recognition result for an input sound signal Y is:

  v^ = argmax_v [ P(Y | λv) ]    [formula 9]

or

  v^ = argmax_v [ log P(Y | λv) ]    [formula 10]
where Y is, of course, the input label series, feature vector series or the like, according to the respective method.
In such conventional examples, the method of converting input feature vectors to labels is hereinafter referred to as the discrete probability distribution HMM, and the method of using the input feature vectors as they are is referred to as the continuous probability distribution HMM. Features of these are described below.
It is an advantage of the discrete probability distribution HMM that fewer calculations are needed when calculating the likelihood of a model to an input label series, because the incidence probability bi(Cm) of a label in state i can be obtained simply by reading it from a memory device which prestores the incidence probabilities in relation to the labels. However, its recognition accuracy is inferior owing to errors associated with quantization, which creates a problem. To prevent this problem it is necessary to increase the number of labels (the number of clusters), but the number of learning patterns required for training the models then becomes accordingly large. If the number of learning patterns is insufficient, bi(Cm) may frequently be 0, and correct estimation cannot be obtained. For example, the following case may occur.
In the preparation of a codebook, speeches vocalized by multiple speakers for all words to be recognized are converted to feature vector series, the set of feature vectors is clustered, and the clusters are respectively labeled. Each cluster has a representative vector called a centroid, which is generally the expected value of the vectors classified into the cluster. A codebook is the set of centroids stored in a form retrievable by their labels.
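The codebook construction just described can be sketched with a few iterations of the standard k-means procedure. This is one common clustering choice, not one prescribed by the text, and the function name and iteration count are assumptions.

```python
import numpy as np

def build_codebook(vectors, n_labels, iters=10, seed=0):
    """Cluster training vectors and return one centroid per label; the
    centroid is the expected value of the vectors in its cluster."""
    rng = np.random.default_rng(seed)
    # initialize centroids from randomly chosen training vectors
    centroids = vectors[rng.choice(len(vectors), n_labels, replace=False)]
    for _ in range(iters):
        # assign each vector to its nearest centroid (the cluster's label)
        d = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # each centroid becomes the mean of the vectors classified to it
        for k in range(n_labels):
            if (labels == k).any():
                centroids[k] = vectors[labels == k].mean(axis=0)
    return centroids
```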
Now, assume that a word "Osaka", for example, is present in the recognition vocabulary, and a model corresponding to it is to be prepared. Voice samples of the word "Osaka" vocalized by multiple speakers are converted to feature vector series, each feature vector is compared with the centroids, and the label of the closest centroid is chosen as the vector-quantized value of that feature vector. In this way, the voice samples of the word "Osaka" are converted to label series. By estimating the HMM parameters from the resultant label series in such a manner that the likelihood to the label series is maximized, a model corresponding to the word "Osaka" is obtained. For the estimation, a method known as the Baum-Welch algorithm can be used.
In this case, some of the labels in the codebook might not appear in the learning label series corresponding to the word "Osaka". The incidence probability of such absent labels is estimated as 0 during the learning process. It is nevertheless quite possible that labels not included in the label series used for modeling the word "Osaka" will be present in a label series to which a vocalization of the word "Osaka" is converted at recognition time. In such a case, the incidence probability of that label series from the model of the word "Osaka" comes to be 0. Yet such a vocalization, though different at the label level, may be relatively close at the feature-vector level to a voice sample used in model learning, and close enough to be recognized as "Osaka" in vector terms. In other words, even when the same word is vocalized, a slight difference at the vector level can convert it to an entirely different label, and it is easily predicted that this adversely affects recognition accuracy. The larger the number of clusters and the smaller the amount of training data, the more frequently this problem occurs.
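The failure mode described above can be reproduced in a few lines. The relative-frequency estimator here is a deliberate simplification of Baum-Welch, used only to show the effect of an unseen label.

```python
from collections import Counter

def label_probs(training_labels, n_labels):
    """Estimate label incidence probabilities by relative frequency; any
    label absent from the training series gets probability exactly 0."""
    counts = Counter(training_labels)
    total = len(training_labels)
    return [counts[m] / total for m in range(n_labels)]

b = label_probs([0, 1, 1, 0], n_labels=3)  # label 2 never occurs in training
# A test series containing label 2 scores 0, however close its feature
# vectors were to the training data before quantization.
test_likelihood = b[0] * b[2]
```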
In order to eliminate this problem, smoothing and interpolation are required for labels that do not appear in the training set. Various methods have been suggested for such smoothing and interpolation: reducing the number of parameters by using the concept called "tied states", substituting a small value for a probability estimated to be 0 instead of leaving it at 0, and blurring the boundaries of clusters as in fuzzy vector quantization. None of them, however, fundamentally solves the problem. In addition, these methods involve elements that must be determined empirically for each particular case, and no theoretical guideline for their determination has been suggested.
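As a minimal sketch of the substitution approach mentioned above, zero estimates can be replaced by a small floor and the distribution renormalized. The floor value is exactly the kind of empirically chosen element the text criticizes.

```python
def floor_probs(probs, floor=1e-4):
    """Replace zero (or tiny) probability estimates by a small floor,
    then renormalize so the distribution still sums to 1."""
    floored = [max(p, floor) for p in probs]
    total = sum(floored)
    return [p / total for p in floored]
```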
On the other hand, in the continuous probability distribution HMM, a distribution profile for each state is given beforehand in the form of a function such as a normal distribution, and the parameters defining the function are estimated from the learning data. The number of parameters to be estimated is therefore smaller, the parameters can be estimated accurately from fewer learning patterns than the discrete type requires, smoothing and interpolation become unnecessary, and it is reported that a recognition ratio higher than that of the discrete type is generally obtained.
For example, when the number of parameters is compared between the discrete and continuous types for an HMM of a 4-state, 3-loop arrangement as shown in FIG. 4, the following result is obtained. In the case of the discrete type, if 256 types of labels are used, 256 × 3 = 768 parameters for the incidence probabilities of labels and 6 for the transition probabilities result; thus, 774 in total are required for one model. In the case of the continuous type, with a 10-dimensional normal distribution, 10 × 3 = 30 parameters for the average vectors, 55 × 3 = 165 for the variance-covariance matrices (because each matrix is symmetric) and 6 for the transition probabilities result; thus, a total of 201 are required. The number of parameters to be estimated in the continuous type is therefore approximately 1/4 or less of that in the discrete type.
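The counts above can be verified arithmetically (3 emitting states, 256 labels, 10-dimensional features):

```python
n_states, n_labels, dim = 3, 256, 10
n_transitions = 6

# discrete type: one incidence probability per label per state, plus transitions
discrete = n_labels * n_states + n_transitions        # 768 + 6 = 774

# continuous type: mean vectors plus symmetric covariance matrices
means = dim * n_states                                # 10 x 3 = 30
covariances = dim * (dim + 1) // 2 * n_states         # 55 x 3 = 165
continuous = means + covariances + n_transitions      # 201
```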
However, the number of calculations is significantly larger in the continuous type than in the discrete type, so although the continuous type is superior in recognition accuracy, a problem still remains. That is, if the incidence probability (density) of an input feature vector y(t) in state i is given by a normal distribution with average vector μi and variance-covariance matrix Σi, a calculation of (y(t) − μi)^T Σi^−1 (y(t) − μi) is required. In the case of a 10-dimensional continuous HMM, for example, 110 multiplications are required for this calculation alone, and it is repeated (the number of states × the number of input frames) times for one model. Therefore, when the number of input frames is taken to be 50, the number of multiplications required for the calculation of (y(t) − μi)^T Σi^−1 (y(t) − μi) is 110 × 3 × 50 = 16,500 per model. If the number of words is 500, this is further multiplied by 500, which means that 8,250,000 multiplications are required for this portion alone.
In the case of the discrete type, once the calculations for vector quantization are completed, it is only necessary to read the incidence probability of each label from the memory device, as described above. In the above example, the calculations required for the vector quantization of y(t) are those of the distance or similarity between y(t) and each of the 256 representative vectors. If the squared Euclidean distance is used, the labeling of y(t) requires 256 sets of 10 subtractions, 10 multiplications and 10 additions. Therefore, in the case of 50 frames, for multiplications alone, 10 × 256 × 50 = 128,000 calculations must be performed. If the vector quantization is performed by a method called binary search, the figure 256 is replaced by 2 log2 256 = 16, and the number of calculations comes to 10 × 16 × 50 = 8,000.
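The operation counts in the last two paragraphs check out as follows (10-dimensional features, 3 states, 50 frames, 256 labels, 500 words):

```python
dim, n_states, n_frames, n_labels, n_words = 10, 3, 50, 256, 500

# continuous type: the quadratic form (y - mu)^T Sigma^-1 (y - mu) costs
# about dim * (dim + 1) = 110 multiplications, repeated per state per frame
quad_form = dim * (dim + 1)                   # 110
per_model = quad_form * n_states * n_frames   # 16,500
vocabulary_total = per_model * n_words        # 8,250,000

# discrete type: vector quantization only, independent of vocabulary size
full_search = dim * n_labels * n_frames       # 128,000 multiplications
binary_search = dim * 16 * n_frames           # 2 * log2(256) = 16 -> 8,000
```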
Accordingly, the number of calculations is remarkably reduced in the discrete type. Moreover, whereas the number of calculations in the continuous type increases in proportion to the number of recognition words, in the discrete type the distance calculations are required only once, for the vector quantization of the input sound signal, and their number is unchanged even when the number of recognition words is increased.
In summary, the discrete type has a problem in recognition accuracy although its number of calculations is small, while the continuous type has a problem in the number of calculations although its recognition accuracy is superior.
Hence, in the light of such problems in conventional HMM, it is an object of the present invention to provide an HMM generator, HMM memory device, likelihood calculating device and recognizing device capable of providing superior recognition accuracy and reducing the number of calculations.
An HMM generator of the present invention comprises:
vector quantizing means for quantizing vectors of a training pattern having a vectorial series, and converting the vectors into a label series of clusters to which they belong,
continuous distribution probability density HMM generating means for generating a continuous distribution probability density HMM from a quantized vector series corresponding to each label of the label series, and
label incidence calculating means for calculating incidence of the labels in each state from the training vectors classified in the same clusters and the continuous distribution probability density HMM.
A likelihood calculating device of the present invention comprises:
the above mentioned vector quantizing means for converting the vector series to a label series by substituting labels for vectors of a feature vector series that constitutes an input pattern, and
likelihood calculating means for calculating, from the state transition probabilities and label incidences stored in the HMM memory device, the likelihood of the HMM described by the parameters stored in the HMM memory device to the input pattern.
In an HMM generator according to the present invention, the vector quantizing means quantizes the vectors of a training pattern having a vector series and converts them to a label series of the clusters to which they belong; the continuous distribution probability density HMM generating means generates a continuous distribution probability density HMM from the quantized vector series corresponding to the labels of the label series; and the incidence of each label in each state is calculated from the training vectors classified into the same clusters and the continuous distribution probability density HMM.
Additionally, in a likelihood calculating device according to the present invention, an input vector series is converted to a label series by the vector quantizing means substituting labels for the vectors of the feature vector series constituting the input pattern, and the likelihood of the HMM to the input pattern is calculated from the label incidence in each state of the HMM generated by the HMM generator.
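One way to picture the label incidence calculating means is sketched below. This is an illustrative reading only, not the patent's prescribed formula: the incidence of each label in a state is approximated by accumulating the state's continuous density over the training vectors classified into that label's cluster, then normalizing across labels.

```python
def label_incidence(clusters, density):
    """clusters: per-label lists of training vectors; density: the state's
    continuous probability density function. Returns one normalized
    incidence value per label for this state (illustrative guess)."""
    mass = [sum(density(y) for y in cluster) for cluster in clusters]
    total = sum(mass)
    return [m / total for m in mass]
```

Under this reading, a label whose cluster lies where the state's density is high receives a large incidence even if the label itself never occurred in training, which is the smoothing effect the discrete type lacks.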