1. Field of the Invention
The present invention relates to a pattern recognition scheme in which feature parameters (feature vectors) of each recognition category are modelled by probabilistic models, and recognition of input data is realized by obtaining a probability of each probabilistic model with respect to an input feature vector sequence. This pattern recognition scheme can be used for automatic speech recognition, automatic character recognition, automatic figure recognition, etc.
2. Description of the Background Art
A pattern recognition scheme using probabilistic models based on probabilistic and statistical methods is a useful technique in the pattern recognition of speech, characters, figures, etc. In the following, the prior art using the hidden Markov model (abbreviated hereafter as HMM) will be described for an exemplary case of speech recognition.
In the conventional speech recognition, modelling each recognition target speech unit (phoneme, syllable, word, etc.) by an HMM in advance is the mainstream of present-day speech recognition. FIG. 1 shows an exemplary conventional speech recognition apparatus using HMMs. In this conventional speech recognition apparatus of FIG. 1, a speech entered from an input terminal 21 is converted into digital signals by an A/D conversion unit 22. Then, speech feature vectors are extracted from the digital signals at a speech feature vector extraction unit 23. Then, the HMMs generated in advance for each speech unit (phoneme, syllable, word, etc.) to be recognized are read out from an HMM memory 24, and a probability of each model with respect to the input speech is calculated at a model probability calculation unit 25. Then, a speech unit represented by a model with the highest probability is outputted from a recognition result output unit 26 as a recognition result.
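The recognition flow described above can be sketched as follows. This is an illustrative toy implementation, not taken from the patent: the model objects, their `log_probability` interface, and the label names are all hypothetical, and each model is assumed to expose a function scoring a feature vector sequence.

```python
import math

# Minimal sketch of the FIG. 1 recognition flow (hypothetical interface):
# each speech-unit model is a function returning a log probability for the
# input feature vector sequence; the unit with the highest score wins.
def recognize(feature_vectors, models):
    """models: dict mapping speech-unit label -> log-probability function."""
    best_label, best_score = None, -math.inf
    for label, log_probability in models.items():
        score = log_probability(feature_vectors)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

The per-model scoring function would correspond to the model probability calculation unit 25, while the argmax over labels corresponds to the recognition result output unit 26.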
In the statistical (or probabilistic) model such as HMM, there is a trade-off between the degree of freedom of a model (a model's expressive power or a number of parameters expressed by a model) and an amount of training data.
Namely, if the degree of freedom is increased while the amount of training data is small, the model would express some features which are not essential for the pattern recognition. As a result, a recognition error would be caused even for data which are only slightly different from the training data. For example, suppose that observation of the training data reveals that a certain region of data is missing, so that there is a valley in a part of the probability distribution. When the amount of training data is small, it is highly likely that this valley arises simply because data in that region were not observed. Consequently, a higher recognition precision can be realized by lowering the degree of freedom of the model and applying smoothing according to the surrounding data, rather than expressing this valley precisely by using a high degree of freedom model. On the other hand, if the degree of freedom of the model is low despite a large amount of data being available, it would not be possible to obtain a sufficient expressive power and realize a high recognition performance. Therefore, there is a need to express the statistical model with a degree of freedom that is appropriate in view of the amount of training data.
Conventionally, in the field of speech recognition, models with a rather low degree of freedom have been used because the amount of training speech data has been insufficient. But, in recent years, in conjunction with the expansion of the available amount of training data, there is a trend toward generating models with a higher recognition performance by using a higher degree of freedom. For example, there is an acoustic model which is built from speech data acquired from ten thousand speakers.
However, no technique for dealing with such a change in the amount of available training data has been developed so far, and the only available technique has been a simple extension of the conventional technique (such as an increase in the number of parameters). Therefore there has been a demand for a new modelling method that can deal with a case where the amount of training data is abundant.
FIG. 2 shows an example of the HMM with three states. A model like this is generated for each speech unit (recognition category, that is, phoneme, syllable, word or sentence). To states S1 to S3, the probability distributions D1 to D3 of the speech feature vectors are assigned respectively. For example, when this is a phoneme model, the first, second and third states express the probability distributions of the feature vectors around a starting point of a phoneme, around a center of a phoneme, and around an ending point of a phoneme, respectively.
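A model such as that of FIG. 2 is typically scored with the standard forward algorithm. The following is an illustrative sketch, not taken from the patent: the left-to-right transition structure, the log-domain helper, and the per-state output function are all hypothetical, with probabilities kept in the log domain for numerical stability.

```python
import math

# Log-sum-exp of two log-domain values, guarding against -inf.
def log_add(a, b):
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

# Forward algorithm for a left-to-right HMM like FIG. 2 (illustrative):
# log_trans[i][j] is the log transition probability from state i to j,
# and log_output(state, obs) is the log output probability of each state.
def forward_log_prob(observations, log_output, log_trans):
    n = len(log_trans)
    alpha = [-math.inf] * n
    alpha[0] = log_output(0, observations[0])  # start in the first state
    for obs in observations[1:]:
        new = [-math.inf] * n
        for j in range(n):
            for i in range(n):
                new[j] = log_add(new[j], alpha[i] + log_trans[i][j])
            new[j] += log_output(j, obs)
        alpha = new
    return alpha[n - 1]  # sequence must end in the last state
```

The state output probabilities supplied by `log_output` are exactly the per-state feature parameter distributions whose expression methods are discussed below.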
The present invention relates to a method for expressing the distribution for the feature vectors (feature parameter distribution) in each state, so that the prior art techniques regarding such an expression method will now be described.
FIG. 3 shows an exemplary feature parameter distribution. For the sake of simplicity, it is assumed that the feature vector is expressed by a two-dimensional vector. A region in which feature vectors of the training data for a certain recognition category exist is shown as a shaded region. In addition, the feature parameter distribution of FIG. 3 actually has a three-dimensional distribution shape in which a portion with many feature vectors appears as a mountain, so that the entire distribution shape appears as a range of mountains with a plurality of peaks. In practice, the feature vectors have 30 or so dimensions and the distribution shape is very complicated.
One example of a method for expressing the distribution is a discrete distribution expression based on the vector quantization, which will now be described with reference to FIG. 4. In FIG. 4, the distribution is expressed by arranging representative points (vector quantization points, indicated by dots in FIG. 4) in a feature vector space (represented as a two-dimensional space in FIG. 4), and changing a probability value (weight coefficient) for each representative point. At a time of recognition, the probability value is obtained by carrying out distance calculations to find the quantization point that is closest to an input feature vector.
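The nearest-point lookup just described can be sketched as follows. This is an illustrative implementation with hypothetical data, assuming a codebook of quantization points and a trained weight per point:

```python
# Discrete distribution expression of FIG. 4 (illustrative sketch):
# the input vector is assigned to its nearest codebook entry, and that
# entry's trained probability value (weight coefficient) is returned.
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def discrete_probability(feature_vector, codebook, weights):
    """codebook: list of quantization points; weights: probability per point."""
    nearest = min(range(len(codebook)),
                  key=lambda k: squared_distance(feature_vector, codebook[k]))
    return weights[nearest]
```

Note that the returned probability is that of the nearest point, not of the input vector itself, which is the quantization error mentioned below.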
When the distribution for each recognition category is expressed by using a common quantization point set for all the recognition categories and changing only the probability values, it suffices to carry out the calculation for finding the quantization point closest to the input feature vector only once. However, as shown in FIG. 5, even when four vector quantization points share a part of their elements in dimension-1 and dimension-2, for example, these four vector quantization points must be treated independently, so that the efficiency of the expression is poor. At a time of training, it is necessary to count the number of training data allocated to each quantization point, so that there is a problem that a huge number of quantization points must be arranged and a large amount of training data must be used in order to realize an accurate distribution expression. In addition, there is also a problem that an error between the input feature vector and the vector quantization point allocated thereto can lower the accuracy.
Another example of a method for expressing the distribution is a continuous distribution expression based on the multi-dimensional diagonal Gaussian distribution, which will now be described with reference to FIG. 6. In FIG. 6, the distribution in the multi-dimensional space is expressed in a product space of the Gaussian distributions of the respective dimensions, using Gaussian distributions as marginal distributions and assuming that there is no correlation among dimensions. The Gaussian distribution is a parametric distribution that can be expressed by a mean and a variance, so that it is possible to expect an effect of smoothing the distribution shape and providing a limited degree of freedom. However, the distribution expressed in the product space is a distribution whose axes are parallel to the axis of each dimension, so that there is a problem that a distribution such as that shown in FIG. 3 cannot be expressed. Also, the Gaussian distribution is a single-peak distribution, so that one distribution can only express one peak.
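Under the no-correlation assumption, the joint density factorizes into a product of one-dimensional Gaussians, i.e. a sum in the log domain. A minimal sketch, with hypothetical parameter names:

```python
import math

# Diagonal Gaussian of FIG. 6 (illustrative sketch): with no correlation
# among dimensions, the joint log density is the sum over dimensions of
# one-dimensional Gaussian log densities, each with its own mean/variance.
def diagonal_gaussian_log_density(o, mean, variance):
    log_p = 0.0
    for o_p, mu_p, var_p in zip(o, mean, variance):
        log_p += -0.5 * (math.log(2.0 * math.pi * var_p)
                         + (o_p - mu_p) ** 2 / var_p)
    return log_p
```

Because each per-dimension factor depends on one coordinate only, the resulting equal-density contours are ellipses aligned with the coordinate axes, which is precisely why a tilted distribution cannot be expressed.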
In order to resolve this problem, there is an expression method using a mixture distribution of the multi-dimensional diagonal Gaussian distributions (which will be referred to as a continuous mixture distribution hereafter), which will now be described with reference to FIG. 7. In FIG. 7, the distribution to be expressed is divided into a plurality of regions and each region is expressed by the Gaussian distribution. This is the currently most popular method. The recognition scheme utilizing the multi-dimensional continuous mixture distribution is disclosed in U.S. Pat. No. 4,783,804, for example.
However, even in this method, when the distribution to be expressed has a very complicated shape and many peaks, there is a problem that it is necessary to arrange at least as many distributions as a number of peaks to be expressed. An increase in a number of distributions leads to an increase in an amount of calculation.
An output probability $b_i(o_t)$ for an input feature vector $o_t = (o_{t1}, o_{t2}, \ldots, o_{tP})$ (where $P$ is a total number of dimensions) at a time $t$ in the mixture Gaussian distribution type HMM of a state $i$ can be given by:

$$b_i(o_t) = \sum_{m=1}^{M} w_{i,m}\,\phi_{i,m}(o_t) \tag{1}$$

where $w_{i,m}$ is a weight coefficient for the $m$-th multi-dimensional Gaussian distribution of a state $i$. The probability density for the multi-dimensional Gaussian distribution $m$ is given by:

$$\phi_{i,m}(o_t) = \frac{1}{(2\pi)^{P/2}\,|\Sigma_{i,m}|^{1/2}} \exp\!\left( -\frac{1}{2}\,(o_t - \mu_{i,m})^T\,\Sigma_{i,m}^{-1}\,(o_t - \mu_{i,m}) \right) \tag{2}$$

where $\mu_{i,m}$ is a mean vector of the $m$-th multi-dimensional Gaussian distribution of a state $i$, $\Sigma_{i,m}$ is a covariance matrix of the $m$-th multi-dimensional Gaussian distribution of a state $i$, and $T$ denotes a transpose of a matrix. Assuming that the covariance matrix only has diagonal components (a diagonal covariance matrix), the log of $\phi_{i,m}(o_t)$ is given by:

$$\log \phi_{i,m}(o_t) = -\frac{P}{2}\log(2\pi) - \frac{1}{2}\sum_{p=1}^{P}\log \sigma_{i,m,p}^2 - \frac{1}{2}\sum_{p=1}^{P}\frac{(o_{tp} - \mu_{i,m,p})^2}{\sigma_{i,m,p}^2} \tag{3}$$

where $\mu_{i,m,p}$ is the $p$-th component of the mean vector of the $m$-th multi-dimensional Gaussian distribution of a state $i$, and $\sigma_{i,m,p}^2$ is the $p$-th diagonal component (variance) of the covariance matrix of the $m$-th multi-dimensional Gaussian distribution of a state $i$.
This calculation is carried out for the feature vector of each time of the input speech, with respect to the recognition candidate models, and the recognition result is outputted according to the obtained log probability.
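The output probability calculation just described can be sketched directly from equations (1) and (3). This is an illustrative implementation with hypothetical parameter containers (per-component weights, mean vectors, and diagonal variances):

```python
import math

# Mixture Gaussian output probability, equations (1) and (3) as a sketch:
# b_i(o_t) is a weighted sum over M diagonal-covariance Gaussian components;
# each component density is evaluated in the log domain per equation (3)
# and exponentiated before the weighted sum of equation (1).
def mixture_output_probability(o_t, weights, means, variances):
    b = 0.0
    for w, mu, var in zip(weights, means, variances):
        log_phi = 0.0
        for p in range(len(o_t)):
            log_phi += -0.5 * (math.log(2.0 * math.pi * var[p])
                               + (o_t[p] - mu[p]) ** 2 / var[p])
        b += w * math.exp(log_phi)
    return b
```

The inner loop over the $P$ dimensions runs once per mixture component, which makes plain why the cost grows linearly with the number of components $M$.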
There is also a method which uses the scalar quantization in order to reduce the calculation time for the continuous mixture distribution (see, M. Yamada et al.: "Fast Output Probability Computation using Scalar Quantization and Independent Dimension Multi-Mixture", Proc. of ICASSP96, pp. 893-896). In this method, after the continuous mixture distribution type model is trained, a plurality of Gaussian distributions are combined into one distribution in each dimension, and the discrete distribution expression based on the scalar quantization is obtained, as shown in FIG. 8. However, the combined distribution in each dimension is nothing but a discrete expression of the original continuous mixture distribution, and the distribution shape remains at the same or a lower accuracy level.
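The speedup can be sketched as follows. This is an illustrative layout of the idea, not the cited paper's actual implementation: the table contents, the uniform quantization step, and the assumption of non-negative feature values are all hypothetical.

```python
# Scalar-quantization speedup (illustrative sketch): each dimension p has a
# precomputed table mapping a quantized value of o_{t,p} to a per-dimension
# log probability, so the output probability reduces to P table lookups and
# additions instead of per-component Gaussian evaluations.
def quantize(value, step, table_size):
    # int() truncates toward zero; non-negative feature values are assumed
    # in this sketch, with out-of-range values clamped to the table edges.
    index = int(value / step)
    return max(0, min(table_size - 1, index))

def fast_log_output_probability(o_t, tables, step):
    """tables[p][k]: precomputed log probability of quantization level k
    in dimension p."""
    return sum(tables[p][quantize(o_t[p], step, len(tables[p]))]
               for p in range(len(o_t)))
```

Summing per-dimension log probabilities in this way implicitly forms the product space over dimensions, which is the source of the over-coverage problem noted below.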
In addition, the product space is formed after the distributions are combined in each dimension, so that the combined distribution will also cover regions not belonging to the training data used for the distribution estimation, and this can cause a lowering of the recognition performance.
As described, the conventionally used Gaussian distributions are appropriate for expressing relatively simple distribution shapes. But, in recent years, in conjunction with the expansion of the training speech database, there is a need to express more complicated distribution shapes in order to obtain more accurate models. Since the Gaussian distribution has a limited degree of freedom, it is necessary to use many mixture component distributions in order to express a detailed distribution shape. For this reason, there is a problem that the number of mixture component distributions M in the above equation (1) becomes large and the amount of calculation for the output probability is increased.
For example, when a model based on 4 mixture distributions is upgraded to a model based on 32 mixture distributions, the amount of calculation is increased by a factor of 8, even though the recognition precision can be improved. Even in a typical example of the conventional speech recognition apparatus, the time required for the output probability calculation consumes 45% to 60% of the total processing time of the speech recognition, and this already time-consuming processing becomes even more computationally loaded when the number of mixture distributions increases. Thus there is a problem that the increased number of mixture distributions hinders the realization of real time processing, despite the fact that real time processing is an important factor from a viewpoint of its use as a human interface.