1. Field of the Invention
The present invention relates to a speech recognition method, and more particularly to a speech recognition method that achieves a high rate of recognition without requiring learning from a particularly large quantity of training data.
2. Description of the Prior Art
Speech signals are expressed as time series patterns of feature vectors, and speech recognition is based on the degree of matching between a reference pattern representing a known piece of speech and the pattern of the input speech signal. For these time series patterns, the Hidden Markov Model (HMM) is extensively used, as described in detail in the specifications of U.S. Pat. Nos. 4,587,670 and 4,582,180. The HMM itself will not be explained in detail here because, besides said U.S. Patents, a detailed description can be found in S. E. Levinson, "Structural Method in Automatic Speech Recognition", Proc. IEEE, Vol. 73, No. 11, 1985, pp. 1625-1650.
The HMM assumes that the time series of feature vectors is generated by a Markov probability process. The reference patterns of the HMM are represented by a plurality of statuses and transitions between the statuses. Each status outputs a feature vector according to a predetermined probability density distribution, while each transition between statuses is accompanied by a predetermined probability of transition. The likelihood, which represents the degree of matching between the input pattern and a reference pattern, is given by the probability that the Markov probability model generates the series of input pattern vectors. The probabilities of transition between statuses and the parameters defining the probability density distribution functions, which together characterize each reference pattern, can be determined with the Baum-Welch algorithm using a plurality of sets of vocalization data for training.
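As an illustration only, the likelihood described above can be sketched with the well-known forward algorithm. The sketch assumes Gaussian output densities with diagonal covariance and an explicit initial status distribution; neither is specified in the text, and the function names, model parameters, and input vectors are hypothetical.

```python
import math

def gaussian_density(x, mean, var):
    """Diagonal-covariance Gaussian density of vector x (assumed output distribution)."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)

def forward_likelihood(observations, initial, transition, means, variances):
    """Probability that the model generates the observation sequence."""
    n = len(initial)
    # alpha[j]: probability of the vectors seen so far, with the model ending in status j
    alpha = [initial[j] * gaussian_density(observations[0], means[j], variances[j])
             for j in range(n)]
    for x in observations[1:]:
        alpha = [
            sum(alpha[i] * transition[i][j] for i in range(n))
            * gaussian_density(x, means[j], variances[j])
            for j in range(n)
        ]
    return sum(alpha)

# Hypothetical two-status, left-to-right model over 1-dimensional feature vectors
initial = [1.0, 0.0]
transition = [[0.6, 0.4],
              [0.0, 1.0]]
means = [[0.0], [3.0]]
variances = [[1.0], [1.0]]

matching = [[0.1], [-0.2], [2.9], [3.1]]      # follows status 0, then status 1
mismatched = [[3.0], [3.1], [0.0], [-0.1]]    # reversed order of acoustic events
like_match = forward_likelihood(matching, initial, transition, means, variances)
like_mismatch = forward_likelihood(mismatched, initial, transition, means, variances)
print(like_match > like_mismatch)
```

For long input patterns the raw product of probabilities underflows, so a practical implementation would typically work with scaled or logarithmic values; that refinement is omitted here for brevity.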
However, the Baum-Welch algorithm, which is a statistical learning method, requires a large quantity of training data to determine the parameters of the model corresponding to reference patterns. The load of vocalization on the user is consequently extremely heavy when a speech recognition apparatus is first put into use, and this presents a serious obstacle to the practical use of such apparatuses. With a view to reducing this load, a number of speaker-adaptive methods have already been proposed to adapt a speech recognition apparatus to the speaker with a relatively small quantity of training data.
A speaker-adaptive method defines the similarity of acoustic events from reference patterns corresponding to known speech signals and from a new speaker's vocalization data for adaptation, basically using the physical distance between feature vectors as the measure. It then carries out adaptation by estimating, on the basis of that similarity, the parameters of the model corresponding to acoustic events absent from the vocalization data for adaptation.
However, such a method of adaptation, based on an estimation relying solely on physical distances, provides a somewhat higher rate of recognition than before the adaptation but is far less effective in recognition than a method using reference patterns that correspond to a specific speaker and are derived from a large quantity of speech data. (For further details, see K. Shikano, K. F. Lee and R. Reddy, "Speaker Adaptation through Vector Quantization", Proc. ICASSP-86, Tokyo, 1986, pp. 2643-2646.)
Meanwhile, as means for improving the rate of recognition, the inventors of the present invention proposed a pattern recognition method based on the prediction of the aforementioned time series patterns. In this method, multilayer perceptrons (MLP's) based on a neural network are used as predictive means for the time series patterns, and the outputs of the MLP's constitute reference patterns. The inventors named the reference patterns the "neural prediction model" (NPM). This NPM will not be described in detail here, as detailed explanations can be found in K. Iso and T. Watanabe, "Speaker-Independent Word Recognition Using a Neural Prediction Model," Proc. ICASSP-90, New Mexico, 1990, pp. 441-444 and the pending U.S. Ser. No. (07-521625). In the NPM described in these references, a predictor (MLP) in the nth status of a reference pattern model consisting of a finite status transition network calculates a predicted vector for the feature vector of the input pattern at time t from a plurality of feature vectors at time t-1 and before. The distance between this predicted vector and the feature vector of the input pattern at time t is taken as the local distance between said two feature vectors. In the NPM described in the above cited references, the squared distance or the like between the vectors is used as this local distance.
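By way of illustration, the prediction and local distance described above can be sketched as follows. The sketch assumes a predictor with a single hidden layer and a tanh activation, with the preceding feature vectors concatenated as its input; the layer sizes, weights, and function names are hypothetical and are not taken from the cited references.

```python
import math

def mlp_predict(prev_frames, w_hidden, b_hidden, w_out, b_out):
    """Predict the feature vector at time t from the feature vectors at t-1 and before."""
    # Concatenate the preceding feature vectors into one input vector
    x = [v for frame in prev_frames for v in frame]
    # Hidden layer (tanh activation is an assumption of this sketch)
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    # Linear output layer yields the predicted vector
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w_out, b_out)]

def local_distance(predicted, actual):
    """Squared distance between the predicted vector and the input feature vector."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))

# Hypothetical predictor for one status: 2 preceding frames of dimension 2,
# 3 hidden units, 2 outputs. Zero weights keep the arithmetic easy to follow.
w_hidden = [[0.0] * 4 for _ in range(3)]
b_hidden = [0.0, 0.0, 0.0]
w_out = [[0.0] * 3 for _ in range(2)]
b_out = [1.0, 2.0]

previous = [[0.5, 0.1], [0.4, 0.2]]   # feature vectors at t-1 and t-2
observed = [1.0, 4.0]                 # feature vector of the input pattern at time t
predicted = mlp_predict(previous, w_hidden, b_hidden, w_out, b_out)
print(local_distance(predicted, observed))
```

A small local distance indicates that the predictor of that status accounts well for the input frame; each status of a reference pattern model would hold its own set of weights.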