1. Field of the Invention
The present invention relates to a recognition system for recognizing a signal entered in the form of speech, an image, or the like, and more particularly to a recognition system in which Hidden Markov Models (HMMs) are used for recognition.
2. Description of the Related Art
Speech recognition based on a discrete HMM scheme has been successful in recent years. In this scheme, a speech signal is converted into a predetermined code sequence by vector quantization and recognized on the basis of the similarity between the code sequence and discrete HMMs. However, the discrete HMM scheme has a drawback in that the recognition rate is lowered by quantization errors occurring in the vector quantization.
A continuous density HMM (CDHMM) scheme has been established to reduce the quantization errors. In the speech recognition of this scheme, a speech signal is recognized by using CDHMMs provided for predetermined categories (words or phonemes). A CDHMM is defined as a transition network model composed of states each having an average vector μ(k,s) and a covariance matrix C(k,s), where k denotes a category and s denotes a state. Assume that CDHMM speech recognition is applied to a ticket vending machine in which speech signals are entered to designate destinations. In this case, words such as "TOKYO", "NAGOYA", "OSAKA", and the like correspond to the categories, and the phonemes "T", "O", "K", "Y", and "O" correspond to the states of the network model for "TOKYO". FIG. 1 shows a typical transition network model composed of N states S1, S2, . . . , SN. The initial state S1 is shown at the left end of the transition network model, and the final state SN at the right end. In this network model, each state transits to a next state with a certain probability (transition probability), and a feature vector is output with a certain probability (output probability) in each transition, except for a null transition to the same state. Such a network model is called a "Hidden" Markov Model, since only the sequence of output feature vectors is observable.
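The left-to-right structure of the transition network model in FIG. 1 can be sketched as a transition probability matrix. The following is a minimal illustration with hypothetical values for a three-state model; the actual probabilities are parameters estimated in training:

```python
import numpy as np

# Hypothetical left-to-right transition network with N = 3 states.
# Row i, column j holds the probability of transiting from state i to
# state j; each state may loop on itself or advance to the next state.
trans = np.array([
    [0.6, 0.4, 0.0],   # S1 -> S1 (self-loop) or S1 -> S2
    [0.0, 0.7, 0.3],   # S2 -> S2 or S2 -> S3
    [0.0, 0.0, 1.0],   # S3 is the final state
])

# Each row must sum to 1 so that the model defines valid probabilities.
assert np.allclose(trans.sum(axis=1), 1.0)
```

The zeros below the diagonal encode the left-to-right constraint: a state never transits backward to an earlier state.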
In the CDHMM speech recognition, the model has two parameters, transition probabilities p(k,i,j) and output probabilities g(k,i,j), where

p(k,i,j): probability of transiting from a state Si to a next state Sj in a model of a category k, and

g(k,i,j): probability of outputting a feature vector x in the transition from the state Si to the state Sj in the model of the category k.

If it is assumed that the same feature vector is output in a self-loop from Si to Si and in a transition from Si to Sj, g(k,i,j) can be expressed as g(k,s) using a state s. For the sake of simplicity, g(k,s) is used in the description below. A speech signal is recognized by obtaining a conditional probability Pr(X|M) of each model M outputting a feature vector sequence X = x1, x2, . . . , and evaluating the obtained conditional probability.
FIG. 2 shows a conventional CDHMM speech recognition system.
In this system, a feature extractor 11 extracts a sequence of feature vectors x from an input speech. A switching section SW is switched to supply the feature vector sequence X to a CDHMM processor 12 in a recognition mode. The CDHMM processor 12 reads out average vectors μ(k,s) and covariance matrices C(k,s), which are provided for categories k and states s and stored in a memory section 13, and defines the CDHMMs of the categories k based on the readout average vectors μ(k,s) and covariance matrices C(k,s). More specifically, the CDHMM processor 12 initially calculates the following equation (1) to obtain values g(k,s) for the states of each model M:

g(k,s) = P(k) · (2π)^(-n/2) · |C(k,s)|^(-1/2) · exp{-(1/2) · (x - μ(k,s))^T · C^(-1)(k,s) · (x - μ(k,s))}   (1)

where n denotes the number of dimensions of the feature vector x.
In equation (1), P(k) represents a fixed value of the probability that a category k appears, T represents transposition, and C^(-1)(k,s) represents the inverse matrix of C(k,s). The CDHMM processor 12 accumulates the obtained values g(k,s) along the time axis by means of the well-known Viterbi algorithm (e.g., Seiichi Nakagawa, "Speech Recognition by Probability Models", Institute of Electronic and Communication Engineers of Japan, 3.1.3-(c), pp. 44-46) to obtain the conditional probability Pr(X|M) of each model M. A discriminator 15 produces a result of recognition indicating the model M whose conditional probability Pr(X|M) is maximized.
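The per-state score of equation (1) and its accumulation along the time axis can be sketched as follows. This is an illustrative log-domain sketch, not the patent's implementation; the function names and the use of the best Viterbi path score as a stand-in for Pr(X|M) are choices made here for clarity:

```python
import numpy as np

def log_g(x, mu, C, log_Pk=0.0):
    """Log of the output score g(k,s) of equation (1): a Gaussian
    density N(x; mu, C) scaled by the fixed category prior P(k)."""
    n = x.shape[0]
    diff = x - mu
    _, logdet = np.linalg.slogdet(C)
    quad = diff @ np.linalg.inv(C) @ diff    # (x - mu)^T C^-1 (x - mu)
    return log_Pk - 0.5 * (n * np.log(2 * np.pi) + logdet + quad)

def viterbi_score(X, mus, Cs, log_trans):
    """Accumulate log g(k,s) along the time axis with the Viterbi
    recursion over a left-to-right model; returns the log score of the
    best state path ending in the final state SN."""
    T, N = len(X), len(mus)
    out = np.array([[log_g(x, mus[s], Cs[s]) for s in range(N)] for x in X])
    delta = np.full(N, -np.inf)
    delta[0] = out[0, 0]                     # start in the initial state S1
    for t in range(1, T):
        delta = np.array([
            np.max(delta + log_trans[:, j]) + out[t, j] for j in range(N)
        ])
    return delta[-1]                         # score of the final state SN
```

Working in the log domain avoids numerical underflow when the frame-by-frame scores are multiplied over a long utterance; maximizing the log score is equivalent to maximizing the probability itself.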
The switching section SW is switched to supply a feature vector sequence X to a training section 14 in a training mode. The training section 14 estimates μ(k,s) and C(k,s) from the feature vector sequence X, which are required for determining the parameters (i.e., p(k,i,j) and g(k,s)) of the model M whose probability Pr(X|M) is maximized. This parameter estimation can be performed by means of the well-known forward-backward algorithm (e.g., Seiichi Nakagawa, "Speech Recognition by Probability Models", Institute of Electronic and Communication Engineers of Japan, 3.3.2, pp. 69-73).
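As a rough illustration of what the training section 14 estimates, the sketch below computes μ(k,s) and C(k,s) as the sample mean and covariance of the feature vectors assigned to one state. This hard-assignment (Viterbi-style) estimate is a simplification chosen here; the forward-backward algorithm instead weights each vector by its state occupancy probability:

```python
import numpy as np

def estimate_state_params(assigned_vectors):
    """Sample mean and covariance of the feature vectors assigned to
    one state: stand-ins for the average vector mu(k,s) and covariance
    matrix C(k,s) under a hard state assignment."""
    Xs = np.asarray(assigned_vectors, dtype=float)
    mu = Xs.mean(axis=0)
    diff = Xs - mu
    C = diff.T @ diff / len(Xs)   # maximum-likelihood (biased) estimate
    return mu, C
```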
As described above, the CDHMM processor 12 performs the aforementioned processing on the feature vector sequence of an uttered speech input, and the discriminator 15 discriminates the category of the model M whose probability Pr(X|M) is maximized. The continuous density HMM scheme can in theory achieve a higher recognition rate than the discrete HMM scheme if the covariance matrices C(k,s) have a large number of dimensions.
However, the conventional CDHMM speech recognition is not suitable for practical use, since a large quantity of training data is required for forming the large covariance matrices C(k,s), and a long processing time is required for calculating them. In order to solve these problems, a method using only the diagonal elements of the covariance matrices, and a mixture density HMM scheme in which a plurality of distributions are prepared with respect to feature vectors, have been proposed. Although these approaches alleviate the problems, they fail to achieve a good recognition rate.
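The diagonal-covariance simplification mentioned above can be sketched as follows (illustrative only; with a diagonal C(k,s), the matrix inverse and determinant of equation (1) reduce to element-wise operations over the variances, which is what makes the method cheap):

```python
import numpy as np

def log_g_diag(x, mu, var):
    """Log output score using only the diagonal elements of the
    covariance matrix: the quadratic form and the determinant reduce
    to element-wise sums, avoiding the costly matrix inverse of the
    full-covariance case."""
    n = x.shape[0]
    diff = x - mu
    return -0.5 * (n * np.log(2 * np.pi)
                   + np.sum(np.log(var))
                   + np.sum(diff * diff / var))
```

The saving comes at the cost of ignoring correlations between feature dimensions, which is one reason the recognition rate suffers.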