1. Field of the Invention
The present invention relates to a method for generating standard patterns for pattern recognition using a mixture model, and more particularly to a speech recognition apparatus using a hidden Markov model (HMM) that uses a Gaussian mixture (mixed Gaussian distribution) as an output probability distribution.
2. Related Art
In recent years research has been conducted with regard to machine recognition of speech patterns, and various methods have been proposed. Of these, a typical method is one using a hidden Markov model (HMM). Speaker-independent speech recognition systems, which recognize any speaker's voice, using a hidden Markov model have been the subject of active research and development.
A speech recognition system using the hidden Markov model is described below by way of example, with reference to FIG. 2. The voice of a speaker is input to an input pattern generating means 101 of the speech recognition apparatus and subjected to processing such as A/D conversion and voice analysis. The processed voice is then converted into a time series of feature vectors, one for each unit of predetermined time length called a frame.
The time series of the feature vectors is here referred to as an input pattern. The frame length is normally in the approximate range from 10 ms to 100 ms.
Each feature vector is a set of feature quantities extracted from the voice spectrum, and in general has 10 to 100 dimensions.
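As a rough illustration of the framing step described above, the following Python sketch splits a sampled signal into fixed-length frames and computes one value per frame. The function names, the 8 kHz sampling rate, the 20 ms frame length, and the use of mean energy as a one-dimensional stand-in for a spectral feature vector are all assumptions made for illustration; they are not taken from the apparatus described here.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms):
    """Split samples into consecutive, non-overlapping frames of frame_ms each."""
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must cover at least one sample")
    n_frames = len(samples) // frame_len  # a trailing partial frame is dropped
    return [samples[k * frame_len:(k + 1) * frame_len] for k in range(n_frames)]


def frame_energy(frame):
    """Mean energy of one frame (hypothetical one-dimensional feature)."""
    return sum(x * x for x in frame) / len(frame)


# Synthetic signal: 1600 samples, i.e. 0.2 s at the assumed 8 kHz rate.
samples = [((n % 8) - 4) / 4.0 for n in range(1600)]
frames = split_into_frames(samples, sample_rate_hz=8000, frame_ms=20)

# The input pattern is the time series of per-frame feature vectors.
input_pattern = [[frame_energy(f)] for f in frames]
```

A real front end would compute a spectral feature vector of 10 to 100 dimensions per frame instead of a single energy value.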
The HMM is stored in a standard pattern storage means 103. The HMM is one model of a voice information source, and its parameters can be learned from a speaker's voice. The method of recognition using the HMM is described here in detail. In general, an HMM is prepared for each recognition unit.
Here, a phoneme is taken as an example of a recognition unit. For example, in a speaker-independent speech recognition system, a speaker-independent HMM created by learning the voices of a large number of speakers is used as the HMM in a standard pattern storage means 103.
A word HMM is used in a recognition means 104 to perform recognition of the input patterns.
The HMM is a model of a voice information source that introduces a statistical approach into the description of standard patterns in order to cope with variations in voice patterns.
The HMM is described in detail in “Fundamentals of Speech Recognition”, Rabiner and Juang, 1993, Prentice Hall (hereinafter referred to as reference 1).
The HMM of each phoneme is made up of 1 to 10 states and the state transitions between them. In general, a starting state and an ending state are defined. At every unit time, a symbol is output at the current state and a state transition takes place.
The voice of each phoneme is represented as the time series of symbols output from the HMM during the state transitions from the starting state to the ending state.
The occurrence probability of a symbol in each state and the transition probability between each of the states are defined.
Transition probability parameters represent temporal variations of speech patterns.
Output probability parameters represent variations of speech patterns in tone of voice.
With the probability of the starting state fixed to a certain value, the probability of occurrence of a speech generated from the model can be obtained by multiplying the occurrence probability and the transition probability at each state transition.
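The multiplication described above can be sketched for a single, fixed state path as follows. This is a minimal illustration that assumes discrete output symbols and a two-state model with made-up probabilities; the function name and all numerical values are hypothetical.

```python
def path_probability(initial, transition, output, states, symbols):
    """Probability of emitting `symbols` while following one fixed state path."""
    # Start: probability of the starting state times its output probability.
    prob = initial[states[0]] * output[states[0]][symbols[0]]
    # Each later step multiplies in one transition and one output probability.
    for prev, cur, sym in zip(states, states[1:], symbols[1:]):
        prob *= transition[prev][cur] * output[cur][sym]
    return prob


initial = [1.0, 0.0]                      # starting state fixed to state 0
transition = [[0.6, 0.4], [0.0, 1.0]]     # transition[j][i]: from j to i
output = [{"a": 0.7, "b": 0.3}, {"a": 0.2, "b": 0.8}]

p = path_probability(initial, transition, output, states=[0, 0, 1], symbols="aab")
```

The probability that the model generates the speech at all is the sum of such products over every possible state path, which is what the forward probability computes efficiently.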
Conversely, when a speech is observed, its occurrence probability can be calculated on the assumption that the speech was generated from a certain HMM. In speech recognition using HMMs, an HMM is prepared for each candidate recognition target. When a speech is input, its occurrence probability is obtained for each HMM, the HMM giving the highest probability is determined to be the generation source, and the candidate recognition target corresponding to that HMM is taken as the recognition result.
As output probability parameters, there are the discrete probability distribution expression and continuous probability distribution expression, with the continuous probability distribution expression being used in the example here.
In the continuous probability distribution expression, a mixed continuous distribution, that is, a distribution obtained by adding a plurality of Gaussian distributions with weights, is often used.
In the following example, the output probability is expressed by a mixed continuous probability distribution.
Parameters such as the output probability parameters, the transition probability parameters, and the weights of the plurality of Gaussian distributions are learned in advance by the Baum-Welch algorithm, using learning voice provided for each model.
For example, consider recognition of 1000 words, that is, the case in which a single correct word is to be determined from among 1000 recognition candidate words.
First, in the case of word recognition, the HMMs for each phoneme are linked so as to generate the HMMs for the recognition candidate words.
In the case of 1000-word recognition, the word HMMs for 1000 words are generated. An input pattern O expressed as a time series of feature vectors is represented as Equation (1) below.
$$O = o_1, o_2, o_3, \ldots, o_t, \ldots, o_T \tag{1}$$
In the above, T represents the total number of frames of an input pattern.
Candidate recognition target words are denoted as W1, W2, . . . , WN, where N represents the number of candidate recognition target words.
Matching between the word HMM for each word Wn and an input pattern O is carried out using the following procedure. In the following, the suffix n will be omitted unless it is necessary.
First, with respect to a word HMM, the transition probability from a state j to a state i is represented as $a_{ji}$, the mixture weight of an output probability distribution as $c_{im}$, the mean vector of each element Gaussian distribution as $\mu_{im}$, and the covariance matrix as $\Sigma_{im}$. Here, t represents an input time, i and j represent states of the HMM, and m represents a mixture element number.
Then, the following recurrence formulas for the forward probability $\alpha_t(i)$ are evaluated.
The forward probability $\alpha_t(i)$ is the probability that the partial observed time series $o_1, o_2, \ldots, o_t$ has been output and the state at time t is i.
$$\alpha_1(i) = \pi_i \quad (i = 1, 2, \ldots, I) \tag{2}$$
$$\alpha_{t+1}(i) = \sum_{j} \alpha_t(j)\, a_{ji}\, b_i(O_{t+1}) \quad (i = 1, 2, \ldots, I;\ t = 1, \ldots, T) \tag{3}$$
In the above, $\pi_i$ represents the probability of the initial state being i.
In Equation (3), $b_i(O_{t+1})$ is defined by the following Equations (4) and (5).
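$$b_i(O_t) = \sum_{m} c_{im}\, N(O_t; \mu_{im}, \Sigma_{im}) \tag{4}$$
$$N(O_t; \mu_{im}, \Sigma_{im}) = (2\pi)^{-K/2}\, |\Sigma_{im}|^{-1/2} \exp\!\left(-\tfrac{1}{2}\,(\mu_{im} - O_t)^{\top}\, \Sigma_{im}^{-1}\, (\mu_{im} - O_t)\right) \tag{5}$$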
In Equation (5), K is the dimension of the input frame and the mean vector.
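The output probability of Equations (4) and (5) can be sketched in Python as follows. To keep the example within the standard library, it assumes diagonal covariance matrices, so that each covariance is given as a list of per-dimension variances; this restriction, the function names, and the sample values are assumptions for illustration only.

```python
import math


def gaussian_density_diag(o, mean, var):
    """Equation (5) for a diagonal covariance given as a list of variances."""
    k = len(o)  # dimension K of the input frame and the mean vector
    log_det = sum(math.log(v) for v in var)  # log |Sigma| for a diagonal Sigma
    quad = sum((x - m) ** 2 / v for x, m, v in zip(o, mean, var))
    return math.exp(-0.5 * (k * math.log(2.0 * math.pi) + log_det + quad))


def output_probability(o, weights, means, variances):
    """Equation (4): weighted sum of the element Gaussian densities."""
    return sum(c * gaussian_density_diag(o, mu, var)
               for c, mu, var in zip(weights, means, variances))


# One state with two element distributions over 2-dimensional frames.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [0.0, 0.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
b = output_probability([0.0, 0.0], weights, means, variances)
```

With both elements equal to a standard normal, the mixture value at the mean is the single-Gaussian peak 1/(2π) for K = 2.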
The likelihood of an input pattern for the word Wn is obtained by the following Equation (6).
$$P^{n}(X) = \alpha_T(I) \tag{6}$$
In Equation (6), I is the ending state.
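The forward recursion and the likelihood of Equation (6) can be sketched as follows. The sketch assumes that the output probabilities b[t][i] have already been computed for every frame and state, that time is indexed from 0, and that the initialisation multiplies in the output probability of the first frame, as in the standard formulation in reference 1. The two-state model and its values are hypothetical.

```python
def forward(pi, a, b):
    """Forward probabilities; a[j][i] is the transition probability from j to i,
    and b[t][i] is the output probability of frame t in state i."""
    n = len(pi)
    # Assumed initialisation: alpha_1(i) = pi_i * b_i(o_1).
    alpha = [[pi[i] * b[0][i] for i in range(n)]]
    for t in range(1, len(b)):
        prev = alpha[-1]
        # Recursion of Equation (3): sum over predecessor states j.
        alpha.append([sum(prev[j] * a[j][i] for j in range(n)) * b[t][i]
                      for i in range(n)])
    return alpha


pi = [1.0, 0.0]
a = [[0.6, 0.4], [0.0, 1.0]]
b = [[0.7, 0.2], [0.7, 0.2], [0.3, 0.8]]  # three frames, two states

alpha = forward(pi, a, b)
likelihood = alpha[-1][-1]  # alpha_T(I), taking the last state as the ending state
```

In practice the products underflow for long input patterns, so implementations usually work with scaled or logarithmic probabilities.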
This processing is performed for each word model, and the recognition result word $W_{\hat{n}}$ for the input pattern X is determined from the following Equation (7).
$$\hat{n} = \arg\max_{n} P^{n}(X) \tag{7}$$
The recognition result word $W_{\hat{n}}$ is sent to the recognition result output section. The recognition result output section outputs the recognition result to a screen or outputs a control command responsive to the recognition result to other units.
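The selection of Equation (7) can be sketched as a loop over word models that keeps the word with the highest likelihood. For brevity the sketch assumes discrete output symbols rather than Gaussian mixtures, and the two word models, their names, and their probabilities are hypothetical.

```python
def forward_likelihood(pi, a, b):
    """alpha_T(I): forward probability of the last state at the last frame."""
    n = len(pi)
    alpha = [pi[i] * b[0][i] for i in range(n)]
    for t in range(1, len(b)):
        alpha = [sum(alpha[j] * a[j][i] for j in range(n)) * b[t][i]
                 for i in range(n)]
    return alpha[-1]


def recognize(word_models, symbols):
    """Return the word whose model gives the highest likelihood, plus all scores."""
    scores = {}
    for word, (pi, a, output) in word_models.items():
        # b[t][i]: output probability of the t-th symbol in state i.
        b = [[state_output[s] for state_output in output] for s in symbols]
        scores[word] = forward_likelihood(pi, a, b)
    return max(scores, key=scores.get), scores


pi = [1.0, 0.0]
a = [[0.6, 0.4], [0.0, 1.0]]
word_models = {
    "word_a": (pi, a, [{"a": 0.7, "b": 0.3}, {"a": 0.2, "b": 0.8}]),
    "word_b": (pi, a, [{"a": 0.2, "b": 0.8}, {"a": 0.7, "b": 0.3}]),
}
best_word, scores = recognize(word_models, "aab")
```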
The standard pattern generating means 102 is described below. In the case of speaker-independent recognition, the standard pattern generating means 102 accumulates the speech of a large number of speakers beforehand and uses these speech samples for parameter prediction.
First, a backward probability is derived from Equation (8) and Equation (9), where the states are numbered from 1 to I.
$$\beta_T(i) = 1 \quad (i = 1, \ldots, I) \tag{8}$$
$$\beta_t(i) = \sum_{j=1}^{I} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j) \quad (t = T-1, T-2, \ldots, 1;\ i = 1, \ldots, I) \tag{9}$$
In Equation (9), $\beta_t(i)$ is the probability, given state i at time t, of the partial observed time series from time t+1 up to the end of the input pattern.
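The backward recursion of Equations (8) and (9) can be sketched as follows. The sketch assumes precomputed output probabilities b[t][i] and 0-based time indices; the model values are hypothetical.

```python
def backward(a, b):
    """Backward probabilities; a[i][j] is the transition probability from i to j,
    and b[t][j] is the output probability of frame t in state j."""
    n = len(a)
    beta = [[1.0] * n]  # Equation (8): beta_T(i) = 1 for every state
    for t in range(len(b) - 2, -1, -1):
        nxt = beta[0]  # beta_{t+1}
        # Equation (9): sum over successor states j.
        beta.insert(0, [sum(a[i][j] * b[t + 1][j] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


a = [[0.6, 0.4], [0.0, 1.0]]
b = [[0.7, 0.2], [0.7, 0.2], [0.3, 0.8]]  # three frames, two states
beta = backward(a, b)
```

As a consistency check, multiplying beta[0][i] by the initial probability and first-frame output probability of each state and summing over states gives the total likelihood of the input pattern over all final states.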
Using the forward probability and the backward probability, the probability, given an observed sequence O, that state i exists at time t is given by the following equation (10).
$$\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{i'=1}^{I} \alpha_t(i')\, \beta_t(i')} \tag{10}$$
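Equation (10) can be sketched as a per-frame normalisation of the products of forward and backward probabilities. The function name is hypothetical, and the alpha and beta tables below are example values for a small two-state, three-frame model rather than data from this document.

```python
def state_posteriors(alpha, beta):
    """gamma[t][i]: probability of being in state i at time t (Equation (10))."""
    gamma = []
    for alpha_t, beta_t in zip(alpha, beta):
        products = [x * y for x, y in zip(alpha_t, beta_t)]
        total = sum(products)  # normaliser: sum over all states
        gamma.append([p / total for p in products])
    return gamma


# Example forward and backward tables (rows are frames, columns are states).
alpha = [[0.7, 0.0], [0.294, 0.056], [0.05292, 0.13888]]
beta = [[0.274, 0.16], [0.5, 0.8], [1.0, 1.0]]
gamma = state_posteriors(alpha, beta)
```

Each row of gamma sums to 1, since the model must be in exactly one state at each time.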
The probability that state i exists at time t and state j exists at time t+1 is given by Equation (11).
$$\xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{\sum_{i'=1}^{I} \sum_{j'=1}^{I} \alpha_t(i')\, a_{i'j'}\, b_{j'}(O_{t+1})\, \beta_{t+1}(j')} \tag{11}$$
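Equation (11) can be sketched in the same way. The sketch assumes precomputed forward and backward tables and output probabilities b[t][j], with 0-based time indices; the function name and all values are hypothetical examples.

```python
def transition_posteriors(alpha, beta, a, b):
    """xi[t][i][j]: probability of state i at time t and state j at time t+1."""
    n = len(a)
    xi = []
    for t in range(len(alpha) - 1):
        joint = [[alpha[t][i] * a[i][j] * b[t + 1][j] * beta[t + 1][j]
                  for j in range(n)] for i in range(n)]
        total = sum(sum(row) for row in joint)  # double sum in the denominator
        xi.append([[v / total for v in row] for row in joint])
    return xi


a = [[0.6, 0.4], [0.0, 1.0]]
b = [[0.7, 0.2], [0.7, 0.2], [0.3, 0.8]]
alpha = [[0.7, 0.0], [0.294, 0.056], [0.05292, 0.13888]]
beta = [[0.274, 0.16], [0.5, 0.8], [1.0, 1.0]]
xi = transition_posteriors(alpha, beta, a, b)
```

Summing xi[t][i][j] over j recovers the probability of being in state i at time t.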
In the case of a mixed Gaussian distribution, the probability that state i is occupied at time t with the k-th mixture element (the occupying frequency) is given by the following Equation (12).
$$\gamma'_t(i,k) = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{i'=1}^{I} \alpha_t(i')\, \beta_t(i')} \times \frac{c_{ik}\, N(O_t; \mu_{ik}, \Sigma_{ik})}{\sum_{m=1}^{M} c_{im}\, N(O_t; \mu_{im}, \Sigma_{im})} \tag{12}$$
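Equation (12) splits the state occupancy among the mixture elements in proportion to their weighted densities. A minimal sketch, assuming the state occupancy and the element densities for one frame and one state have already been computed (the function name and the numbers are hypothetical):

```python
def component_occupancy(gamma_ti, weights, densities):
    """Split the state occupancy gamma_t(i) among the mixture elements.

    weights[k] is c_ik and densities[k] is N(O_t; mu_ik, Sigma_ik).
    """
    weighted = [c * n for c, n in zip(weights, densities)]
    total = sum(weighted)  # equals the output probability b_i(O_t)
    return [gamma_ti * w / total for w in weighted]


occupancy = component_occupancy(0.8, weights=[0.5, 0.5], densities=[0.3, 0.1])
```

The element occupancies add up to the state occupancy, here 0.8.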
Based on the foregoing equations, the prediction values of π, a, c, μ, and Σ are given by Equations (13) through (17).
$$\bar{\pi}_i = \gamma_1(i) \tag{13}$$
$$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \tag{14}$$
$$\bar{c}_{jk} = \frac{\sum_{t=1}^{T} \gamma'_t(j,k)}{\sum_{t=1}^{T} \gamma_t(j)} \tag{15}$$
$$\bar{\mu}_{jk} = \frac{\sum_{t=1}^{T} \gamma'_t(j,k)\, O_t}{\sum_{t=1}^{T} \gamma'_t(j,k)} \tag{16}$$
$$\bar{\Sigma}_{jk} = \frac{\sum_{t=1}^{T} \gamma'_t(j,k)\, (O_t - \mu_{jk})(O_t - \mu_{jk})^{\top}}{\sum_{t=1}^{T} \gamma'_t(j,k)} \tag{17}$$
In the Baum-Welch algorithm, the parameters are updated based on these prediction values, and the updated parameters are then used to compute the prediction values again, repeatedly.
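The update of the mixture weight, mean, and covariance for one element distribution, as in Equations (15) through (17), can be sketched as follows. The sketch assumes scalar (one-dimensional) feature values, so that the covariance reduces to a variance, and it uses the newly computed mean when computing the variance; both choices, the function name, and the numbers are assumptions for illustration.

```python
def reestimate_component(obs, gamma_jk, gamma_j):
    """Re-estimate weight, mean and variance of element k of state j.

    obs[t] is the scalar observation at time t, gamma_jk[t] is the element
    occupancy and gamma_j[t] is the state occupancy at time t.
    """
    occupancy = sum(gamma_jk)
    weight = occupancy / sum(gamma_j)                               # cf. Equation (15)
    mean = sum(g * o for g, o in zip(gamma_jk, obs)) / occupancy    # cf. Equation (16)
    variance = sum(g * (o - mean) ** 2
                   for g, o in zip(gamma_jk, obs)) / occupancy      # cf. Equation (17)
    return weight, mean, variance


obs = [1.0, 2.0, 3.0]
gamma_jk = [0.2, 0.6, 0.2]   # hypothetical element occupancies
gamma_j = [0.5, 1.0, 0.5]    # hypothetical state occupancies
weight, mean, variance = reestimate_component(obs, gamma_jk, gamma_j)
```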
It has been proven that, at each iteration, the likelihood of the observed sequence does not decrease.
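The iterative behaviour can be observed with a small experiment. The sketch below is not the procedure of the apparatus described here: it assumes a discrete-output HMM instead of Gaussian mixtures, takes the likelihood as the sum of the forward probabilities over all final states, and uses made-up initial parameters and observations. It only shows that the likelihood does not go down from one iteration to the next.

```python
def forward(pi, a, b):
    n = len(pi)
    alpha = [[pi[i] * b[0][i] for i in range(n)]]
    for t in range(1, len(b)):
        alpha.append([sum(alpha[-1][j] * a[j][i] for j in range(n)) * b[t][i]
                      for i in range(n)])
    return alpha


def backward(a, b):
    n = len(a)
    beta = [[1.0] * n]
    for t in range(len(b) - 2, -1, -1):
        beta.insert(0, [sum(a[i][j] * b[t + 1][j] * beta[0][j] for j in range(n))
                        for i in range(n)])
    return beta


def baum_welch_step(pi, a, e, obs):
    """One re-estimation pass; returns updated parameters and the likelihood
    of `obs` under the parameters passed in."""
    n, n_sym, T = len(pi), len(e[0]), len(obs)
    b = [[e[i][o] for i in range(n)] for o in obs]
    alpha, beta = forward(pi, a, b), backward(a, b)
    like = sum(alpha[-1])
    gamma = [[alpha[t][i] * beta[t][i] / like for i in range(n)] for t in range(T)]
    # Summed transition posteriors divided by summed state posteriors.
    new_a = [[sum(alpha[t][i] * a[i][j] * b[t + 1][j] * beta[t + 1][j]
                  for t in range(T - 1)) / like
              / sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    # Discrete output probabilities: expected symbol counts per state.
    new_e = [[sum(gamma[t][i] for t in range(T) if obs[t] == s)
              / sum(gamma[t][i] for t in range(T))
              for s in range(n_sym)] for i in range(n)]
    return gamma[0], new_a, new_e, like


pi, a, e = [0.6, 0.4], [[0.7, 0.3], [0.4, 0.6]], [[0.6, 0.4], [0.3, 0.7]]
obs = [0, 1, 0, 0, 1, 1, 0, 1]
likelihoods = []
for _ in range(6):
    pi, a, e, like = baum_welch_step(pi, a, e, obs)
    likelihoods.append(like)
```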
The foregoing is an example of using the HMM, which was used in the past.
As described above, there are discrete distribution expression and continuous distribution expression as representations of output probability.
Of the two distribution expressions, the continuous distribution expression, and the mixed Gaussian distribution expression in particular, is often used.
The reason for using the mixed Gaussian distribution is that it provides superior performance in expressing the output probability distribution.
In the case of using the mixed Gaussian distribution (herein referred to simply as the mixed distribution), there is no clear-cut guide as to how many element distributions should be used.
In a mixed distribution HMM, it is usual to take the number of element distributions as being the same for all states, to test with different numbers of element distributions, and to select the number that gives the best performance.
However, it can be expected that the required number of element distributions will differ, depending upon the state.
For example, if an unnecessarily large number of element distributions are made, this leads to an increase in the amount of calculation required to calculate the probability of the element distribution.
For a state having a low probability of occurrence, over-learning may occur in the process of parameter prediction, with a possible deterioration of performance with regard to unknown data.
Therefore, it is desirable that the number of element distributions at each state of a mixed distribution HMM be optimized for each state.
The simplest method of optimizing the number of element distributions for each state is that of performing recognition experiments as the number of element distributions is changed for each state, and selecting the number of element distributions with the highest recognition performance for each state.
Because the overall number of HMM states is very large, usually from 1000 to 10000, optimizing the number of element distributions for each state in this way is virtually impossible from the standpoint of the amount of calculation required.
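The scale of the problem can be made concrete with a small calculation. The sketch below assumes 1000 states and, purely for illustration, four candidate numbers of element distributions per state; the candidate count and the function names are hypothetical and do not come from this description.

```python
def per_state_experiments(n_states, n_candidates):
    """Recognition experiments needed when one state is varied at a time."""
    return n_states * n_candidates


def joint_configurations(n_states, n_candidates):
    """Number of distinct ways to assign a candidate count to every state."""
    return n_candidates ** n_states


n_states, n_candidates = 1000, 4
experiments = per_state_experiments(n_states, n_candidates)
configurations = joint_configurations(n_states, n_candidates)
digits = len(str(configurations))  # size of the joint search space in decimal digits
```

Even the one-state-at-a-time scheme needs thousands of full recognition experiments, and a joint search over all states is far beyond reach.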
Accordingly, it is an object of the present invention, given the above-described background, to provide a speech recognition apparatus which performs adjustment of the number of element distributions effectively and at a high speed in a probability model using a mixed distribution.