The present invention relates to a method, apparatus and computer program for preparing an acoustic model used for speech recognition. More particularly, the present invention relates to a method, apparatus and computer program for preparing an acoustic model adapted to the voice of a person using speech recognition and the environment in which speech recognition is used.
In recent years, speech recognition technology has been expected to improve the convenience of digital information equipment such as cellular phones, personal digital assistants (PDAs), car navigation systems, personal computers and home electric appliances for their users.
In a speech recognition system, if no acoustic model is appropriate to a user, the user is prevented from using the speech recognition system. Therefore, it is necessary for the speech recognition system to provide an acoustic model adapted to the voice of a user. There are various techniques for adapting an acoustic model to the voice of a person using a speech recognition system (speaker adaptation techniques) as shown in FIG. 1. FIG. 1 is a map of various speaker adaptation techniques placed at positions corresponding to the levels of the computer power and the hard disk capacity of the system considered necessary to realize the respective speaker adaptation techniques. The map also includes, for the respective speaker adaptation techniques, “the number of sentences the user must utter for adaptation”, “variation factors acceptable by the speaker adaptation technique (speaker individuality, voice tone)”, and “recognition performance (indicated by the size of the asterisk; as the asterisk is larger, the performance is higher)”.
Conventionally, because information equipment was low in computer power and could mount only a small-capacity hard disk, only speaker adaptation techniques low in recognition performance, such as “normalization of vocal tract length” and “MLLR+eigen voice space”, were available. With the increase in the computer power of information equipment, the speaker adaptation techniques “MLLR” and “CAT”, which can exhibit high recognition performance using the increased computer power, have become available. However, in these speaker adaptation techniques, the number of sentences the user must utter for adaptation of an acoustic model is comparatively large, and this places a large burden on the user. In addition, these techniques are not appropriate to information equipment whose user changes frequently (for example, remote controllers of TV sets). Nor are these techniques appropriate to equipment comparatively small in computer power, such as home electric appliances and cellular phones.
In recent years, hard disks have become larger in capacity and less expensive. In this situation, speaker adaptation techniques such as a “method using clustering” and a “method using sufficient statistics”, which use a hard disk with a comparatively large capacity but require only comparatively low computer power, have appeared. These speaker adaptation techniques are appropriate to car navigation systems, in which the capacity of the mounted hard disk has been increasing, and to equipment comparatively low in computer power such as cellular phones and home electric appliances including TV sets. Small-size home electric appliances and cellular phones cannot mount a large-capacity hard disk. However, due to recent progress permitting communications with a large-capacity server through a network, the above speaker adaptation techniques have become available for such small-size equipment. In these speaker adaptation techniques, the number of sentences the user must utter for adaptation of an acoustic model can be reduced (to about one sentence). This reduces the burden on the user, and also enables instantaneous use even when the user changes. However, in the “method using clustering”, in which one HMM similar to the user is selected and used as the adapted model, the recognition performance will be greatly degraded if no HMM similar to the user is available.
In view of the above, it is considered that the speaker adaptation technique most appropriate to cellular phones, home electric appliances and the like is the “method using sufficient statistics” (Shinichi Yoshizawa, Akira Baba et al. “Unsupervised phoneme model training based on the sufficient HMM statistics from selected speakers”, Technical Report of IEICE, SP2000-89, pp.83–88 (2000)). According to this report, a high-precision adapted model (an acoustic model adapted to the voice of a user) can be obtained with one sentence utterance of the user.
A procedure for preparing an adapted model by the “method using sufficient statistics” will be described with reference to FIGS. 2 and 3.
Preparation of Selection Models and Sufficient Statistics (ST200)
Speech data of a number of speakers (for example about 300 speakers) recorded in a quiet environment is stored in advance in a speech database 310 (FIG. 3).
A selection model (represented by a Gaussian mixture model (GMM) in this case) and a sufficient statistic (represented by a hidden Markov model (HMM) in this case) are prepared for each speaker using the speech data stored in the database 310, and stored in a sufficient statistic file 320 (FIG. 3). The “sufficient statistic” refers to a statistic sufficient to represent the nature of a database, which includes the mean, variance and EM count of an HMM acoustic model in this case. The sufficient statistic is calculated by one-time training from a speaker-independent model using the EM algorithm. The selection model is prepared in the form of a Gaussian mixture model with 64 mixture components per state without distinction of the phoneme.
The preparation of sufficient statistics will be described in detail with reference to FIG. 4.
In step ST201, a speaker-independent sufficient statistic is prepared. In this case, the preparation is made by conducting training with data of all speakers using the EM algorithm. The sufficient statistic is represented by a hidden Markov model, with each state being represented by a mixed Gaussian distribution. Numbers are given to the Gaussian distributions of the prepared speaker-independent sufficient statistic.
In step ST202, sufficient statistics for the respective speakers are prepared using the prepared speaker-independent sufficient statistic as the initial value. In this case, the preparation is made by conducting training with data of the respective speakers using the EM algorithm. Numbers corresponding to the numbers given to the speaker-independent sufficient statistic are stored in association with the Gaussian distributions of the sufficient statistics of the respective speakers.
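The one-time training of step ST202 can be illustrated by the following minimal sketch. It is not the implementation of the cited report: it assumes a single one-dimensional Gaussian mixture in place of an HMM state, and the names `gauss_pdf` and `one_pass_statistics` are hypothetical. It shows only how one EM pass started from the speaker-independent values yields an EM count, mean and variance for each Gaussian while keeping the number given in step ST201.

```python
import math


def gauss_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian distribution (var must be positive)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def one_pass_statistics(frames, init_weights, init_means, init_vars):
    """One EM pass over one speaker's frames, starting from the
    speaker-independent mixture. The list index i plays the role of the
    number given to each Gaussian distribution and is never reordered."""
    n = len(init_means)
    count = [0.0] * n   # EM count (accumulated occupancy) per Gaussian
    sum_x = [0.0] * n   # occupancy-weighted sum of x
    sum_xx = [0.0] * n  # occupancy-weighted sum of x squared
    for x in frames:
        joint = [w * gauss_pdf(x, m, v)
                 for w, m, v in zip(init_weights, init_means, init_vars)]
        total = sum(joint)
        if total == 0.0:
            continue  # frame underflowed under every Gaussian; skip it
        for i in range(n):
            gamma = joint[i] / total  # posterior probability of Gaussian i
            count[i] += gamma
            sum_x[i] += gamma * x
            sum_xx[i] += gamma * x * x
    stats = []
    for i in range(n):
        if count[i] > 0.0:
            mean = sum_x[i] / count[i]
            var = sum_xx[i] / count[i] - mean ** 2
        else:
            # Assumption of this sketch: keep the initial values when a
            # Gaussian received no data from this speaker.
            mean, var = init_means[i], init_vars[i]
        stats.append({"number": i, "count": count[i], "mean": mean, "var": var})
    return stats
```

Because every speaker's statistics are derived from the same initial value, Gaussian distributions having the same number can later be combined across speakers.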
Input of Voice Data for Adaptation (ST210)
The voice of a user is input.
Selection of Sufficient Statistics Using Selection Models (ST220)
A plurality of sufficient statistics “similar” to the voice of the user (acoustic models of speakers acoustically similar to the user's voice) are selected based on the input voice and the selection models. The sufficient statistics “similar” to the user's voice are determined by inputting the input voice into the selection models to obtain the likelihood of each selection model, and then obtaining the sufficient statistics of the speakers corresponding to the N selection models with the highest likelihood. This selection is performed by an adapted model preparation section 330 shown in FIG. 3 in the manner shown in FIG. 5.
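The selection of step ST220 can be sketched as follows. This is an illustration under simplifying assumptions, not the procedure of FIG. 5: the input voice is taken to be a list of one-dimensional feature values, each selection model is a small one-dimensional GMM given as a tuple of weights, means and variances, and the function names and the dictionary layout are hypothetical.

```python
import math


def gmm_log_likelihood(frames, weights, means, variances):
    """Total log-likelihood of the frames under a one-dimensional GMM."""
    total = 0.0
    for x in frames:
        p = sum(w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
                for w, m, v in zip(weights, means, variances))
        total += math.log(max(p, 1e-300))  # floor avoids log(0) on underflow
    return total


def select_top_n(frames, selection_models, n):
    """Return the n speaker identifiers whose selection models give the
    highest likelihood for the input frames, best first.
    selection_models maps speaker id -> (weights, means, variances)."""
    ranked = sorted(
        selection_models,
        key=lambda spk: gmm_log_likelihood(frames, *selection_models[spk]),
        reverse=True,
    )
    return ranked[:n]
```

The sufficient statistics stored for the returned speakers are then the ones used to prepare the adapted model.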
Preparation of Adapted Model (ST230)
An adapted model is prepared using the selected sufficient statistics. Specifically, a statistics calculation (Equations 1 to 3) is newly performed among the Gaussian distributions having the same number in the selected sufficient statistics, to obtain one Gaussian distribution for each number. This preparation of an adapted model is performed by the adapted model preparation section 330 shown in FIG. 3 in the manner shown in FIG. 5.
$$\mu_i^{adp} = \frac{\sum_{j=1}^{N_{sel}} C_{mix}^{j}\,\mu_i^{j}}{\sum_{j=1}^{N_{sel}} C_{mix}^{j}} \quad (i = 1, 2, \ldots, N_{mix}) \qquad \text{(Equation 1)}$$

$$v_i^{adp} = \frac{\sum_{j=1}^{N_{sel}} C_{mix}^{j}\left(v_i^{j} + (\mu_i^{j})^{2}\right)}{\sum_{j=1}^{N_{sel}} C_{mix}^{j}} - \left(\mu_i^{adp}\right)^{2} \quad (i = 1, 2, \ldots, N_{mix}) \qquad \text{(Equation 2)}$$

$$a^{adp}[i][j] = \frac{\sum_{k=1}^{N_{sel}} C_{state}^{k}[i][j]}{\sum_{j=1}^{N_{state}} \sum_{k=1}^{N_{sel}} C_{state}^{k}[i][j]} \quad (i, j = 1, 2, \ldots, N_{state}) \qquad \text{(Equation 3)}$$
In the above equations, the mean and variance of the normal distribution in each state of the HMM of the adapted model are expressed by $\mu_i^{adp}$ (i = 1, 2, ..., $N_{mix}$) and $v_i^{adp}$ (i = 1, 2, ..., $N_{mix}$), where $N_{mix}$ is the number of mixed distributions. The state transition probability is expressed by $a^{adp}[i][j]$ (i, j = 1, 2, ..., $N_{state}$), where $N_{state}$ is the number of states, and $a^{adp}[i][j]$ represents the transition probability from state i to state j. $N_{sel}$ denotes the number of acoustic models selected, and $\mu_i^{j}$ (i = 1, 2, ..., $N_{mix}$ and j = 1, 2, ..., $N_{sel}$) and $v_i^{j}$ (i = 1, 2, ..., $N_{mix}$ and j = 1, 2, ..., $N_{sel}$) are the mean and variance, respectively, of the respective acoustic models. $C_{mix}^{j}$ (j = 1, 2, ..., $N_{sel}$) and $C_{state}^{k}[i][j]$ (k = 1, 2, ..., $N_{sel}$ and i, j = 1, 2, ..., $N_{state}$) are the EM count (frequency) in the normal distribution and the EM count related to the state transition, respectively.
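The statistics calculation of Equations 1 to 3 can be sketched as follows. The sketch assumes scalar means and variances, one EM count per selected speaker as the notation above suggests, and transition counts whose rows do not sum to zero. The function names and the dictionary layout are hypothetical and are not taken from the cited report.

```python
def combine_gaussians(selected):
    """Equations 1 and 2. `selected` is a list with one entry per selected
    speaker j: {"count": C_mix^j, "mean": [mu_1^j, ...], "var": [v_1^j, ...]}.
    Returns the adapted means and variances, indexed by Gaussian number i."""
    total = sum(s["count"] for s in selected)
    n_mix = len(selected[0]["mean"])
    means, variances = [], []
    for i in range(n_mix):
        # Equation 1: count-weighted average of the means
        mean = sum(s["count"] * s["mean"][i] for s in selected) / total
        # Equation 2: count-weighted second moment minus the squared adapted mean
        second = sum(s["count"] * (s["var"][i] + s["mean"][i] ** 2)
                     for s in selected) / total
        means.append(mean)
        variances.append(second - mean ** 2)
    return means, variances


def combine_transitions(counts):
    """Equation 3. counts[k][i][j] is C_state^k[i][j] for selected speaker k.
    Assumes every summed row has a positive total."""
    n_state = len(counts[0])
    summed = [[sum(c[i][j] for c in counts) for j in range(n_state)]
              for i in range(n_state)]
    return [[summed[i][j] / sum(summed[i]) for j in range(n_state)]
            for i in range(n_state)]
```

For example, two selected speakers with EM counts 1 and 3, means 0.0 and 4.0 and variances 1.0 each give an adapted mean of 3.0 and an adapted variance of 4.0. The adapted variance is larger than either input variance because it also absorbs the spread between the two speakers' means.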
Recognition (ST240)
A speech recognition system 300 (FIG. 3) recognizes the user's voice using the adapted model prepared as described above.
The “method using sufficient statistics” described above makes the approximation that the positional relationship among the Gaussian distributions of the speaker-independent sufficient statistic (initial value) is equal to the positional relationship among the Gaussian distributions of the sufficient statistics for the respective speakers. In other words, it is presumed that in the calculation of sufficient statistics of speech data from the initial-value sufficient statistic, only the mixture weight, the mean value and the variance are trained while the positional relationship among the Gaussian distributions is maintained. More specifically, it is presumed that, for a certain Gaussian distribution of the sufficient statistic of certain speech data, the Gaussian distribution of the initial-value sufficient statistic located closest to it in a distribution distance such as a KL distance has the same number as that Gaussian distribution. This presumption holds in a quiet environment (see FIG. 4). This approach is therefore effective as an adapted model preparation method “in a quiet environment”. Practically, however, preparation of an adapted model “in a noisy environment” must also be considered. In such an environment, the above presumption does not hold, as shown in FIG. 6, and thus the precision of the adapted model decreases.
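The presumption can be checked numerically, as in the following sketch. It assumes one-dimensional Gaussian distributions given as (mean, variance) pairs and uses the Kullback-Leibler divergence from the speaker's Gaussian to the initial-value Gaussian as the “KL distance”. The text does not specify the direction or any symmetrization, so that choice and the function names are assumptions of this sketch.

```python
import math


def kl_gauss(mean_p, var_p, mean_q, var_q):
    """KL divergence KL(p || q) between two one-dimensional Gaussians."""
    return 0.5 * (math.log(var_q / var_p)
                  + (var_p + (mean_p - mean_q) ** 2) / var_q
                  - 1.0)


def numbering_preserved(initial, speaker):
    """Return True if, for every Gaussian number i of the speaker's
    sufficient statistic, the closest Gaussian of the initial-value
    sufficient statistic also has number i. Both arguments are lists of
    (mean, variance) pairs indexed by Gaussian number."""
    for i, (mean, var) in enumerate(speaker):
        nearest = min(
            range(len(initial)),
            key=lambda k: kl_gauss(mean, var, initial[k][0], initial[k][1]),
        )
        if nearest != i:
            return False
    return True
```

With initial-value Gaussians at means 0 and 5, a speaker whose Gaussians stay near 0 and 5 passes the check. A speaker whose Gaussian number 0 has drifted to a mean of 3, as noise might cause, fails it because that Gaussian is now closer to initial-value Gaussian number 1.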