One known type of device for creating a speaker adaptive model for use in voice recognition processing and the like selects, from among a large number of speaker models prestored in a storage unit, a speaker model whose acoustic feature value is similar to that of an utterance speaker, and creates a speaker adaptive model for the utterance speaker based on the selected speaker model. Examples of a speaker selecting device in such a speaker adaptive model creating device are disclosed in Non Patent Document 1 and Patent Document 1. Note that selecting a speaker model whose acoustic feature value is similar to that of the utterance speaker is hereinafter referred to as “selecting a speaker” or “speaker selection”. Further, the “speaker adaptive model” is also referred to as an “adaptive model”.
The adaptive model creating method employed in the speaker adaptive model creating device disclosed in Non Patent Document 1 selects speakers whose acoustic feature values are similar to that of the utterance speaker, and creates a phoneme model adapted to the utterance speaker by using the sufficient statistics of the selected speakers. The method consists mainly of three steps.
First, a sufficient statistic relating to an HMM (Hidden Markov Model) is calculated and accumulated for each speaker. The sufficient statistic refers to a statistic sufficient to represent the nature of a database; in the method disclosed in Non Patent Document 1, it comprises the mean, variance, and EM count of a phoneme model described by an HMM. The EM count refers to the probabilistic frequency of transition from a state i to a state j of a Gaussian distribution k in an EM algorithm. The sufficient statistic is calculated by performing one iteration of EM training, starting from an unspecified speaker (speaker-independent) model, using the voice data of each speaker.
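The accumulation step can be illustrated with a deliberately simplified sketch. It assumes a one-dimensional Gaussian mixture in place of the HMM used in Non Patent Document 1, so state transitions are ignored, and the function name `accumulate_stats` and its arguments are hypothetical. One E-step is run from an initial (speaker-independent) model, and the per-mixture EM count, mean, and variance are returned.

```python
import math


def accumulate_stats(frames, weights, means, variances):
    """One EM pass over scalar frames, starting from an initial mixture model.

    Simplifying assumption: a 1-D Gaussian mixture stands in for the HMM.
    Returns (counts, new_means, new_variances), one entry per mixture.
    """
    n_mix = len(weights)
    counts = [0.0] * n_mix   # soft occupancy ("EM count") per mixture
    first = [0.0] * n_mix    # occupancy-weighted sum of x
    second = [0.0] * n_mix   # occupancy-weighted sum of x squared

    for x in frames:
        # Weighted likelihood of this frame under each mixture component.
        likes = [
            w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
            for w, m, v in zip(weights, means, variances)
        ]
        norm = sum(likes)
        if norm == 0.0:
            continue  # frame too far from every component; skip it
        for i, like in enumerate(likes):
            gamma = like / norm  # posterior probability of component i
            counts[i] += gamma
            first[i] += gamma * x
            second[i] += gamma * x * x

    new_means, new_vars = [], []
    for i in range(n_mix):
        if counts[i] > 0.0:
            mean = first[i] / counts[i]
            new_means.append(mean)
            new_vars.append(second[i] / counts[i] - mean * mean)
        else:
            # No data assigned: keep the initial parameters.
            new_means.append(means[i])
            new_vars.append(variances[i])
    return counts, new_means, new_vars
```

In this sketch the three returned lists play the role of the stored sufficient statistic for one speaker; a real system would hold one such set per speaker and per phoneme model.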
Next, speakers whose acoustic feature values are similar to that of the utterance speaker are selected using speaker models each described by a GMM (Gaussian Mixture Model: a probabilistic model of observed data described by a mixture of Gaussian distributions). Specifically, the N speakers with the highest acoustic likelihood, which is obtained by inputting an input voice to each speaker model, are selected. Note that selection of speakers is equivalent to selection of the sufficient statistics corresponding to the speakers. In the method disclosed in Non Patent Document 1, each speaker model is created in advance using a 1-state 64-mixture GMM without distinction of the phoneme. Further, the value N is empirically determined, and one arbitrary utterance is used as adaptation data.
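The selection step can be sketched as follows. This is a minimal illustration, assuming diagonal-covariance GMMs and a dictionary of speaker models; the names `gmm_log_likelihood`, `select_top_speakers`, and the model layout `(weights, means, variances)` are hypothetical and not taken from the cited documents.

```python
import math


def gmm_log_likelihood(frames, weights, means, variances):
    """Total log-likelihood of feature frames under a diagonal-covariance GMM."""
    total = 0.0
    for x in frames:
        comp_logs = []
        for w, mu, var in zip(weights, means, variances):
            log_p = math.log(w)
            for xd, md, vd in zip(x, mu, var):
                log_p += -0.5 * (math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd)
            comp_logs.append(log_p)
        # Log-sum-exp over components, for numerical stability.
        peak = max(comp_logs)
        total += peak + math.log(sum(math.exp(c - peak) for c in comp_logs))
    return total


def select_top_speakers(frames, speaker_models, n_select):
    """Return the names of the n_select speakers with the highest likelihood.

    speaker_models maps a speaker name to (weights, means, variances).
    """
    scores = {
        name: gmm_log_likelihood(frames, *model)
        for name, model in speaker_models.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:n_select]
```

Because each speaker is tied to its own sufficient statistics, the returned names can be used directly as keys for looking up the statistics to combine in the final step.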
Lastly, a phoneme model adapted to the utterance speaker is created through statistical processing using the sufficient statistics corresponding to the speakers selected using the speaker model.
[Equation 1]
$$\mu_i^{adp} = \frac{\sum_{j=1}^{N_{sel}} C_{mix}^{j}\,\mu_i^{j}}{\sum_{j=1}^{N_{sel}} C_{mix}^{j}} \qquad (i = 1, 2, \ldots, N_{mix})$$

[Equation 2]
$$\nu_i^{adp} = \frac{\sum_{j=1}^{N_{sel}} C_{mix}^{j}\left(\nu_i^{j} + (\mu_i^{j})^{2}\right)}{\sum_{j=1}^{N_{sel}} C_{mix}^{j}} - (\mu_i^{adp})^{2} \qquad (i = 1, 2, \ldots, N_{mix})$$

[Equation 3]
$$a^{adp}[i][j] = \frac{\sum_{k=1}^{N_{sel}} C_{state}^{k}[i][j]}{\sum_{j=1}^{N_{state}} \sum_{k=1}^{N_{sel}} C_{state}^{k}[i][j]} \qquad (i, j = 1, 2, \ldots, N_{state})$$

where $\mu_i^{adp}$ ($i = 1, \ldots, N_{mix}$) and $\nu_i^{adp}$ ($i = 1, \ldots, N_{mix}$) respectively represent the mean and variance of a Gaussian distribution in each state of the HMM of the adaptive model, and $N_{mix}$ represents the number of mixed distributions. Further, $a^{adp}[i][j]$ ($i, j = 1, \ldots, N_{state}$) represents the transition probability from a state i to a state j, and $N_{state}$ represents the number of states. $N_{sel}$ represents the number of selected speakers, and $\mu_i^{j}$ ($i = 1, \ldots, N_{mix}$, $j = 1, \ldots, N_{sel}$) and $\nu_i^{j}$ ($i = 1, \ldots, N_{mix}$, $j = 1, \ldots, N_{sel}$) respectively represent the mean and variance of a phoneme model of a selected speaker. Furthermore, $C_{mix}^{j}$ ($j = 1, \ldots, N_{sel}$) and $C_{state}^{k}[i][j]$ ($k = 1, \ldots, N_{sel}$, $i, j = 1, \ldots, N_{state}$) respectively represent the EM count in the Gaussian distribution and the EM count relating to the state transition.
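As a rough illustration of Equations 1 to 3, the combination of the selected speakers' statistics can be sketched in Python. The sketch assumes scalar means and variances per mixture and one EM count per selected speaker, as the equations are written; the function names and list layouts are hypothetical.

```python
def combine_gaussian_stats(means, variances, counts):
    """Count-weighted mean and variance per mixture (Equations 1 and 2).

    means[j][i], variances[j][i]: statistics of mixture i for selected speaker j.
    counts[j]: EM count of selected speaker j.
    """
    total = sum(counts)
    n_mix = len(means[0])
    mu_adp = [
        sum(c * m[i] for c, m in zip(counts, means)) / total
        for i in range(n_mix)
    ]
    var_adp = [
        sum(c * (v[i] + m[i] ** 2) for c, m, v in zip(counts, means, variances))
        / total
        - mu_adp[i] ** 2
        for i in range(n_mix)
    ]
    return mu_adp, var_adp


def combine_transitions(state_counts):
    """Transition probabilities from summed EM counts (Equation 3).

    state_counts[k][i][j]: EM count of transition i -> j for selected speaker k.
    """
    n_state = len(state_counts[0])
    summed = [
        [sum(sc[i][j] for sc in state_counts) for j in range(n_state)]
        for i in range(n_state)
    ]
    result = []
    for row in summed:
        row_total = sum(row)
        # Rows with no observed transitions are left as zeros (an assumption).
        result.append([c / row_total if row_total > 0 else 0.0 for c in row])
    return result
```

Equation 2 is the usual pooled-variance identity: the count-weighted second moment of the selected speakers minus the square of the pooled mean.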
The adaptive model creating device disclosed in Patent Document 1 is obtained by modifying the adaptive model creating device disclosed in Non Patent Document 1 so as to prevent the accuracy of the adaptive model from deteriorating in a noisy environment. The adaptive model creating device disclosed in Patent Document 1 includes an accumulation section, a first selection section, a second selection section, and a model creating section. Voice data on which noise is superimposed is grouped based on acoustic similarity, and, for each of the resulting plurality of groups, the accumulation section accumulates sufficient statistics created using the voice data contained in that group. For example, groups are formed for each (noise type×SN ratio), and sufficient statistics for each (speaker×voice variation of speaker) are accumulated in the groups. The first selection section selects a group acoustically similar to voice data of the utterance speaker from among the plurality of groups. The second selection section selects sufficient statistics acoustically similar to the voice data of the utterance speaker from among the sufficient statistics relating to the group selected by the first selection section. The model creating section creates an acoustic model using the sufficient statistics selected by the second selection section.
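The two-stage selection can be sketched as below. The text does not state how acoustic similarity is measured, so this sketch simply assumes that similarity scores (for example, log-likelihoods) have already been computed; `two_stage_select`, `group_scores`, and `stat_scores` are hypothetical names.

```python
def two_stage_select(group_scores, stat_scores, n_select):
    """Pick the most similar group, then the top statistics inside it.

    group_scores: maps a group name to its similarity to the input voice.
    stat_scores: maps a group name to {statistic id: similarity}.
    Assumption: higher scores mean greater acoustic similarity.
    """
    # First selection: the single group closest to the utterance speaker.
    best_group = max(group_scores, key=group_scores.get)
    # Second selection: the n_select closest statistics within that group.
    in_group = stat_scores[best_group]
    chosen = sorted(in_group, key=in_group.get, reverse=True)[:n_select]
    return best_group, chosen
```

Restricting the second stage to one group means that only statistics recorded under a similar noise condition compete with each other, which is the stated purpose of the grouping.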
[Patent Document 1]
    Japanese Patent No. 3756879
[Non Patent Document 1]
    “Unsupervised Phoneme Model Training Based on the Sufficient HMM Statistics from Selected Speakers”, Yoshizawa Shinichi, Baba Akira, Matsunami Kanako, Mera Yuichiro, Yamada Miichi, Lee Akinobu, Shikano Kiyohiro; The Institute of Electronics, Information and Communication Engineers, March 2002, Vol. J85-D-II, No. 3, pages 382-389