In recent years, the widespread use of the Internet and similar networks has brought increased network capacity and reduced communication costs. As a result, it has become possible to collect many recognition models (reference models) over such networks. For example, in speech recognition, many speech recognition models distributed by various research institutions (such as a child model, an adult model, an elderly model, an in-vehicle model, and a cell-phone model) can now be downloaded via the Internet. Also, owing to network connections between devices, a speech recognition model used by a car navigation system or the like can now be downloaded to a television or a personal computer. As for intention interpretation, it has become possible to collect, via the network, recognition models that have learned the experiences of various people living in different places.
Moreover, owing to the development of recognition technology, recognition models are used by a wide variety of devices, such as personal computers, television remote controls, cellular phones, and car navigation systems, each of which has different specifications for CPU power, memory capacity, and so on. They are also used for a wide variety of applications that have different requirements: for example, an application requiring high recognition precision for security purposes, and an application requiring a rapid recognition result, as when an operation is performed using a television remote control.
Furthermore, the recognition technology is used in many environments in which recognition objects are different. For example, the speech recognition technology is used in many environments, such as where voices of children, adults, and the elderly are to be recognized and where voices in a vehicle or on a cellular phone are to be recognized.
In view of these changes in the social environment, it is preferable to effectively utilize the many available recognition models (reference models) to create, in a short period of time, a high-precision recognition model (standard model) suited to the specifications of apparatuses and applications and to the usage environment, and to provide it to the user.
In the field of pattern recognition such as speech recognition, a method that employs a probability model as a standard recognition model has received attention in recent years. Particularly, a hidden Markov model (referred to as an HMM hereafter) and a Gaussian mixture distribution model (referred to as a GMM hereafter) are widely used. Meanwhile, as to the intention interpretation, attention has been given in recent years to a method that employs a probability model as a standard recognition model representing intention, knowledge, preference, etc. Particularly, a Bayesian net and the like are widely used. In the field of data mining, attention has been given to a method that employs a probability model as a representative model for each category in order to classify data, and the GMM and the like are widely used for this. In the field of authentication such as speech authentication, fingerprint authentication, face authentication, and iris authentication, a method employing a probability model as a standard authentication model has received attention, and the GMM and the like are used. As a learning algorithm of a standard model represented by an HMM, the Baum-Welch re-estimation method is widely used (see, for example, Hijiri Imai, “Speech Recognition (Onsei Ninshiki)”, Kyoritsu Shuppan Kabushikigaisha, Nov. 25, 1995, pp. 150-152). As a learning algorithm of a standard model represented by a GMM, the EM (Expectation-Maximization) algorithm is widely used (see, for example, Hiro Furui, “Speech Information Processing (Onsei Jouhou Shori)”, Morikita Shuppan Kabushikigaisha, Jun. 30, 1998, pp. 100-104). According to the EM algorithm, the standard model is expressed as follows.
\sum_{m=1}^{M_f} \omega_f(m)\, f(x; \mu_f(m), \sigma_f(m)^2)   (Equation 1)

Here,

f(x; \mu_f(m), \sigma_f(m)^2) \quad (m = 1, 2, \ldots, M_f)   (Equation 2)

represents a Gaussian distribution, and

x = (x(1), x(2), \ldots, x(J)) \in \mathbb{R}^J   (Equation 3)

represents input data of J (\geq 1) dimensions. The mixture weighting coefficients showing a statistic, represented as

\omega_f(m) \quad (m = 1, 2, \ldots, M_f),   (Equation 4)

the mean values in J (\geq 1) dimensions, represented as

\mu_f(m) = (\mu_f(m,1), \mu_f(m,2), \ldots, \mu_f(m,J)) \in \mathbb{R}^J \quad (m = 1, 2, \ldots, M_f),   (Equation 5)

and the variances in J (\geq 1) dimensions (the diagonal elements of the covariance matrix), represented as

\sigma_f(m)^2 = (\sigma_f(m,1)^2, \sigma_f(m,2)^2, \ldots, \sigma_f(m,J)^2) \in \mathbb{R}^J \quad (m = 1, 2, \ldots, M_f),   (Equation 6)

are repeatedly re-estimated one or more times for learning so as to maximize or locally maximize, on the basis of the N sets of learning data represented as

x[i] = (x(1)[i], x(2)[i], \ldots, x(J)[i]) \in \mathbb{R}^J \quad (i = 1, 2, \ldots, N),   (Equation 7)

the likelihood with respect to the learning data, represented as

\log P = \sum_{i=1}^{N} \log\!\left[ \sum_{m=1}^{M_f} \omega_f(m)\, f(x[i]; \mu_f(m), \sigma_f(m)^2) \right].   (Equation 8)

For such calculations, the following re-estimation equations are used:

\omega_f(m) = \frac{\sum_{i=1}^{N} \gamma(x[i], m)}{\sum_{k=1}^{M_f} \sum_{i=1}^{N} \gamma(x[i], k)} \quad (m = 1, 2, \ldots, M_f);   (Equation 9)

\mu_f(m,j) = \frac{\sum_{i=1}^{N} \gamma(x[i], m)\, x(j)[i]}{\sum_{i=1}^{N} \gamma(x[i], m)} \quad (m = 1, 2, \ldots, M_f,\ j = 1, 2, \ldots, J);   (Equation 10)

and

\sigma_f(m,j)^2 = \frac{\sum_{i=1}^{N} \gamma(x[i], m)\, (x(j)[i] - \mu_f(m,j))^2}{\sum_{i=1}^{N} \gamma(x[i], m)} \quad (m = 1, 2, \ldots, M_f,\ j = 1, 2, \ldots, J).   (Equation 11)

Here,

\gamma(x[i], m) = \frac{\omega_f(m)\, f(x[i]; \mu_f(m), \sigma_f(m)^2)}{\sum_{k=1}^{M_f} \omega_f(k)\, f(x[i]; \mu_f(k), \sigma_f(k)^2)} \quad (m = 1, 2, \ldots, M_f).   (Equation 12)

Moreover, a method such
as the Bayes estimation method has been suggested (see, for example, Kazuo Shigemasu, "Introduction to Bayesian Statistics (Beizu Toukei Nyumon)", Tokyo Daigaku Shuppankai, Apr. 30, 1985, pp. 42-53). In each of these learning algorithms, including the Baum-Welch re-estimation method, the EM algorithm, and the Bayes estimation method, a standard model is created by calculating the parameters (statistics) of the standard model so as to maximize or locally maximize the probability (likelihood) with respect to the learning data. In other words, these learning methods realize a mathematical optimization: the maximization or local maximization of the likelihood.
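As an illustrative sketch only (not part of the conventional technologies discussed here), the re-estimation of Equations 9 through 12 can be written out for a one-dimensional GMM as follows. The function and variable names are ours, and NumPy is assumed to be available.

```python
import numpy as np

def gaussian(x, mu, var):
    """Gaussian density evaluated element-wise (Equation 2, one dimension)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def em_step(x, w, mu, var):
    """One EM iteration for a 1-D GMM, following Equations 9 through 12.

    x: (N,) learning data; w, mu, var: (M,) mixture parameters.
    """
    # E-step: responsibilities gamma(x[i], m)  (Equation 12)
    dens = w[None, :] * gaussian(x[:, None], mu[None, :], var[None, :])  # (N, M)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate the statistics (Equations 9-11)
    Nm = gamma.sum(axis=0)                    # effective count per mixture
    w_new = Nm / Nm.sum()                     # Equation 9
    mu_new = (gamma * x[:, None]).sum(axis=0) / Nm              # Equation 10
    var_new = (gamma * (x[:, None] - mu_new[None, :]) ** 2).sum(axis=0) / Nm  # Equation 11
    return w_new, mu_new, var_new

def log_likelihood(x, w, mu, var):
    """Log-likelihood of the data under the current model (Equation 8)."""
    dens = w[None, :] * gaussian(x[:, None], mu[None, :], var[None, :])
    return np.log(dens.sum(axis=1)).sum()
```

Each iteration of `em_step` is guaranteed not to decrease the likelihood of Equation 8, which is the property the text above refers to as maximization or local maximization.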
When the above-stated learning methods are used to create a standard model for speech recognition, it is preferable to learn the standard model from a large number of sets of speech data in order to cope with variations in acoustic characteristics, such as different speakers and noise conditions. When these methods are used for intention interpretation, it is preferable to learn the standard model from a large number of sets of data in order to cope with variations in speakers and circumstances. Likewise, when these methods are used for iris authentication, it is preferable to learn the standard model from a large number of sets of iris image data in order to cope with variations in sunlight and in the position and rotation of the camera. However, processing such a large number of data sets requires an immense amount of time, so the standard model cannot be provided to the user in a short period of time. In addition, the cost of accumulating such a great amount of data becomes enormous, and if the data is collected via a network, the communication cost becomes enormous as well.
Meanwhile, there is a suggested method by which a standard model is created by synthesizing a plurality of models (hereafter, a model prepared for reference in creating a standard model is referred to as a “reference model”). The reference model is a probability distribution model where: a number of sets of learning data is expressed by population parameters (mean, variance, etc.) of a probability distribution; and characteristics of a number of sets of learning data are integrated by a small number of parameters (population parameters). In the conventional technologies described below, the model is represented by the Gaussian distribution.
According to a first conventional method, a reference model is represented by a GMM, and a standard model is created by synthesizing GMMs of a plurality of the reference models by their weights (this technology is disclosed in Japanese Laid-Open Patent Application No. 4-125599, for example).
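A minimal sketch of this kind of synthesis, under the assumption that each reference model is a tuple of component weights, means, and variances (this representation and the function names are ours, not taken from the cited application):

```python
def synthesize_gmms(reference_models, model_weights):
    """Combine several reference GMMs into one standard model by weighting
    and concatenating their components.

    reference_models: list of (weights, means, variances) tuples.
    model_weights: per-model weights summing to 1.
    Note: the mixture count of the result is the SUM of the reference
    mixture counts, which is the drawback this method is criticized for below.
    """
    std_w, std_mu, std_var = [], [], []
    for (w, mu, var), alpha in zip(reference_models, model_weights):
        for k in range(len(w)):
            std_w.append(alpha * w[k])  # scale component weight by model weight
            std_mu.append(mu[k])
            std_var.append(var[k])
    return std_w, std_mu, std_var
```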
According to a second conventional method, in addition to the first conventional method, a standard model is created by learning a mixture weight combined linearly through maximization or local maximization of the probability (likelihood) with respect to learning data (this technology is disclosed in Japanese Laid-Open Patent Application No. 10-268893, for example).
According to a third conventional method, a standard model is created by expressing mean values of the standard model using linear combination of mean values of reference models, and then learning a linear combination coefficient by maximizing or locally maximizing the probability (likelihood) with respect to input data. Here, speech data of a specific speaker is used as the learning data, and the standard model is used as a speaker adaptive model for speech recognition (see, for example, M. J. F. Gales, “Cluster Adaptive Training for Speech Recognition”, 1998, Proceedings of ICSLP98, pp. 1783-1786).
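The construction used in this method can be sketched as follows. This is a simplified illustration with hypothetical names; the closed-form least-squares step merely stands in for the likelihood maximization described above, with which it coincides only under simplifying assumptions (unit variances and fixed component alignment).

```python
import numpy as np

def adapted_means(cluster_means, lam):
    """Adapted means as a linear combination of reference (cluster) means:
    mu = sum_c lam[c] * cluster_means[c].

    cluster_means: (C, M, J) array of C reference models' component means.
    lam: (C,) linear combination coefficients.
    """
    return np.tensordot(lam, cluster_means, axes=1)  # -> (M, J)

def estimate_lambda(cluster_means, target_means):
    """Simplified illustration: with unit variances and hard component
    assignments, the likelihood-maximizing coefficients solve a
    least-squares problem between the combined means and the
    data-derived component means."""
    C, M, J = cluster_means.shape
    A = cluster_means.reshape(C, M * J).T   # (M*J, C) design matrix
    b = target_means.reshape(M * J)
    lam, *_ = np.linalg.lstsq(A, b, rcond=None)
    return lam
```

Only the C coefficients `lam` are learned, which is why the text below notes that the parameters to be learned are limited to a linear combination coefficient of the mean values.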
According to a fourth conventional technology, a reference model is represented by a single Gaussian distribution. A standard model is created by synthesizing the Gaussian distributions of a plurality of reference models and then integrating the Gaussian distributions belonging to the same class through clustering (see Japanese Laid-Open Patent Application No. 9-81178, for example).
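Integrating the Gaussian distributions belonging to the same class is commonly done by moment matching; the following one-dimensional sketch shows that standard merging rule (it is assumed here for illustration, not quoted from the cited application):

```python
def merge_gaussians(weights, means, variances):
    """Merge several 1-D Gaussians of one cluster into a single Gaussian
    by moment matching: the merged mean is the weighted mean, and the
    merged variance accounts for both within- and between-component spread."""
    total = sum(weights)
    mu = sum(w * m for w, m in zip(weights, means)) / total
    var = sum(w * (v + (m - mu) ** 2)
              for w, m, v in zip(weights, means, variances)) / total
    return total, mu, var
```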
According to a fifth conventional technology, a plurality of reference models are represented by Gaussian mixture distributions having the same number of mixtures, and a serial number is assigned to each Gaussian distribution on a one-to-one basis. A standard model is created by synthesizing the Gaussian distributions having the same serial number. The plurality of reference models to be synthesized are created based on speakers who are acoustically similar to the user, and the standard model to be created is a speaker adaptive model (see, for example, Yoshizawa and six others, "Unsupervised Method for Learning Phonological Model using Sufficient Statistic and Speaker Distance (Jubuntoukeiryo To Washakyori Wo Mochiita Onin Moderu No Kyoushi Nashi Gakushuhou)", the Institute of Electronics, Information and Communication Engineers, Mar. 1, 2002, Vol. J85-D-II, No. 3, pp. 382-389).
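A sketch of such index-wise synthesis, assuming equal weighting of the reference models and moment matching per index (the actual combination weights of the cited method are an assumption here):

```python
def synthesize_by_index(reference_models):
    """All reference GMMs must share the same mixture count M; component k
    of the standard model is built only from component k of every reference
    model, by simple moment matching with equal model weights.

    reference_models: list of (weights, means, variances) tuples.
    """
    n = len(reference_models)
    M = len(reference_models[0][0])
    std_w, std_mu, std_var = [], [], []
    for k in range(M):
        w = sum(rm[0][k] for rm in reference_models) / n
        mu = sum(rm[1][k] for rm in reference_models) / n
        var = sum(rm[2][k] + (rm[1][k] - mu) ** 2 for rm in reference_models) / n
        std_w.append(w); std_mu.append(mu); std_var.append(var)
    return std_w, std_mu, std_var
```

Note that the sketch fails outright if the reference models have different mixture counts or no consistent component numbering, which is exactly the limitation pointed out below.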
Using the first conventional technology, however, the number of mixtures of the standard model is increased along with an increase in the number of the reference models to be synthesized. Thus, the storage capacity and amount of recognition processing for the standard model become enormous, and this is impractical. In addition, the number of mixtures of the standard model cannot be controlled in accordance with the specifications. This problem is considered to become prominent with an increase in the number of the reference models to be synthesized.
Using the second conventional technology, the number of mixtures of the standard model is increased along with an increase in the number of the reference models to be synthesized. Thus, the storage capacity and amount of recognition processing for the standard model become enormous, and this is impractical. In addition, the number of mixtures of the standard model cannot be controlled in accordance with the specifications. Moreover, since the standard model is a simple mixed sum of the reference models and a parameter to be learned is limited to a mixture weighting, a high-precision standard model cannot be created. Furthermore, since the learning is performed using great amounts of learning data to create the standard model, it requires a long period of learning time. These problems are considered to become prominent with an increase in the number of the reference models to be synthesized.
Using the third conventional technology, a parameter to be learned is limited to a linear combination coefficient of the mean values of the reference models. For this reason, a high-precision standard model cannot be created. Moreover, since the learning is performed using great amounts of learning data to create the standard model, it requires a long period of learning time.
Using the fourth conventional technology, clustering is heuristically performed and, therefore, it is difficult to create a high-precision standard model. Moreover, the precision of the reference model is low due to the single Gaussian distribution, and the precision of the standard model that is created by integrating such models is also low. The problem related to the recognition precision is considered to become prominent with an increase in the number of the reference models to be synthesized.
Using the fifth conventional technology, the standard model is created by synthesizing the Gaussian distributions having the same serial number. In general, however, the Gaussian distributions to be synthesized do not always correspond one-to-one in an optimum standard model, so the precision of recognition decreases. Moreover, when a plurality of the reference models have different numbers of mixtures, the standard model cannot be created. Furthermore, a serial number is generally not assigned to the Gaussian distributions of a reference model, and in that case the standard model cannot be created either. In addition, the number of mixtures of the standard model cannot be controlled in accordance with the specifications.