1. Field of the Invention
The present invention relates to a voice recognition system that is robust against noises and distortions in a transmission system or the like.
2. Description of the Related Art
Conventionally, in the field of electronic apparatuses such as in-vehicle navigation systems, public attention has been drawn to voice recognition systems that enable man-machine communications and the like. A voice recognition system structured based on the information processing algorithm shown in FIG. 4 is well known.
In this voice recognition system, using a Hidden Markov Model (HMM), an acoustic model (voice HMM) in units of words or subwords (phonemes, syllables, etc.) is prepared. When a voice to be recognized is uttered, an observed value series which is a time series of the cepstrum of the uttered voice is generated, the observed value series is compared with the voice HMM, and the voice HMM with the maximum likelihood is selected and output as the recognition result.
More specifically, a large volume of voice data Rm, experimentally collected and stored in a voice database, is sectioned into frame units of approximately 10 to 20 msec, and a cepstrum calculation is performed successively on the data of the frame units, thereby obtaining the time series of the cepstrum. Further, the time series of the cepstrum is trained as the feature amount of the voice so that the parameters of the acoustic model (voice HMM) reflect the time series, thereby forming the voice HMM in units of words or subwords.
When a voice is actually uttered, voice recognition is performed in the following manner. The data Ra of the uttered voice is input so as to be sectioned into frame units similar to the above-mentioned ones, the observed value series which is the time series of the cepstrum is generated by performing the cepstrum calculation successively on the uttered voice data of the frame units, the observed value series is compared with the voice HMM in units of words or subwords, and the voice HMM with the maximum likelihood with respect to the observed value series is output as the voice recognition result.
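The frame sectioning and successive cepstrum calculation described above can be sketched as follows. This is an illustrative sketch only, not part of the specification; the frame length, hop size, window, and cepstrum order are assumptions chosen for demonstration:

```python
import numpy as np

def cepstrum_frames(signal, frame_len=256, hop=128, n_cep=13):
    """Section a waveform into frames and compute the real cepstrum of each
    frame, yielding the observed value series (time series of the cepstrum)."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        spectrum = np.abs(np.fft.rfft(frame)) + 1e-10   # avoid log(0)
        cep = np.fft.irfft(np.log(spectrum))            # cepstrum = inverse transform of log spectrum
        frames.append(cep[:n_cep])                      # keep the low-order coefficients
    return np.array(frames)
```

In an actual system each row of the returned array would be scored against the voice HMM; here only the feature extraction step is shown.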
However, in collecting the voice data Rm for generating the voice HMM, there are cases where voice data Rm affected by a multiplicative distortion in a microphone, an electric transmission system and the like are collected. It is therefore difficult to generate an accurate voice HMM. Moreover, there are cases where the uttered voice data Ra is adversely affected, when a voice to be recognized is uttered, by an additive noise such as a room noise or a background noise, by the characteristic of spatial transfer from the mouth to the microphone, and by a multiplicative distortion in the microphone, the electric transmission system and the like. Therefore, it is an essential challenge to construct a voice recognition system that is not readily affected by the additive noise and the multiplicative distortion, that is, a robust voice recognition system.
To address this challenge, an HMM combination method has been proposed for the additive noise, and a cepstrum mean normalization (CMN) method has been proposed for the multiplicative distortion.
A voice recognition system to which HMM combination is applied has, as shown in FIG. 5, an acoustic model of a voice (voice HMM) and an acoustic model of an additive noise (noise HMM), and forms a noise added acoustic model (combined HMM) of the voice including an additive noise by combining the voice HMM and the noise HMM, compares the combined HMM with an observed value series generated based on the uttered voice data, and outputs the combined HMM with the maximum likelihood as the voice recognition result.
Here, the voice HMM is formed by sectioning the data Sm of a clean voice including no additive noise into frames, and performing the cepstrum calculation and training.
The noise HMM is formed by sectioning noise data Nm collected from a non-voice section into frames like in the case of the voice HMM, and performing the cepstrum calculation and training.
The combined HMM is formed by adding the voice HMM to the noise HMM in the linear spectrum domain. However, since the voice HMM and the noise HMM are expressed as distributions Sm(cep) and Nm(cep) in the cepstrum domain (cep), it is impossible to obtain the combined HMM directly in the cepstrum domain.
Therefore, first, the distribution Sm(cep) of the voice HMM and the distribution Nm(cep) of the noise HMM are cosine-transformed to distributions Sm(log) and Nm(log) in a logarithmic spectrum domain (log), and the distributions Sm(log) and Nm(log) are exponentially transformed to distributions Sm(lin) and Nm(lin) in a linear spectrum domain (lin). The distribution Nm(lin) is then multiplied by a predetermined coefficient k that depends on the ratio between the average power of the voice Rm in the voice database and the average power of the additive noise Nm, and on the SN ratio of the uttered voice Ra, and the result of the multiplication is added to the distribution Sm(lin), thereby obtaining the distribution Rm(lin)=Sm(lin)+k·Nm(lin) of the noise added voice in the linear spectrum domain. Then, the distribution Rm(lin) of the noise added voice is logarithmically transformed to a distribution Rm(log) in the logarithmic spectrum domain (log) and is inverse-cosine-transformed to obtain the distribution Rm(cep) of the noise added voice in the cepstrum domain (cep), thereby forming the combined HMM.
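The transform chain above (cepstrum, to log spectrum, to linear spectrum, noise addition, and back) can be sketched on mean vectors as follows. This is an illustrative sketch, not the specification's implementation: real systems apply the chain to every Gaussian of the models, and the default value of k here is an arbitrary assumption:

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II matrix; because it is orthogonal, its transpose
    # is the inverse (i.e., the inverse cosine transform).
    j = np.arange(n)
    C = np.cos(np.pi * (2 * j[None, :] + 1) * j[:, None] / (2 * n))
    C[0, :] *= 1 / np.sqrt(2)
    return C * np.sqrt(2.0 / n)

def combine_hmm_means(Sm_cep, Nm_cep, k=0.1):
    """HMM-combination sketch: cep -> log -> lin, add scaled noise, lin -> log -> cep."""
    C = dct_matrix(len(Sm_cep))
    Sm_lin = np.exp(C @ Sm_cep)      # cosine transform, then exponential transform
    Nm_lin = np.exp(C @ Nm_cep)
    Rm_lin = Sm_lin + k * Nm_lin     # Rm(lin) = Sm(lin) + k*Nm(lin)
    return C.T @ np.log(Rm_lin)      # log transform, then inverse cosine transform
```

Setting k=0 recovers the clean voice mean exactly, which is a quick consistency check on the forward and inverse transforms.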
According to this HMM combination, since the actual uttered voice Ra is expressed as the sum Ra(lin)=Sa(lin)+Na(lin) of the clean voice Sa(lin) and the additive noise Na(lin) in the linear spectrum domain (lin), and the noise added voice model (combined HMM) is expressed as the sum Rm(lin)=Sm(lin)+k·Nm(lin) of the clean voice Sm(lin) and the additive noise k·Nm(lin) in the linear spectrum domain (lin), it is considered that the effect of the additive noise can be suppressed when the observed value series Ra(cep) is compared with the distribution Rm(cep) of the combined HMM. The coefficient k is a predetermined constant.
In a voice recognition system to which CMN is applied, as shown in FIG. 6, voice data Rm including a multiplicative distortion is previously collected and stored in a voice database, and by sectioning the voice data Rm into frames and performing the cepstrum calculation and training, the voice HMM is formed. That is, when the multiplicative distortion is Hm and a clean voice including no multiplicative distortion is Sm, the voice HMM is structured as a distribution Rm(cep)=Hm(cep)+Sm(cep) in the cepstrum domain (cep).
Further, the multiplicative distortion Hm(cep) is obtained by averaging the distribution Rm(cep) of the voice HMM for a predetermined time based on the assumption that the cepstrum of the multiplicative distortion can be estimated from the long-time average of the cepstrum of the voice, and the distribution Sm(cep) of the clean voice in the cepstrum domain (cep) is generated by subtracting the multiplicative distortion Hm(cep) from the distribution Rm(cep).
When a voice is actually uttered, by sectioning the data Ra of the uttered voice into frames and performing the cepstrum calculation, the cepstrum Ra(cep)=Sa(cep)+Ha(cep) of the uttered voice in which the actual multiplicative distortion Ha is included in the clean voice Sa is obtained. Further, by averaging the cepstrum Ra(cep) of the uttered voice for a predetermined time based on the assumption that the cepstrum of the multiplicative distortion can be estimated from the long-time average of the cepstrum of the voice, the multiplicative distortion Ha(cep) is obtained. Further, by subtracting the multiplicative distortion Ha(cep) from the cepstrum Ra(cep) of the uttered voice, the cepstrum Sa(cep) of the clean voice Sa is generated. The cepstrum Sa(cep) is compared with the distribution Sm(cep) obtained from the voice HMM, and the voice HMM with the maximum likelihood is output as the recognition result.
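The CMN operation described above (estimating the multiplicative distortion as the long-time cepstral average and subtracting it) can be sketched as follows. This is an illustrative sketch; averaging over all frames stands in for the specification's "predetermined time":

```python
import numpy as np

def cmn(cep_frames):
    """Cepstrum mean normalization: the long-time average of the cepstrum
    estimates the multiplicative distortion H(cep), which is then subtracted
    from every frame to leave the clean-voice cepstrum S(cep)."""
    H_cep = cep_frames.mean(axis=0)        # long-time cepstral average ~ H(cep)
    return cep_frames - H_cep, H_cep
```

Because the multiplicative distortion is a product in the linear spectrum domain, it becomes an additive constant in the cepstrum domain, which is why a simple mean subtraction removes it.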
As described above, according to CMN, since the distribution Sm(cep) in the cepstrum domain (cep) from which the multiplicative distortion Hm(cep) is removed is compared with the cepstrum Sa(cep) of the uttered voice from which the multiplicative distortion Ha(cep) is removed, it is considered that voice recognition robust against multiplicative distortions is possible.
As another voice recognition system using CMN, one having the structure shown in FIG. 7 is known. In this voice recognition system, like in the voice recognition system shown in FIG. 6, the multiplicative distortion Hm(cep) is obtained by averaging the distribution Rm(cep) of the voice HMM for a predetermined time. Further, the cepstrum Ra(cep)=Sa(cep)+Ha(cep) of the uttered voice is obtained, and the multiplicative distortion Ha(cep) is obtained by averaging the cepstrum Ra(cep) of the uttered voice for a predetermined time. Further, the cepstrum Sa(cep) of the clean uttered voice is generated by subtracting the multiplicative distortion Ha(cep) from the cepstrum Ra(cep) of the uttered voice.
Here, the cepstrum Sa(cep)+Hm(cep) including the multiplicative distortion Hm(cep) is generated by adding the multiplicative distortion Hm(cep) obtained from the distribution Rm(cep) of the voice HMM to the cepstrum Sa(cep) of the clean uttered voice, the distribution Rm(cep)=Hm(cep)+Sm(cep) of the voice HMM is compared with the cepstrum Sa(cep)+Hm(cep), and the voice HMM with the maximum likelihood is output as the recognition result.
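The FIG. 7 matching step just described, where the utterance-side distortion is removed and the model-side distortion is added back, can be sketched as follows. This is an illustrative sketch; averaging over all frames stands in for the "predetermined time":

```python
import numpy as np

def cmn_to_model_domain(Ra_cep_frames, Hm_cep):
    """Remove the utterance distortion Ha(cep) by CMN, then add the model-side
    distortion Hm(cep), so the result Sa(cep)+Hm(cep) can be compared directly
    with the voice HMM distribution Rm(cep) = Hm(cep) + Sm(cep)."""
    Ha_cep = Ra_cep_frames.mean(axis=0)    # long-time average ~ Ha(cep)
    Sa_cep = Ra_cep_frames - Ha_cep        # clean uttered cepstrum Sa(cep)
    return Sa_cep + Hm_cep                 # Sa(cep) + Hm(cep)
```

The design choice here is that the comparison happens in the model's distorted domain rather than the clean domain, which avoids modifying the trained voice HMM itself.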
Therefore, in the voice recognition system shown in FIG. 7, like in the voice recognition system shown in FIG. 6, it is considered that voice recognition robust against multiplicative distortions is possible by performing a processing based on the assumption that the cepstrum of the multiplicative distortion can be estimated from the long-time average of the cepstrum of the voice.
Moreover, a voice recognition system is known that is provided with expandability by using both HMM combination and CMN as shown in FIG. 8.
In this voice recognition system, like in the system shown in FIG. 5, an acoustic model of a voice (voice HMM) and an acoustic model of a noise (noise HMM) are formed, and the multiplicative distortion Hm(cep) obtained by averaging the distribution Rm(cep) of the voice HMM in the cepstrum domain (cep) for a predetermined time is subtracted from the distribution Rm(cep), thereby obtaining the distribution Sm(cep) of the voice excluding the multiplicative distortion.
Then, the distribution Sm(cep) of the clean voice in the cepstrum domain and the distribution Nm(cep) of the noise HMM in the cepstrum domain are cosine-transformed to obtain distributions Sm(log) and Nm(log) in the logarithmic spectrum domain, the distributions Sm(log) and Nm(log) are exponentially transformed to obtain distributions Sm(lin) and Nm(lin) in the linear spectrum domain (lin), the distribution Nm(lin) is multiplied by a predetermined coefficient k depending on the SN ratio, and the result of the multiplication is added to the distribution Sm(lin), thereby obtaining the distribution R′m(lin)=Sm(lin)+k·Nm(lin) of the noise added voice.
Then, the distribution R′m(lin) of the noise added voice is logarithmically transformed to a distribution R′m(log) in the logarithmic spectrum domain (log) and is inverse-cosine-transformed to obtain the distribution R′m(cep) of the noise added voice in the cepstrum domain (cep), thereby forming the combined HMM.
That is, the combined HMM is structured as the cepstrum of the noise added voice generated by removing the multiplicative distortion Hm from the voice Rm and adding the additive noise Nm to the voice from which the multiplicative distortion Hm is removed.
When a voice is actually uttered, by sectioning the data Ra of the uttered voice into frames and performing the cepstrum calculation, the cepstrum Ra(cep)=Ha(cep)+R^a(cep) of the uttered voice in which the actual multiplicative distortion Ha and the additive noise Na are included in the clean voice Sa is obtained. Then, by averaging the cepstrum Ra(cep) for a predetermined time, the multiplicative distortion Ha(cep) is obtained, and by subtracting the multiplicative distortion Ha(cep) from the cepstrum Ra(cep) of the uttered voice, the cepstrum R^a(cep) of the uttered voice excluding the multiplicative distortion Ha(cep) is generated. That is, the cepstrum R^a(cep) is the cepstrum of the uttered voice including the additive noise Na and from which the multiplicative distortion Ha is removed.
Then, the cepstrum R^a(cep) is compared with the distribution R′m(cep) of the combined HMM, and the combined HMM with the maximum likelihood is output as the recognition result.
However, in the voice recognition system shown in FIG. 8 to which CMN and HMM combination are applied, although voice recognition is performed by comparing the combined HMM with the cepstrum R^a(cep) of the uttered voice, the combined HMM is not modeled as an appropriate object of comparison with the uttered voice.
That is, when the actually uttered voice Ra includes the multiplicative distortion Ha and the additive noise Na, the uttered voice Ra can be expressed, as the clean uttered voice Sa on which the multiplicative distortion Ha and the additive noise Na are superimposed, as shown by the following equation (1) in the linear spectrum domain (lin):
 Ra(lin)=Ha(lin)·Sa(lin)+Na(lin)
 =Ha(lin)·{Sa(lin)+Na(lin)/Ha(lin)}
 =Ha(lin)·R^a(lin)  (1)
In the voice recognition system shown in FIG. 8, by sectioning the uttered voice Ra expressed as the linear spectrum domain (lin) into frames and performing the cepstrum calculation, the cepstrum Ra(cep) of the uttered voice Ra as shown by the following equation (2) is obtained:
 Ra(cep)=Ha(cep)+R^a(cep)  (2)
Then, by removing the multiplicative distortion Ha(cep) in the cepstrum domain (cep) by CMN, the cepstrum R^a(cep) to be compared with is obtained. The cepstrum R^a(cep) corresponds to the linear spectrum {Sa(lin)+Na(lin)/Ha(lin)} in the equation (1).
In contrast, the combined HMM is generated based on the noise added voice as explained with reference to FIG. 8. The following equation (3) represents the noise added voice expressed in the linear spectrum domain (lin), and the following equation (4) represents the combined HMM expressed in the cepstrum domain (cep):
 R′m(lin)=Sm(lin)+k·Nm(lin)  (3)
 R′m(cep)=IDCT[log{R′m(lin)}]=cep[R′m(lin)]  (4)
The operator log represents logarithmic transformation, the operator IDCT represents inverse cosine transformation, and the operator cep represents the inverse cosine transformation of the logarithmic transformation, that is, IDCT[log{ }].
Contrasting the equation (2) with the equation (4): since the cepstrum R^a(cep) generated based on the uttered voice Ra shown in the equation (2) corresponds to the linear spectrum {Sa(lin)+Na(lin)/Ha(lin)} in the equation (1), it includes a component which is the quotient of the additive noise Na(lin) divided by the multiplicative distortion Ha(lin). In contrast, since the cepstrum R′m(cep) of the combined HMM shown in the equation (4) corresponds to the linear spectrum Sm(lin)+k·Nm(lin) shown in the equation (3), its additive noise Nm(lin) is not divided by any multiplicative distortion.
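The mismatch between the two noise terms can be made concrete with a small numeric sketch. All values below are arbitrary assumptions chosen for illustration (they do not come from the specification), and the model-side clean voice and noise are taken equal to the utterance-side ones (Sm=Sa, Nm=Na) to isolate the effect of the distortion Ha:

```python
import numpy as np

# Arbitrary linear-spectrum values (illustrative assumptions only)
Sa = np.array([2.0, 1.0])      # clean voice Sa(lin)
Na = np.array([0.4, 0.2])      # additive noise Na(lin)
Ha = np.array([0.5, 2.0])      # multiplicative distortion Ha(lin), != 1
k = 1.0                        # combination coefficient

# Utterance side after CMN, per equation (1): R^a(lin) = Sa(lin) + Na(lin)/Ha(lin)
utterance_side = Sa + Na / Ha

# Combined-HMM side, per equation (3) with Sm=Sa, Nm=Na: Sm(lin) + k*Nm(lin)
model_side = Sa + k * Na
```

The noise term on the utterance side is Na/Ha while the model side carries k·Na, so the two sides disagree whenever the distortion Ha differs from 1/k per frequency bin, which is exactly the modeling inconsistency the passage describes.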
Thus, the combined HMM is not appropriately modeled as an object of comparison for recognizing the actual uttered voice.