1. Field of the Invention
The present invention relates to a voice recognition system that is robust with respect to noise and distortions in a transmission system, etc.
2. Description of the Related Art
Conventionally, voice recognition systems that enable man-to-machine communication have attracted attention for use in electronic devices incorporated in automobiles, such as navigation apparatuses. As shown in FIG. 3, a voice recognition system constructed on the basis of an information processing algorithm is known.
The voice recognition system generates in advance an acoustic model (voice HMM) consisting of words or subwords (phonemes, syllables, etc.) by using the Hidden Markov Model (HMM). When a voice Ra to be recognized is uttered, the system generates an observation value series Ra(cep), which is a time series of the cepstrum of the utterance voice Ra, collates the observation value series Ra(cep) with the voice HMM, selects the voice HMM that gives the highest likelihood, and outputs it as the result of recognition.
In further detail, the voice recognition system is provided with a voice HMM generating section 5 that generates the above-described voice HMM in compliance with the HMM method, and the voice HMM generating section 5 comprises a voice database 1, a frame-by-frame fragmenting section 2, a cepstrum operating section 3, and a training section 4.
The frame-by-frame fragmenting section 2 divides a large amount of voice data Rm of test subjects, which has been experimentally collected and stored in the voice database 1, into frames of about 10 to 20 msec each, and the cepstrum operating section 3 computes the cepstrum of each frame, whereby a time series Rm(cep) of the cepstrum is obtained.
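The framing and cepstrum steps described above can be sketched as follows. This is only an illustration under stated assumptions: the 16 kHz sample rate, the synthetic test signal, the Hamming window, the FFT-based real cepstrum, and the helper names `split_into_frames` and `real_cepstrum` are all hypothetical and are not taken from the specification.

```python
import numpy as np

def split_into_frames(signal, sample_rate, frame_ms=20.0):
    """Cut a 1-D signal into consecutive, non-overlapping frames of frame_ms."""
    frame_len = int(sample_rate * frame_ms / 1000.0)
    n_frames = len(signal) // frame_len
    return signal[: n_frames * frame_len].reshape(n_frames, frame_len)

def real_cepstrum(frame, n_coeffs=13, eps=1e-10):
    """Real cepstrum: inverse FFT of the log magnitude spectrum (assumed definition)."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    log_spectrum = np.log(spectrum + eps)  # eps avoids log(0)
    return np.fft.irfft(log_spectrum)[:n_coeffs]

sample_rate = 16000                               # assumed sample rate
t = np.arange(sample_rate) / sample_rate          # one second of synthetic audio
signal = np.sin(2 * np.pi * 220 * t) + 0.1 * np.sin(2 * np.pi * 1800 * t)

frames = split_into_frames(signal, sample_rate, frame_ms=20.0)
cepstra = np.array([real_cepstrum(f) for f in frames])  # time series of cepstra
print(frames.shape, cepstra.shape)
```

Each row of `cepstra` plays the role of one feature vector in the time series Rm(cep).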
Further, the training section 4 trains on the time series Rm(cep) of the cepstrum as a feature of the voice (feature vector) and reflects the result in the parameters of the acoustic model (voice HMM), whereby the voice HMM 6 consisting of words or subwords is generated in advance.
When an utterance is actually made, data Ra of the utterance voice are divided into frames by a frame-by-frame fragmenting section 7 in the same manner as in the frame-by-frame fragmenting section 2, and the cepstrum operating section 8 computes the cepstrum of each frame of utterance voice data in turn, whereby an observation value series Ra(cep), which is a time series of the cepstrum, is generated.
And, a collating section 9 collates the observation value series Ra (cep) with the voice HMM 6 in terms of words or subwords, and outputs the voice HMM which has the highest likelihood with respect to the observation value series Ra (cep), as the results of voice recognition.
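The collation step, in which the model with the highest likelihood is selected, can be sketched as follows. This is a toy illustration only: the two-state models, the shared transition matrix, the unit-variance Gaussian emissions, the two-dimensional features, and the names `word_a` and `word_b` are all assumptions made for the example and do not come from the specification.

```python
import numpy as np

def log_emissions(obs, means, var):
    """Log density of each frame under each state's diagonal Gaussian: shape (T, N)."""
    diff = obs[:, None, :] - means[None, :, :]
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + diff ** 2 / var, axis=-1)

def log_likelihood(obs, log_pi, log_a, means, var):
    """Forward algorithm in the log domain; returns log P(obs | model)."""
    log_b = log_emissions(obs, means, var)
    alpha = log_pi + log_b[0]
    for t in range(1, len(obs)):
        # alpha[i] + log_a[i, j], log-summed over the previous state i
        alpha = np.logaddexp.reduce(alpha[:, None] + log_a, axis=0) + log_b[t]
    return np.logaddexp.reduce(alpha)

# Hypothetical two-state word models that differ only in their state means.
log_pi = np.log(np.array([0.9, 0.1]))
log_a = np.log(np.array([[0.7, 0.3], [0.1, 0.9]]))
word_models = {
    "word_a": np.array([[0.0, 0.0], [1.0, 1.0]]),
    "word_b": np.array([[3.0, 3.0], [4.0, 4.0]]),
}

# Synthetic observation series that follows the state means of word_a.
rng = np.random.default_rng(0)
clean_path = np.array([[0.0, 0.0]] * 3 + [[1.0, 1.0]] * 3)
observation = clean_path + 0.1 * rng.normal(size=clean_path.shape)

scores = {w: log_likelihood(observation, log_pi, log_a, m, var=1.0)
          for w, m in word_models.items()}
recognized = max(scores, key=scores.get)  # model with the highest likelihood
print(recognized)
```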
However, in the voice recognition system shown in FIG. 3, the voice data Rm collected to generate the voice HMM 6 are influenced by multiplicative distortions in the microphone, the electric transmission system, etc., so there is a problem in that it is difficult to generate a sufficiently accurate voice HMM 6.
In addition, when an utterance voice Ra to be recognized is uttered, additive noise such as indoor noise and background noise, and multiplicative distortions such as the spatial transmission characteristics from the mouth to the microphone and the transmission characteristics of the microphone and the electric transmission system, adversely influence the observation value series Ra(cep), so there is a problem in that the voice recognition rate is lowered.
In order to solve these and other problems, it is an essential issue to construct a voice recognition system that is scarcely influenced by the additive noise and multiplicative distortions, that is, a robust voice recognition system.
To cope with the above-described problems, the present inventor has attempted to achieve a robust voice recognition system by applying an HMM combination method to the additive noise and the cepstrum mean normalization (CMN) method to the multiplicative distortions.
FIG. 4 is a block diagram showing a configuration of the voice recognition system. The voice recognition system is provided with a voice HMM 10, an initial noise HMM 17, an initial combination HMM 16 and an adaptive HMM 26, wherein, when a voice to be recognized is uttered, observation value series RNa (cep) being the cepstrum time series, which has been obtained by the uttered voice, and adaptive HMM 26 are collated with each other by a collating section 29 in terms of words or subwords, and the adaptive HMM that has the highest likelihood with respect to the observation value series RNa (cep) is outputted as the results of voice recognition.
Further, since the amount of computation increases if the HMM combination method is applied, a model adaptation method based on the Taylor expansion is employed in order to achieve high-speed processing by decreasing the amount of computation. That is, a Jacobian matrix calculating section 19 is provided that calculates the first-order derivative matrix of the Taylor expansion, which is called the “Jacobian matrix J”, in an attempt to decrease the amount of computation.
The above-described voice HMM 10 is an acoustic model generated in advance by the HMM method using utterance voice Rm that is collected and does not include any additive noise. That is, the voice HMM 10 is generated in advance by processing based on an HMM method similar to that of the voice HMM generating section 5 shown in FIG. 3.
Also, by experimentally collecting utterance voice Rm in an anechoic room, a voice HMM 10 free from influences of the additive noise is generated. However, since influences due to multiplicative distortions in a microphone and electric transmission system, etc., cannot be removed, the voice HMM 10 becomes an acoustic model in which influences due to the multiplicative distortions remain.
Therefore, where it is assumed that the experimentally collected utterance voice Rm consists of clean voice Sm (voice not including any additive noise or multiplicative distortion) and multiplicative distortions Hm, if the utterance voice Rm is expressed in terms of the linear spectral domain (lin), it is expressed by a product in which the linear spectrum of the clean voice Sm is multiplied by the multiplicative distortions Hm, that is, Rm(lin)=Sm(lin)Hm(lin). Also, if it is expressed in terms of the cepstrum domain (cep), it is expressed by a sum of the cepstrum of the clean voice Sm and that of the multiplicative distortions Hm, that is, Rm(cep)=Sm(cep)+Hm(cep).
Further, if the voice HMM 10 is expressed in terms of the linear spectral domain (lin), it is expressed by Rm(lin)=Sm(lin)Hm(lin), and if it is expressed in terms of the cepstrum domain (cep), it is expressed by Rm(cep)=Sm(cep)+Hm(cep).
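The relation between the two domains can be checked numerically. The sketch below assumes that the cepstrum conversion is a logarithm followed by an orthonormal cosine transform; the helper names `dct_matrix`, `cep`, and `cep_inv`, the dimension P, and the random spectra are hypothetical choices for illustration only.

```python
import numpy as np

def dct_matrix(p):
    """Orthonormal DCT-II matrix F (p x p); its inverse is F.T."""
    k = np.arange(p)[:, None]
    n = np.arange(p)[None, :]
    f = np.sqrt(2.0 / p) * np.cos(np.pi * (2 * n + 1) * k / (2 * p))
    f[0, :] /= np.sqrt(2.0)
    return f

P = 8
F = dct_matrix(P)      # plays the role of DCT[ ]: cepstrum -> log spectrum
F_inv = F.T            # plays the role of IDCT[ ]: log spectrum -> cepstrum

def cep(lin):
    """Assumed cepstrum conversion: linear spectrum -> cepstrum."""
    return F_inv @ np.log(lin)

def cep_inv(c):
    """Assumed inverse cepstrum conversion: cepstrum -> linear spectrum."""
    return np.exp(F @ c)

rng = np.random.default_rng(0)
Sm_lin = rng.uniform(0.5, 2.0, P)   # clean voice, linear spectrum
Hm_lin = rng.uniform(0.5, 2.0, P)   # multiplicative distortion, linear spectrum

Rm_lin = Sm_lin * Hm_lin                 # product in the linear spectral domain
Rm_cep = cep(Sm_lin) + cep(Hm_lin)       # sum in the cepstrum domain
print(np.allclose(cep(Rm_lin), Rm_cep))
```

Because the logarithm turns the product into a sum and the cosine transform is linear, the two expressions of Rm agree.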
The above-described initial noise HMM 17 is an acoustic model in which sound (corresponding to the additive noise) in a non-utterance period is collected as the initial noise data Nm, and is trained by using the initial noise data Nm, and the same is generated in advance by a process similar to that in the voice HMM generating section 5 shown in FIG. 3. Therefore, if the initial noise HMM 17 is expressed in terms of the linear spectral domain (lin), it becomes Nm(lin), and if it is expressed in terms of cepstrum domain (cep), it becomes Nm(cep).
The initial combination HMM 16 is generated by the following process.
Voice (acoustic model) in the cepstrum domain (cep) Rm(cep)=Sm(cep)+Hm(cep) is provided from the voice HMM 10 to the mean calculation section 11 and subtracter 12, and the mean calculation section 11 obtains an estimated value Hm^(cep) of the multiplicative distortions by averaging the feature vector in the voice database for training the acoustic model and averaging the mean vectors of the voice HMM by using the CMN method, and provides it to the subtracter 12. Thereby, an operation of Rm(cep)−Hm^(cep) is carried out in the subtracter 12, and the subtracter 12 outputs the voice Sm′(cep) in which the estimated value Hm^(cep) of the multiplicative distortions is removed.
Herein, by making an approximation in which the estimated value Hm^(cep) is almost equal to the multiplicative distortions Hm(cep), it is assumed that voice Sm′(cep) free from any multiplicative distortion has been obtained.
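A minimal sketch of the CMN step follows, assuming that the multiplicative distortion is constant over all frames and that the estimate Hm^(cep) is simply the time average of the cepstrum. The frame count, the dimension, and the random data are hypothetical. The sketch also makes the approximation explicit: the estimate differs from Hm(cep) by the time average of the clean voice cepstrum.

```python
import numpy as np

rng = np.random.default_rng(0)
T, P = 200, 6                                  # assumed frame count and dimension
Sm_cep = rng.normal(0.0, 1.0, size=(T, P))     # clean-voice cepstra, frame by frame
Hm_cep = rng.normal(0.0, 1.0, size=P)          # constant multiplicative distortion

Rm_cep = Sm_cep + Hm_cep                       # Rm(cep) = Sm(cep) + Hm(cep)
Hm_hat_cep = Rm_cep.mean(axis=0)               # CMN estimate Hm^(cep)
Sm_prime_cep = Rm_cep - Hm_hat_cep             # Sm'(cep) = Rm(cep) - Hm^(cep)

# The estimate equals Hm(cep) plus the time average of the clean voice,
# so Hm^(cep) is close to Hm(cep) only when that average is small.
estimation_error = np.max(np.abs(Hm_hat_cep - Hm_cep))
print(estimation_error)
```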
Next, an inverse cepstrum converting section 13 converts the voice Sm′(cep) in the cepstrum domain to voice Sm′(lin) in the linear spectral domain and provides the same to an adder 14, and simultaneously, an inverse cepstrum converting section 18 converts the initial noise Nm(cep) (acoustic model of the initial noise) in the cepstrum domain, which is outputted from the initial noise HMM 17 to an initial noise Nm(lin) in the linear spectral domain and provides the same to the adder 14, whereby the adder 14 generates additive noise added voice Rm′(lin)=Sm′(lin)+Nm(lin) by adding the voice Sm′(lin) to initial noise Nm(lin) in the linear spectral domain, and provides the same to a cepstrum converting section 15.
And, the cepstrum converting section 15 converts the additive noise added voice Rm′(lin) to the additive noise added voice Rm′(cep) in the cepstrum domain, and generates the initial combination HMM 16.
Accordingly, the initial combination HMM 16 is made of an acoustic model that is characterized by the additive noise added voice Rm′(cep). The acoustic model is expressed as described below:

\[
\begin{aligned}
Rm'(\mathrm{cep}) &= \mathrm{cep}\left[\mathrm{cep}^{-1}\left[Sm(\mathrm{cep}) + Hm(\mathrm{cep}) - \widehat{Hm}(\mathrm{cep})\right] + Nm(\mathrm{lin})\right] \\
&= \mathrm{cep}\left[Sm'(\mathrm{lin}) + Nm(\mathrm{lin})\right] \\
&\approx \mathrm{cep}\left[Sm(\mathrm{lin}) + Nm(\mathrm{lin})\right]
\end{aligned}
\tag{1}
\]
Also, in the above described expression, cep[ ] expresses cepstrum conversion that is carried out in the cepstrum converting section 15, and cep−1[ ] expresses inverse cepstrum conversion that is carried out by the inverse cepstrum converting sections 13 and 18.
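The signal path that produces Rm′(cep) can be traced with a small numerical sketch. It handles a single mean vector only, assumes the idealized case in which the estimate Hm^(cep) equals Hm(cep), and assumes a log plus orthonormal cosine transform for the cepstrum conversion. The helper names and the random spectra are hypothetical.

```python
import numpy as np

def dct_matrix(p):
    """Orthonormal DCT-II matrix F (p x p); its inverse is F.T."""
    k = np.arange(p)[:, None]
    n = np.arange(p)[None, :]
    f = np.sqrt(2.0 / p) * np.cos(np.pi * (2 * n + 1) * k / (2 * p))
    f[0, :] /= np.sqrt(2.0)
    return f

P = 6
F = dct_matrix(P)
F_inv = F.T

def cep(lin):
    """Assumed cepstrum conversion cep[ ]."""
    return F_inv @ np.log(lin)

def cep_inv(c):
    """Assumed inverse cepstrum conversion."""
    return np.exp(F @ c)

rng = np.random.default_rng(2)
Sm_lin = rng.uniform(0.5, 2.0, P)   # clean voice
Hm_lin = rng.uniform(0.5, 2.0, P)   # multiplicative distortion
Nm_lin = rng.uniform(0.1, 0.5, P)   # initial noise

Rm_cep = cep(Sm_lin) + cep(Hm_lin)  # mean vector held by the voice HMM 10
Hm_hat_cep = cep(Hm_lin)            # idealized estimate: Hm^(cep) = Hm(cep)
Nm_cep = cep(Nm_lin)                # mean vector held by the initial noise HMM 17

Sm_prime_lin = cep_inv(Rm_cep - Hm_hat_cep)      # subtracter 12, then section 13
Nm_lin_back = cep_inv(Nm_cep)                    # section 18
Rm_prime_cep = cep(Sm_prime_lin + Nm_lin_back)   # adder 14, then section 15

print(np.allclose(Rm_prime_cep, cep(Sm_lin + Nm_lin)))
```

With an imperfect estimate Hm^(cep), the last line of expression (1) holds only approximately.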
Next, a description is given of the functions of the Jacobian matrix calculating section 19. As described above, the Jacobian matrix calculating section 19 is provided in order to reduce the amount of calculation. Where it is assumed that the variation ΔNm(cep)=Na(cep)−Nm(cep) between the additive noise Na(cep) in actual use environments and the initial noise Nm(cep) in the initial noise HMM 17 is slight, the variation in the combined model corresponding to the variation ΔNm(cep) of the noise is obtained by the Taylor expansion, and the initial combination HMM 16 is compensated according to the obtained variation. The acoustic model obtained by the compensation becomes the adaptive HMM 26.
Speaking in further detail, the linear spectrum Rm(lin) is as follows:

\[
Rm(\mathrm{lin}) = Sm(\mathrm{lin}) + Nm(\mathrm{lin}) \tag{2}
\]

where Sm(lin) is the linear spectrum of clean voice Sm not including the multiplicative distortions and additive noise;
Rm(lin) is the linear spectrum of the voice Rm in which no multiplicative distortion is included, but the additive noise is included; and
Nm(lin) is the linear spectrum of the additive noise Nm.
Also, if the voice Rm including the additive noise is expressed in terms of the cepstrum domain:

\[
Rm(\mathrm{cep}) = \mathrm{IDCT}\left[\log\left(\exp\left(\mathrm{DCT}[Sm(\mathrm{cep})]\right) + \exp\left(\mathrm{DCT}[Nm(\mathrm{cep})]\right)\right)\right] \tag{3}
\]
Herein, IDCT[ ] is the inverse discrete cosine transform, DCT[ ] is the discrete cosine transform, log( ) is the logarithm conversion, and exp( ) is the exponential conversion.
Suppose that the clean voice Sm does not vary but the additive noise varies from Nm to Na in the actual utterance environment. The variation ΔRm(cep) in the initial combination model, which is the difference in the cepstrum domain between Rmc(cep), the voice containing Na, and Rm(cep), the voice containing Nm, can be approximated by the first derivative term of the Taylor expansion of the expression (3), as shown in the following expression (4).

\[
\Delta Rm(\mathrm{cep}) = \frac{\partial Rm(\mathrm{cep})}{\partial Nm(\mathrm{cep})}\,\Delta Nm(\mathrm{cep}) = J\left(\Delta Nm(\mathrm{cep})\right) \tag{4}
\]
where ∂Rm(cep)/∂Nm(cep) is the Jacobian matrix and ΔNm(cep)=Na(cep)−Nm(cep) is the difference in the cepstrum domain between the additive noise in the actual utterance environment and the initial noise.
The expression (4) is also expressed as shown in the following expression (5).

\[
\begin{aligned}
Rmc(\mathrm{cep}) &= Rm(\mathrm{cep}) + \frac{\partial Rm(\mathrm{cep})}{\partial Nm(\mathrm{cep})}\left(Na(\mathrm{cep}) - Nm(\mathrm{cep})\right) \\
&= \mathrm{IDCT}\left[\log\left(\exp\left(\mathrm{DCT}[Sm(\mathrm{cep})]\right) + \exp\left(\mathrm{DCT}[Na(\mathrm{cep})]\right)\right)\right]
\end{aligned}
\tag{5}
\]
The element in the i-th row and j-th column of the Jacobian matrix, [J]ij, is calculated by the following expression (6).

\[
[J]_{ij} = \sum_{k=1}^{P} \frac{Rm'(\mathrm{cep})_k}{Nm(\mathrm{cep})_k}\, F^{-1}_{ik}\, F_{kj} \tag{6}
\]

where F_kj is the element in the k-th row and j-th column of the cosine transform matrix and F^{-1}_ik is the element in the i-th row and k-th column of the inverse cosine transform matrix.
Therefore, the Jacobian matrix calculating section 19 can calculate the Jacobian matrix according to the expression (6) in advance by using the additive noise added voice Rm′(lin) in the linear spectral domain that is received from the adder 14 and the initial noise Nm(lin) in the linear spectral domain that is received from the inverse cepstrum converting section 18.
The initial combination HMM 16 is adaptively compensated on the basis of the additive noise that is produced in the actual utterance environment. The variation in the initial combination model can be obtained by multiplying the variation ΔNm(cep) between the additive noises by the Jacobian matrix J. Thus, it is possible to generate an adaptive model by adding the variation in the initial combination model to the initial combination model.
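The first-order compensation can be compared against a full recombination in a short numerical sketch. The Jacobian used here is obtained by differentiating expression (3) directly with the chain rule, which is an assumption of this sketch rather than a transcription of expression (6), and it is meant to be checked against finite differences. The orthonormal cosine transform, the dimension, the random mean vectors, and all helper names are hypothetical.

```python
import numpy as np

def dct_matrix(p):
    """Orthonormal DCT-II matrix F (p x p); its inverse is F.T."""
    k = np.arange(p)[:, None]
    n = np.arange(p)[None, :]
    f = np.sqrt(2.0 / p) * np.cos(np.pi * (2 * n + 1) * k / (2 * p))
    f[0, :] /= np.sqrt(2.0)
    return f

P = 8
F = dct_matrix(P)      # DCT[ ]: cepstrum -> log spectrum
F_inv = F.T            # IDCT[ ]: log spectrum -> cepstrum

def combine(s_cep, n_cep):
    """Expression (3): IDCT[log(exp(DCT[s]) + exp(DCT[n]))]."""
    return F_inv @ np.log(np.exp(F @ s_cep) + np.exp(F @ n_cep))

def jacobian(s_cep, n_cep):
    """Derivative of combine() with respect to n_cep (chain rule on expression (3))."""
    s_lin = np.exp(F @ s_cep)
    n_lin = np.exp(F @ n_cep)
    return F_inv @ np.diag(n_lin / (s_lin + n_lin)) @ F

rng = np.random.default_rng(1)
Sm_cep = rng.normal(0.0, 0.3, P)            # clean voice mean vector
Nm_cep = rng.normal(0.0, 0.3, P)            # initial noise
Na_cep = Nm_cep + rng.normal(0.0, 0.01, P)  # slightly different actual noise

J = jacobian(Sm_cep, Nm_cep)                # computed once, in advance
Rm_cep = combine(Sm_cep, Nm_cep)            # initial combination model
Radp_cep = Rm_cep + J @ (Na_cep - Nm_cep)   # first-order compensation
exact_cep = combine(Sm_cep, Na_cep)         # full recombination, for comparison

max_error = np.max(np.abs(Radp_cep - exact_cep))
print(max_error)
```

The compensation costs one matrix-vector product per mean vector, whereas the full recombination requires two cosine transforms plus exponentials and logarithms. The error grows as the variation between the noises becomes larger.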
Next, a description is given of a process for generating an adaptive HMM 26.
When the utterance start switch (not illustrated) provided in the voice recognition system is turned on by a user, a microphone (not illustrated) collects utterance voices and the frame-by-frame fragmenting section 20 fragments the data Ra of the utterance voices in units of a predetermined duration. Further, the cepstrum operating section 21 processes the data Ra into utterance voice data Ra(cep) in the cepstrum domain (cep).
First, as the user turns on the above-described utterance start switch, a switch element 22 is changed over to the contact “a” side in a non-utterance period until utterance actually starts. Therefore, cepstrum Na(cep) of the background noise (additive noise) Na in an environment where the user attempts to utter is provided into a subtracter 23 through the switch element 22.
The subtracter 23 subtracts the cepstrum Nm(cep) of the initial noise Nm from the cepstrum Na(cep) of the background noise Na and provides the result Na(cep)−Nm(cep) of the subtraction to a multiplier 24, wherein the multiplier 24 multiplies the above-described result Na(cep)−Nm(cep) by the Jacobian matrix J, and provides the result of the multiplication J[Na(cep)−Nm(cep)] to an adder 25. The adder 25 adds the result J[Na(cep)−Nm(cep)] of the multiplication to the acoustic model Rm′(cep) of the initial combination HMM 16 in units of words or subwords, whereby an adaptive HMM 26 that has been adaptively compensated by the background noise Na in actual utterance environments is generated. That is, if the adaptive HMM 26 is expressed in terms of the cepstrum domain (cep), the following expression is established:

\[
\begin{aligned}
Radp(\mathrm{cep}) &= Rm'(\mathrm{cep}) + J\left[Na(\mathrm{cep}) - Nm(\mathrm{cep})\right] \\
&\approx \mathrm{cep}\left[Sm(\mathrm{lin}) + Na(\mathrm{lin})\right]
\end{aligned}
\tag{7}
\]
Also, in the expression (7), cep[ ] expresses cepstrum conversion.
Thus, as the adaptive HMM 26 is generated, the switch element 22 is changed over to the contact “b” side, and utterance voice Ra to be recognized is inputted as the utterance voice Ra(cep) in the cepstrum domain. Herein, if it is assumed that the utterance voice Ra contains components Sa(lin), Ha(lin), and Na(lin) of the linear spectrum of clean voice Sa, multiplicative distortions Ha, and additive noise Na, the utterance voice Ra(cep) in the cepstrum domain is expressed by:
Ra(cep)=cep[Sa(lin)Ha(lin)+Na(lin)].
The mean calculation section 27 obtains an estimated value Ha^(cep) of the multiplicative distortions Ha(cep) by the CMN method, and the subtracter 28 subtracts the estimated value Ha^(cep) from the utterance voice Ra(cep), wherein the result Ra(cep)−Ha^(cep) of the subtraction is provided to the collating section 29 as observation value series RNa(cep).
And, the collating section 29 collates the observation value series RNa(cep) with the adaptive HMM 26 in terms of words or subwords, and outputs an adaptive HMM having the highest likelihood with respect to the observation value series RNa(cep) as the result of recognition. That is, the observation value series RNa(cep) is expressed by the expression below.

\[
\begin{aligned}
RNa(\mathrm{cep}) &= Ra(\mathrm{cep}) - \widehat{Ha}(\mathrm{cep}) \\
&= \mathrm{cep}\left[\frac{Sa(\mathrm{lin})\,Ha(\mathrm{lin})}{\widehat{Ha}(\mathrm{lin})} + \frac{Na(\mathrm{lin})}{\widehat{Ha}(\mathrm{lin})}\right] \\
&\approx \mathrm{cep}\left[Sa(\mathrm{lin}) + \frac{Na(\mathrm{lin})}{\widehat{Ha}(\mathrm{lin})}\right]
\end{aligned}
\tag{8}
\]
The voice recognition is carried out by collating the feature vectors RNa(cep) of the observation value series, which are expressed by the above-described expression (8) with those of the adaptive HMM 26 Radp (cep) expressed by the above-described expression (7).
In the voice recognition system proposed by the present inventor, which is shown in FIG. 4, voice recognition is carried out by collating the adaptive HMM 26 with the observation value series RNa(cep) of the utterance voice. However, there is a problem in that the adaptive HMM 26 is not yet a sufficiently accurate model of the observation value series RNa(cep).
That is, where the above-described expression (7) is compared with the above-described expression (8), the adaptive HMM 26 is characterized by adding the linear spectrum Na(lin) of the additive noise to the linear spectrum Sm(lin) of clean voice and converting the sum into the cepstrum domain. In contrast, the observation value series RNa(cep) is characterized by adding the ratio Na(lin)/Ha^(lin) of the linear spectrum Na(lin) of the additive noise to the linear spectrum Ha^(lin) of the multiplicative distortions to the linear spectrum Sa(lin) of clean voice and converting the sum into the cepstrum domain.
Therefore, the adaptive HMM 26 is not a model from which the influences of the multiplicative distortions are completely removed. Consequently, when the collating section 29 collates the adaptive HMM 26 with the observation value series RNa(cep), cases occur in which the adaptive HMM 26 does not adequately model the observation value series RNa(cep). As a result, there is a problem in that the voice recognition rate is not improved.
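The mismatch between expression (7) and expression (8) can be made concrete with a small numerical sketch. It assumes an idealized case in which the clean voice of the model and of the utterance coincide (Sm = Sa) and the CMN estimate is exact (Ha^ = Ha), so that any remaining difference comes only from the term Na(lin)/Ha^(lin). The four-bin spectra and the helper names are hypothetical.

```python
import numpy as np

def dct_matrix(p):
    """Orthonormal DCT-II matrix F (p x p); its inverse is F.T."""
    k = np.arange(p)[:, None]
    n = np.arange(p)[None, :]
    f = np.sqrt(2.0 / p) * np.cos(np.pi * (2 * n + 1) * k / (2 * p))
    f[0, :] /= np.sqrt(2.0)
    return f

F = dct_matrix(4)
F_inv = F.T

def cep(lin):
    """Assumed cepstrum conversion cep[ ]."""
    return F_inv @ np.log(lin)

def model_and_observation(S_lin, Na_lin, Ha_lin):
    """Return the model side of (7) and the observation side of (8)."""
    model = cep(S_lin + Na_lin)             # adaptive HMM: cep[Sm(lin) + Na(lin)]
    Ra_cep = cep(S_lin * Ha_lin + Na_lin)   # utterance: cep[Sa Ha + Na]
    observation = Ra_cep - cep(Ha_lin)      # CMN with the exact estimate Ha^ = Ha
    return model, observation

S_lin = np.array([1.0, 2.0, 1.5, 0.8])      # clean voice (Sm = Sa assumed)
Na_lin = np.array([0.5, 0.4, 0.6, 0.3])     # additive noise
Ha_lin = np.array([2.0, 0.5, 1.5, 0.8])     # multiplicative distortion

model, observation = model_and_observation(S_lin, Na_lin, Ha_lin)
mismatch = np.max(np.abs(observation - model))

# Without multiplicative distortion (Ha = 1) the two sides coincide.
model_flat, observation_flat = model_and_observation(S_lin, Na_lin, np.ones(4))
mismatch_flat = np.max(np.abs(observation_flat - model_flat))
print(mismatch, mismatch_flat)
```

Even in this idealized case the two sides differ whenever Ha(lin) is not unity, which is the discrepancy described above.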