1. Field of the Invention
The present invention relates to a voice recognition system, and more particularly to a voice recognition system with improved precision of voice section detection. As used herein, voice recognition means speech recognition.
2. Description of the Related Art
In a voice recognition system, when voice uttered in a noisy environment, for example, is directly subjected to voice recognition, the voice recognition ratio may be degraded by the influence of noise. Therefore, it is first of all important to detect the voice section correctly before performing voice recognition.
The conventional, well-known voice recognition system that detects the voice section using a vector inner product is configured as shown in FIG. 4.
This voice recognition system creates an acoustic model (voice HMM) in units of words or subwords (e.g., phonemes or syllables), employing an HMM (Hidden Markov Model). When the voice to be recognized is uttered, the system produces a series of observed values, that is, a time series of Cepstrum, from the input signal, collates the series of observed values with the voice HMMs, and selects the voice HMM with the maximum likelihood, which is then output as the recognition result.
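The maximum-likelihood selection described above can be sketched as follows. This is a hedged, illustrative sketch only: the `recognize` function and the stand-in scoring lambdas are hypothetical names, and a real system would score the observed Cepstrum sequence against each word or subword HMM (e.g., by the Viterbi or forward algorithm) rather than use fixed scores.

```python
import math

def recognize(observations, models):
    """Return the label of the model scoring the observation sequence highest.

    `models` maps a word label to a scoring function that is assumed to
    return the log-likelihood of the observation sequence under that HMM.
    """
    best_label, best_score = None, -math.inf
    for label, log_likelihood in models.items():
        score = log_likelihood(observations)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy usage with stand-in (hypothetical) scoring functions:
models = {
    "hello": lambda obs: -5.0,
    "world": lambda obs: -2.0,
}
result = recognize([[0.1, 0.2]], models)
print(result)  # "world", since -2.0 > -5.0
```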
More specifically, a large quantity of voice data Sm collected and stored in a training voice database is partitioned into frames of a predetermined period (about 10 to 20 msec), and a time series of Cepstrum is acquired by performing the Cepstrum operation on the data of each frame in succession. This time series of Cepstrum is then trained as a feature quantity of the voice and reflected in the parameters of the acoustic model (voice HMM), whereby the voice HMM in units of words or subwords is produced.
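The frame partitioning described above can be sketched as follows. This is a minimal illustration under assumed values (16 kHz sample rate, 20 msec non-overlapping frames); the Cepstrum operation itself is elided, and the function name `frame_signal` is hypothetical.

```python
def frame_signal(samples, sample_rate=16000, frame_ms=20):
    """Split a sample sequence into non-overlapping frames of `frame_ms` ms."""
    frame_len = sample_rate * frame_ms // 1000  # samples per frame (320 here)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

frames = frame_signal(list(range(16000)))  # one second of dummy samples
print(len(frames))      # 50 frames of 20 msec each at 16 kHz
print(len(frames[0]))   # 320 samples per frame
```

In practice, frames usually overlap (e.g., a 10 msec shift) and are windowed before the spectral analysis; the fixed step above keeps the sketch short.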
Also, a voice section detection section for detecting the voice section comprises acoustic analysis sections 1 and 3, an eigenvector generation section 2, an inner product operation section 4, a comparison section 5, and a voice extraction section 6.
Herein, the acoustic analysis section 1 performs acoustic analysis of the voice data Sm in the training voice database for every frame n to generate an M-dimensional feature vector xn = [xn1 xn2 xn3 . . . xnM]^T, where T denotes transposition.
The eigenvector generation section 2 generates a correlation matrix R, represented by the following expression (1), from the M-dimensional feature vectors xn, and the correlation matrix R is subjected to eigenvalue decomposition by solving the following expression (2) to obtain an eigenvector (called a trained vector) V.
R = (1/N) Σ_{n=1}^{N} xn xn^T        (1)

(R − λ_k I) v_k = 0        (2)

where k = 1, 2, 3, . . . , M; I denotes a unit matrix; and 0 denotes a zero vector.
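Expressions (1) and (2) can be sketched numerically as follows for the two-dimensional case (M = 2), using only the standard library; a practical system would use a numerical library for general M. The function names are illustrative, and the closed-form eigensolver below is valid only for a symmetric 2×2 matrix.

```python
import math

def correlation_matrix(vectors):
    """R = (1/N) * sum of xn xn^T for 2-D feature vectors xn, per expression (1)."""
    n = len(vectors)
    r = [[0.0, 0.0], [0.0, 0.0]]
    for x in vectors:
        for i in range(2):
            for j in range(2):
                r[i][j] += x[i] * x[j] / n
    return r

def principal_eigenvector(r):
    """Solve (R - lambda I) v = 0 for symmetric 2x2 R, per expression (2);
    return the unit eigenvector of the largest eigenvalue (the trained vector V)."""
    a, b, d = r[0][0], r[0][1], r[1][1]
    lam = (a + d) / 2 + math.sqrt(((a - d) / 2) ** 2 + b * b)  # largest root
    v = (b, lam - a) if abs(b) > 1e-12 else (1.0, 0.0)
    norm = math.hypot(v[0], v[1])
    return (v[0] / norm, v[1] / norm)

vecs = [(1.0, 2.0), (2.0, 4.0), (1.5, 3.0)]  # all along the direction (1, 2)
V = principal_eigenvector(correlation_matrix(vecs))
print(V)  # close to (1, 2)/sqrt(5), i.e. about (0.447, 0.894)
```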
Thus, the trained vector V is calculated beforehand on the basis of the training voice data Sm. When input signal data Sa is actually produced by an uttered voice, the acoustic analysis section 3 analyzes the input signal Sa to generate a feature vector A. The inner product operation section 4 calculates the inner product V^T A of the trained vector V and the feature vector A. Further, the comparison section 5 compares the inner product value V^T A with a fixed threshold θ, and if the inner product value V^T A is greater than the threshold θ, the voice section is determined.
The voice extraction section 6 is turned on (conductive) during the voice section determined as described above, extracts data Svc for voice recognition from the input signal Sa, and generates a series of observed values to be collated with the voice HMM.
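The inner-product decision and the gating performed by the voice extraction section can be sketched as follows. This is a hedged illustration: frames whose feature vector A satisfies V^T A > θ are judged to be voice and passed on, and the names used here are assumptions, not the patent's notation for any concrete implementation.

```python
def dot(v, a):
    """Inner product V^T A of two equal-length vectors."""
    return sum(vi * ai for vi, ai in zip(v, a))

def extract_voice_frames(feature_vectors, trained_vector, theta=0.0):
    """Keep only the frames whose inner product with V exceeds the threshold."""
    return [a for a in feature_vectors if dot(trained_vector, a) > theta]

V = (0.6, 0.8)                                     # trained vector (unit length)
frames = [(1.0, 1.0), (-1.0, -0.5), (0.5, 2.0)]    # per-frame feature vectors
print(extract_voice_frames(frames, V))  # keeps the two frames aligned with V
```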
Incidentally, with the conventional method for detecting the voice section using the vector inner product, the threshold θ is fixed at zero (θ = 0), and if the inner product value V^T A between the feature vector A of the input signal Sa obtained under the actual environment and the trained vector V is greater than the fixed threshold θ, the voice section is determined.
Therefore, consider the relation, in the linear spectral domain, among the feature vector of the noise (noise vector) contained in the input signal obtained under the actual environment, the feature vector of the proper voice (voice vector), the feature vector A of the input signal, and the trained vector V. In the case where the voice is uttered against a less noisy background, the noise vector is small and the voice vector is dominant, as shown in FIG. 5A, whereby the feature vector A points in the same direction as the voice vector and the trained vector V.
Accordingly, the inner product value V^T A between the feature vector A and the trained vector V is a positive (plus) value, whereby the fixed threshold θ (= 0) can be employed as the determination criterion to detect the voice section.
However, in a place where there is much noise and the S/N ratio is low, for example, in the passenger compartment of a vehicle, the noise vector is dominant and the voice vector is relatively small, so that the feature vector A of the input signal obtained under the actual environment points in a direction opposite to the voice vector and the trained vector V, as shown in FIG. 5B. Accordingly, the inner product value V^T A between the feature vector A and the trained vector V is a negative (minus) value, whereby there is the problem that the fixed threshold θ (= 0) cannot be employed as the determination criterion to detect the voice section correctly.
In other words, if voice recognition is performed in a place where there is much noise and the S/N ratio is low, the inner product value V^T A between the feature vector A and the trained vector V is a negative value (V^T A < θ) even when the voice section should be determined, resulting in the problem that the voice section cannot be correctly detected, as shown in FIG. 5C.
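The failure mode described in FIGS. 5A through 5C can be illustrated numerically as follows. This is a hedged sketch with assumed toy values: when the noise vector dominates, the observed feature vector A = voice + noise can point away from the trained vector V, driving V^T A below the fixed threshold θ = 0 even though voice is present.

```python
def dot(v, a):
    """Inner product V^T A of two equal-length vectors."""
    return sum(vi * ai for vi, ai in zip(v, a))

V = (0.6, 0.8)                # trained (voice-direction) vector
voice = (0.3, 0.4)            # small voice component along V
quiet_noise = (0.05, -0.05)   # weak noise: high S/N case (FIG. 5A)
loud_noise = (-1.0, -0.8)     # dominant noise opposing V: low S/N case (FIG. 5B)

A_quiet = tuple(s + n for s, n in zip(voice, quiet_noise))
A_noisy = tuple(s + n for s, n in zip(voice, loud_noise))

print(dot(V, A_quiet) > 0)  # True: V^T A > 0, voice section detected
print(dot(V, A_noisy) > 0)  # False: V^T A < 0 despite voice being present
```

This is precisely why a threshold fixed at zero fails under a low S/N ratio: the sign of the inner product is governed by the noise, not by the presence of voice.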