When automatic speech recognition technologies are used in a real-world environment, the period in which a target speech signal is present should first be estimated from an acoustic signal that contains noise together with the target speech signal, and the noise should then be removed.
FIG. 22 shows, as a conventional voice activity detection apparatus 900, a functional configuration for implementing the conventional voice activity detection method disclosed in Non-Patent Literature 1; its operation is described briefly below. The voice activity detection apparatus 900 includes an acoustic signal analyzer 90, a speech state probability to non-speech state probability ratio calculator 95, and a voice activity detection unit 96. The acoustic signal analyzer 90 includes an acoustic feature extraction unit 91, a probability estimation unit 92, a parameter storage 93, and a GMM (Gaussian mixture model) storage 94. The parameter storage 93 includes an initial noise probabilistic model estimation buffer 930 and a noise probabilistic model estimation buffer 931. The GMM storage 94 includes a silence GMM storage 940 and a clean-speech GMM storage 941, which store a silence GMM and a clean-speech GMM, respectively, both generated beforehand.
The acoustic feature extraction unit 91 extracts an acoustic feature Ot of a digital acoustic signal At containing a speech signal and a noise signal. As the acoustic feature, a logarithmic mel spectrum or a cepstrum can be used, for example. The probability estimation unit 92 generates a non-speech GMM and a speech GMM adapted to a noise environment, by using a silence GMM and a clean-speech GMM, and calculates the non-speech probabilities of all the Gaussian distributions in the non-speech GMM and the speech probabilities of all the Gaussian distributions in the speech GMM, corresponding to the input acoustic feature Ot.
The speech state probability to non-speech state probability ratio calculator 95 calculates a speech state probability to non-speech state probability ratio by using the non-speech probabilities and the speech probabilities. The voice activity detection unit 96 judges from the speech state probability to non-speech state probability ratio whether the input acoustic signal is in a speech state or in a non-speech state and outputs just the acoustic signal Ds in the speech state, for example.
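The flow from probability estimation to the speech/non-speech judgment can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the method of Non-Patent Literature 1: noise adaptation of the GMMs is omitted, the ratio is taken frame by frame as a log-likelihood ratio, and the `(weight, mean, var)` model layout, the function names, the threshold, and the toy numbers are all hypothetical. What it does share with the conventional method is that every Gaussian distribution in each GMM contributes to the result.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of log values."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def gmm_log_likelihood(x, gmm):
    """Log likelihood of x under a GMM, summing over ALL of its Gaussians.

    gmm is a list of (weight, mean, var) tuples (hypothetical layout).
    """
    return log_sum_exp(
        [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
    )

def is_speech(x, speech_gmm, non_speech_gmm, threshold=0.0):
    """Judge one frame from the speech to non-speech log-likelihood ratio."""
    log_ratio = (gmm_log_likelihood(x, speech_gmm)
                 - gmm_log_likelihood(x, non_speech_gmm))
    return log_ratio > threshold

# Toy one-dimensional models; the numbers are made up for illustration only.
speech_gmm = [(0.5, [4.0], [1.0]), (0.5, [6.0], [1.0])]
non_speech_gmm = [(0.5, [0.0], [1.0]), (0.5, [1.0], [1.0])]
```

Calling `is_speech([5.0], speech_gmm, non_speech_gmm)` classifies a feature lying between the two speech means, while `is_speech([0.5], speech_gmm, non_speech_gmm)` classifies one lying between the two non-speech means.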
In the conventional voice activity detection method, all of the Gaussian distributions in the GMMs are used to estimate a speech period, as described above. All of the Gaussian distributions are used because all of them are considered to be important. This idea is shown as methods of voice activity detection and noise suppression in Non-Patent Literature 2, for example. The idea of using all Gaussian distributions is clearly indicated also by the following expression (1) for calculating the filter gain of a noise suppression filter, given in Non-Patent Literature 2.
$$\hat{G}_{t,1} = \sum_{j=0}^{1} \alpha_{j,t} \sum_{k=1}^{K} p(k \mid O_{t,j})\, \hat{G}_{t,j,k,1} \qquad (1)$$
Here, p(k|Ot,j) is the output probability of the k-th Gaussian distribution, and K represents the total number of Gaussian distributions.
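As a rough illustration of how expression (1) combines all K Gaussian distributions of both models, the double sum can be written as the following Python sketch for a single frame t. The function name, the list layout, and the numeric values are hypothetical; the text does not define the index j or the weights α, so the sketch simply treats them as two models with given weights.

```python
def combined_gain(alpha, posteriors, gains):
    """Evaluate the double sum of expression (1) for one frame.

    alpha[j]         : weight alpha_{j,t} of model j (j = 0, 1)
    posteriors[j][k] : p(k | O_{t,j}) for Gaussian k of model j
    gains[j][k]      : per-Gaussian gain estimate of model j, Gaussian k
    """
    total = 0.0
    for a_j, post_j, gain_j in zip(alpha, posteriors, gains):
        # Inner sum over every Gaussian k; none is skipped.
        total += a_j * sum(p * g for p, g in zip(post_j, gain_j))
    return total

# Made-up values with K = 2, for illustration only.
alpha = [0.3, 0.7]
posteriors = [[0.5, 0.5], [0.9, 0.1]]
gains = [[0.2, 0.4], [0.8, 1.0]]
gain = combined_gain(alpha, posteriors, gains)
```

Because the inner sum runs over every k, the cost of evaluating the gain grows with the total number of Gaussian distributions, even when most of them carry a negligible probability.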