Voice activity detection technology is used for a wide range of purposes. In mobile communications, for example, it is used to improve voice transmission efficiency by increasing the compression ratio of non-active voice segments or by omitting transmission of those segments. The technology is also widely used in noise cancellers, echo cancellers, and the like for estimating or determining the noise level in non-active voice segments, and in sound recognition systems (voice recognition systems) for improving performance and reducing workload.
Various devices for detecting active voice segments have been proposed (see Patent Documents 1 and 2, for example). The active voice segment detecting device described in Patent Document 1 extracts active voice frames, calculates a first fluctuation (first variance) by smoothing the voice level, calculates a second fluctuation (second variance) by smoothing fluctuations in the first fluctuation, and judges whether each frame is an active voice frame or a non-active voice frame by comparing the second fluctuation with a threshold value. The device then determines active voice segments, based on the durations of the runs of active voice frames and non-active voice frames, according to the following judgment conditions:
Condition (1): An active voice segment that does not satisfy a minimum necessary duration is not accepted as an active voice segment. This minimum necessary duration will hereinafter be referred to as the “active voice duration threshold”.
Condition (2): A non-active voice segment that is sandwiched between active voice segments and is shorter than the duration below which the surrounding speech is handled as one continuous active voice segment is integrated with the active voice segments on both sides to form a single active voice segment. This duration will hereinafter be referred to as the “non-active voice duration threshold”, since a segment is regarded as a non-active voice segment if its duration is equal to or longer than this threshold.
Condition (3): A prescribed number of frames that adjoin the starting or finishing end of an active voice segment and have been judged as non-active voice frames due to their low fluctuation values are added to the active voice segment. The prescribed numbers of frames added to the active voice segment will hereinafter be referred to as the “starting/finishing end margins”.
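Conditions (1) to (3) above can be illustrated with a minimal sketch that shapes per-frame judgments into segments. This is not the implementation of Patent Document 1. The function names, the representation of a segment as a half-open pair of frame indices, the use of a single margin value for both ends, and the order in which the conditions are applied (condition (2), then (1), then (3)) are all assumptions made for illustration.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # half-open range [start, end) in frame indices


def frames_to_segments(labels: List[bool]) -> List[Segment]:
    """Collect runs of consecutive active frames as [start, end) pairs."""
    segments: List[Segment] = []
    start = None
    for i, active in enumerate(labels):
        if active and start is None:
            start = i
        elif not active and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(labels)))
    return segments


def shape_segments(labels: List[bool],
                   active_duration_threshold: int,
                   non_active_duration_threshold: int,
                   margin: int) -> List[Segment]:
    """Apply conditions (2), (1), (3) in that (assumed) order."""
    segments = frames_to_segments(labels)

    # Condition (2): a gap shorter than the non-active voice duration
    # threshold is integrated with the active segments on both sides.
    merged: List[Segment] = []
    for seg in segments:
        if merged and seg[0] - merged[-1][1] < non_active_duration_threshold:
            merged[-1] = (merged[-1][0], seg[1])
        else:
            merged.append(seg)

    # Condition (1): reject segments shorter than the active voice
    # duration threshold.
    kept = [(s, e) for s, e in merged if e - s >= active_duration_threshold]

    # Condition (3): add starting/finishing end margins, clipped to the
    # bounds of the input.
    n = len(labels)
    return [(max(0, s - margin), min(n, e + margin)) for s, e in kept]
```

In this sketch all durations are counted in frames. Segments whose margins come to overlap are not merged again; whether they should be is not stated in the description above.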
In the active voice segment detecting device described in Patent Document 1, the threshold value used for judging whether each frame is an active voice frame or a non-active voice frame and the parameters regarding the above conditions (the active voice duration threshold, the non-active voice duration threshold, etc.) are preset values.
Meanwhile, the active voice segment detection device described in Patent Document 2 employs, as voice feature quantities, the amplitude level of the active voice waveform, a zero crossing number (the number of times the signal level crosses zero in a prescribed time period), spectral information on the sound signal, a GMM (Gaussian Mixture Model) log likelihood, etc.
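Two of the feature quantities named above can be sketched for a single frame of samples as follows. This is only an illustration under stated assumptions, not the method of Patent Document 2: the function names are hypothetical, the amplitude level is taken here to be the root-mean-square value of the frame, and a sample equal to zero is treated as non-negative when counting crossings. The spectral information and the GMM log likelihood are not covered.

```python
import math
from typing import Sequence


def zero_crossing_count(frame: Sequence[float]) -> int:
    """Count sign changes between consecutive samples in one frame.

    Assumption: a sample equal to zero is treated as non-negative.
    """
    count = 0
    for prev, cur in zip(frame, frame[1:]):
        if (prev >= 0) != (cur >= 0):
            count += 1
    return count


def amplitude_level(frame: Sequence[float]) -> float:
    """Root-mean-square amplitude of one frame (assumed definition)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(x * x for x in frame) / len(frame))
```

Here the frame corresponds to the prescribed time period over which zero crossings are counted.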