Voice activity detection technology is widely used for various purposes. In mobile communications, for example, it is used to improve voice transmission efficiency by increasing the compression ratio of non-active voice frames or by omitting transmission of the non-active voice frames. It is also widely used in noise cancellers, echo cancellers, etc. for estimating or determining the noise level in non-active voice frames, and in sound recognition systems (voice recognition systems) for improving performance and reducing workload.
Various devices for detecting active voice segments have been proposed (see Patent Documents 1 and 2, for example). An active voice segment detecting device described in Patent Document 1 extracts active voice frames, calculates a first fluctuation (first variance) by smoothing the voice level, calculates a second fluctuation (second variance) by smoothing fluctuations in the first fluctuation, and judges whether each frame is an active voice frame or a non-active voice frame by comparing the second fluctuation with a threshold value. The threshold value is set in advance. Further, the active voice segment detecting device determines active voice segments (based on the duration of active voice/non-active voice frames) according to the following judgment conditions:
Condition (1): An active voice segment that does not satisfy a minimum necessary duration is not accepted as an active voice segment. The minimum necessary duration will hereinafter be referred to as an “active voice duration threshold”.
Condition (2): A non-active voice segment that is sandwiched between active voice segments and is shorter than the duration below which it is handled as part of a continuous active voice segment is integrated with the active voice segments at both ends to make one active voice segment. This duration will hereinafter be referred to as a “non-active voice duration threshold”, since a segment is regarded as a non-active voice segment if its duration is the non-active voice duration threshold or longer.
Condition (3): A prescribed number of frames that adjoin the starting/finishing end of an active voice segment and have been judged as non-active voice frames due to their low fluctuation values are added to the active voice segment. The prescribed numbers of frames added to the active voice segment will hereinafter be referred to as “starting/finishing end margins”.
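The three judgment conditions above can be illustrated with a minimal sketch that shapes per-frame decisions into segments. The function name `shape_segments`, its parameter names, the order in which the conditions are applied, and the use of a single margin for both ends are assumptions made for illustration; Patent Document 1 is not quoted here as specifying any of them.

```python
def shape_segments(frame_is_voice, min_voice_frames, min_nonvoice_frames, margin_frames):
    """Turn per-frame decisions into (start, end) segments, end exclusive.

    min_voice_frames    : active voice duration threshold, in frames
    min_nonvoice_frames : non-active voice duration threshold, in frames
    margin_frames       : starting/finishing end margin, in frames (assumed equal)
    """
    # Collect runs of consecutive frames judged as active voice.
    segments = []
    start = None
    for i, flag in enumerate(frame_is_voice):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(frame_is_voice)))

    # Condition (1): reject runs shorter than the active voice duration threshold.
    segments = [(s, e) for s, e in segments if e - s >= min_voice_frames]

    # Condition (2): merge neighbours whose gap is shorter than the
    # non-active voice duration threshold.
    merged = []
    for s, e in segments:
        if merged and s - merged[-1][1] < min_nonvoice_frames:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))

    # Condition (3): add the margins at both ends, clipped to the signal length.
    # Segments that come to overlap after this step are left as they are.
    n = len(frame_is_voice)
    return [(max(0, s - margin_frames), min(n, e + margin_frames)) for s, e in merged]


frames = [0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0]
print(shape_segments(frames, min_voice_frames=2, min_nonvoice_frames=3, margin_frames=1))
# → [(3, 12), (15, 20)]
```

In this example the isolated frame at index 1 is rejected by Condition (1), the one-frame gap at index 7 is absorbed by Condition (2), and each surviving segment is widened by one frame under Condition (3).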
An active voice frame detection device described in Patent Document 2 comprises various types of feature quantity calculating units for calculating multiple types of feature quantities for each frame of voice data, a feature quantity integrating unit for calculating an integrated score by weighting the feature quantities, and an active voice frame discriminating unit for discriminating between an active voice frame and a non-active voice frame for each frame of the voice data based on the integrated score. The active voice frame detection device further comprises a reference data storage unit and a labeled data generating unit for preparing labeled data, in which each frame is provided with a label indicating whether the frame is an active voice frame or a non-active voice frame. It also comprises an initialization control unit and a weight updating unit for learning the weights of the multiple types of feature quantities by using the labeled data as learning data so that the discrimination error rate of the active voice frame discriminating unit satisfies a standard. The weight learning is executed so as to reduce the value of a loss function, which defines a loss that increases as the errors in the discrimination between active voice frames and non-active voice frames increase.
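The weighting and weight learning described above can be sketched as follows. This is only an illustration under stated assumptions: the integrated score is taken to be a plain weighted sum, the decision threshold is taken to be 0, and a logistic loss with gradient descent is swapped in as one possible loss that grows with discrimination errors. The form of the loss function used in Patent Document 2 is not given in this text, and all function names are hypothetical.

```python
import math

def integrated_score(weights, features):
    # Weighted sum of the feature quantities of one frame (assumed form).
    return sum(w * f for w, f in zip(weights, features))

def is_active(weights, features, threshold=0.0):
    # A frame is judged active when its integrated score exceeds the threshold.
    return integrated_score(weights, features) > threshold

def logistic_loss(weights, frames, labels):
    # labels: +1 for an active voice frame, -1 for a non-active voice frame.
    # The loss grows as frames are scored further onto the wrong side.
    total = 0.0
    for x, y in zip(frames, labels):
        total += math.log1p(math.exp(-y * integrated_score(weights, x)))
    return total / len(frames)

def update_weights(weights, frames, labels, lr=0.1):
    # One gradient-descent step on the logistic loss (assumed update rule).
    grad = [0.0] * len(weights)
    for x, y in zip(frames, labels):
        margin = y * integrated_score(weights, x)
        coeff = -y / (1.0 + math.exp(margin))
        for k, xk in enumerate(x):
            grad[k] += coeff * xk
    n = len(frames)
    return [w - lr * g / n for w, g in zip(weights, grad)]


# Toy labeled data: two feature quantities per frame (hypothetical values).
frames = [[1.0, 0.2], [0.9, 0.1], [0.1, 0.8], [0.2, 0.9]]
labels = [1, 1, -1, -1]
weights = [0.0, 0.0]
for _ in range(200):
    weights = update_weights(weights, frames, labels, lr=0.5)
print([is_active(weights, x) for x in frames])
```

Repeating the update step lowers the loss on the labeled data, which is the role the weight updating unit plays in the description above. The sketch does not guard against overflow in `math.exp` for very large scores.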
As the voice feature quantities, the active voice frame detection device described in Patent Document 2 employs the amplitude level of the active voice waveform, a zero crossing number (the number of times the signal level crosses 0 in a prescribed time period), spectral information on the sound signal, a GMM (Gaussian Mixture Model) log likelihood, etc.
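Two of the feature quantities listed above can be computed per frame as in the following sketch. The use of the root mean square as the "amplitude level" and the convention that a sample equal to 0 counts as non-negative are assumptions made for illustration; the function names are hypothetical.

```python
import math

def amplitude_level(frame):
    # Root-mean-square amplitude of the samples in one frame (assumed definition).
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def zero_crossing_number(frame):
    # Count sign changes between consecutive samples within the frame.
    # A sample equal to 0 is treated as non-negative (assumed convention).
    count = 0
    for prev, cur in zip(frame, frame[1:]):
        if (prev >= 0) != (cur >= 0):
            count += 1
    return count


print(zero_crossing_number([1, -1, 1, -1]))  # → 3
print(amplitude_level([3, -3, 3, -3]))       # → 3.0
```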
Various feature quantities are also described in Non-patent Documents 1-3. For example, the value of the SNR (Signal to Noise Ratio) is described in paragraph 4.3.3 of Non-patent Document 1 and the average of the SNR is described in paragraph 4.3.5 of Non-patent Document 1. The zero crossing number is described in paragraph B.3.1.4 of Non-patent Document 2. A likelihood ratio employing an active voice GMM and a non-active voice GMM is described in Non-patent Document 3.
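As a rough illustration of two of these feature quantities, the following sketch computes an SNR in decibels and a log likelihood ratio between an active voice GMM and a non-active voice GMM. It assumes a one-dimensional feature and hand-set mixture parameters, and it is not taken from the Non-patent Documents; the exact definitions used there may differ. All names and parameter values are hypothetical.

```python
import math

def snr_db(signal_power, noise_power):
    # Signal to noise ratio expressed in decibels.
    return 10.0 * math.log10(signal_power / noise_power)

def gmm_log_likelihood(x, components):
    # components: list of (weight, mean, variance) for a 1-D Gaussian mixture.
    logs = []
    for weight, mean, var in components:
        log_pdf = -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)
        logs.append(math.log(weight) + log_pdf)
    # Log-sum-exp over the components for numerical stability.
    peak = max(logs)
    return peak + math.log(sum(math.exp(v - peak) for v in logs))

def log_likelihood_ratio(x, voice_gmm, nonvoice_gmm):
    # Positive values favour the active voice model, negative the non-active one.
    return gmm_log_likelihood(x, voice_gmm) - gmm_log_likelihood(x, nonvoice_gmm)


# Hypothetical model parameters, chosen only to make the example readable.
voice_gmm = [(0.5, 4.0, 1.0), (0.5, 6.0, 1.0)]
nonvoice_gmm = [(1.0, 0.0, 1.0)]
print(log_likelihood_ratio(5.0, voice_gmm, nonvoice_gmm) > 0)  # → True
print(log_likelihood_ratio(0.0, voice_gmm, nonvoice_gmm) < 0)  # → True
```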