Unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
One function of a Voice Activity Detector (VAD) is to detect the presence or absence of human speech in regions of an audio signal recorded by a microphone. VAD plays a role in many speech processing systems in which the input signal is processed differently depending on whether the VAD module decides that speech is present. In these applications, the accuracy and robustness of the VAD may affect overall system performance. For example, voice communication systems commonly use DTX (discontinuous transmission) to improve bandwidth usage efficiency. In such a system, the VAD decides whether speech is present in the input signal, and transmission of the speech signal is stopped when speech is not present. Here, misclassification of speech as disturbance may result in speech drop-off in the transmitted signal and affect its intelligibility. As another example, a speech enhancement system generally needs to estimate the level of the disturbance signal in the recorded signal. This is usually done with the help of a VAD, where the disturbance level is estimated from the regions that contain only the disturbance signal. See, for example, A. M. Kondoz, Digital Speech Coding for Low Bit Rate Communication Systems, ch. 11 (John Wiley & Sons, 2004). In this case, an inaccurate VAD may lead to overestimation or underestimation of the disturbance level, which may eventually lead to suboptimal speech enhancement quality.
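The dependence of the disturbance-level estimate on the VAD decision can be illustrated with a minimal sketch. It assumes per-frame energies and per-frame VAD flags are already available; the function name `estimate_disturbance_level` and the use of a plain average are illustrative assumptions, not a method taken from the cited reference.

```python
def estimate_disturbance_level(frame_energies, speech_flags):
    """Average the energy of frames that the VAD marked as non-speech.

    frame_energies: per-frame energy values (linear scale).
    speech_flags:   per-frame VAD decisions, True meaning "speech present".
    Returns None when no frame was marked as non-speech.
    Hypothetical helper: a plain mean is assumed here for simplicity.
    """
    disturbance_only = [
        energy
        for energy, is_speech in zip(frame_energies, speech_flags)
        if not is_speech
    ]
    if not disturbance_only:
        return None
    return sum(disturbance_only) / len(disturbance_only)


# Three disturbance-only frames and one louder speech frame.
energies = [1.0, 1.0, 9.0, 1.0]

# Correct VAD decisions: the speech frame is excluded from the estimate.
correct = estimate_disturbance_level(energies, [False, False, True, False])

# Speech misclassified as disturbance: the estimate is inflated.
missed = estimate_disturbance_level(energies, [False, False, False, False])
```

With these illustrative numbers, the missed speech frame raises the estimate from 1.0 to 3.0, which is the kind of overestimation described above.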
Various VAD systems have been previously proposed. See, for example, A. M. Kondoz, Digital Speech Coding for Low Bit Rate Communication Systems, ch. 10 (John Wiley & Sons, 2004). Some of these systems exploit statistical differences between the target speech and the disturbance, and rely on threshold comparison to differentiate the target speech from the disturbance signals. The statistical measurements that have previously been used in these systems include energy levels, timing, pitch, zero crossing rates, periodicity measurements, etc. More sophisticated systems combine more than one statistical measurement to further improve the accuracy of the detection results. In general, statistical methods achieve good performance when the target speech and the disturbance have clearly distinct statistical features, for example when the disturbance level is steady and lies below the level of the target speech. In a more adverse environment, however, maintaining good performance becomes very challenging, in particular when the ratio of target signal level to disturbance level is low or the disturbance signal has speech-like characteristics.
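A threshold-comparison VAD of the kind described above can be sketched as follows, combining two of the listed measurements (energy level and zero crossing rate). The function names, the frame length, and both threshold values are assumptions chosen for illustration; they are not taken from the cited reference.

```python
import math


def frame_energy_db(frame):
    """Mean-square energy of one frame in dB, floored to avoid log(0)."""
    mean_square = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(max(mean_square, 1e-12))


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def simple_vad(frame, energy_threshold_db=-30.0, zcr_threshold=0.25):
    """Flag a frame as speech when it is loud enough and not noise-like.

    Both thresholds are illustrative assumptions.
    """
    return (
        frame_energy_db(frame) > energy_threshold_db
        and zero_crossing_rate(frame) < zcr_threshold
    )


# 20 ms frames at an assumed 8 kHz sampling rate (160 samples).
voiced_like = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(160)]
quiet_disturbance = [0.001, -0.001] * 80   # low level, rapidly alternating
loud_disturbance = [0.5, -0.5] * 80        # high level, rapidly alternating
```

In this sketch the energy test alone would accept `loud_disturbance`; adding the zero crossing rate test rejects it, which illustrates why combining measurements can improve detection accuracy.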
VAD in combination with a microphone array can also be found in some robust adaptive beamforming system designs. See, for example, O. Hoshuyama, B. Begasse, A. Sugiyama, and A. Hirano, “A real time robust adaptive microphone array controlled by an SNR estimate,” Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, 1998. Such VADs are based on the difference in the levels of different outputs of the microphone beamforming system, where the target signal is present in only one output and is blocked in the other outputs. The effectiveness of such a VAD design may thus depend on the capability of the beamforming system to block the target signal in those outputs, which may be expensive to achieve in real-life systems.
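The level-difference decision described above can be sketched in a few lines. This is a simplified illustration and not the method of the cited paper: it assumes two beamformer output frames are already available (one passing the target, one intended to block it), and the names `level_db` and `beam_level_difference_vad` and the 6 dB threshold are hypothetical.

```python
import math


def level_db(frame):
    """Mean-square level of one frame in dB, floored to avoid log(0)."""
    mean_square = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(max(mean_square, 1e-12))


def beam_level_difference_vad(target_output, blocked_output, threshold_db=6.0):
    """Declare speech when the target-passing output exceeds the
    target-blocking output by more than threshold_db (assumed value)."""
    return level_db(target_output) - level_db(blocked_output) > threshold_db


target_output = [0.5, -0.5] * 80

# Good blocking: little target signal reaches the blocked output.
well_blocked = [0.05, -0.05] * 80

# Poor blocking: most of the target signal leaks into the blocked output.
leaky_blocked = [0.4, -0.4] * 80
```

With good blocking the level difference is about 20 dB and speech is detected; with the leaky output the difference falls to about 2 dB and the same speech frame is missed, which reflects the dependence on blocking capability noted above.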
Other references that may be pertinent to this background, but which are not to be considered prior art to the example inventive embodiments that will be described in the sections following, include:
Reference No. 1: A. M. Kondoz, Digital Speech Coding for Low Bit Rate Communication Systems, ch. 10, John Wiley & Sons, 2004;
Reference No. 2: A. M. Kondoz, Digital Speech Coding for Low Bit Rate Communication Systems, ch. 11, John Wiley & Sons, 2004;
Reference No. 3: J. G. Ryan and R. A. Goubran, “Optimal nearfield responses for microphone array,” in Proc. IEEE Workshop Applicat. Signal Processing to Audio Acoust., New Paltz, N.Y., USA, 1997;
Reference No. 4: O. Hoshuyama, B. Begasse, A. Sugiyama, and A. Hirano, “A real time robust adaptive microphone array controlled by an SNR estimate,” Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, 1998;
Reference No. 5: US20030228023A1/WO03083828A1/CA2479758AA, Multichannel voice detection in adverse environments; and
Reference No. 6: U.S. Pat. No. 7,174,022, Small array microphone for beam-forming and noise suppression.