1. Field
Embodiments of the invention relate in general to voice activity detection and, more specifically, to discriminating between event types, such as speech and noise.
2. Background
Voice activity detection (VAD) is an essential part of many speech processing tasks, such as speech coding, hands-free telephony and speech recognition. For example, in mobile communication the transmission bandwidth over the wireless interface is considerably reduced when the mobile device detects the absence of speech. A second example is an automatic speech recognition (ASR) system. VAD is important in ASR because of restrictions regarding memory and accuracy. Inaccurate detection of the speech boundaries causes serious problems such as degradation of recognition performance and deterioration of speech quality.
VAD has attracted significant interest in speech recognition. In general, two major approaches are used for designing such a system: threshold comparison techniques and model-based techniques. In the threshold comparison approach, a variety of features, such as energy, zero crossings and autocorrelation coefficients, are extracted from the input signal and then compared against thresholds. Some approaches can be found in the following publications: Li, Q., Zheng, J., Zhou, Q., and Lee, C.-H., “A robust, real-time endpoint detector with energy normalization for ASR in adverse environments,” Proc. ICASSP, pp. 233-236, 2001; L. R. Rabiner, et al., “Application of an LPC Distance Measure to the Voiced-Unvoiced-Silence Detection Problem,” IEEE Trans. on ASSP, vol. ASSP-25, no. 4, pp. 338-343, August 1977.
The thresholds are usually estimated from noise-only portions of the signal and updated dynamically. The performance of threshold comparison techniques can be improved by using adaptive thresholds or appropriate filtering. See, for example, Martin, A., Charlet, D., and Mauuary, L., “Robust Speech/Nonspeech Detection Using LDA applied to MFCC,” Proc. ICASSP, pp. 237-240, 2001; Monkowski, M., Automatic Gain Control in a Speech Recognition System, U.S. Pat. No. 6,314,396; and Lie Lu, Hong-Jiang Zhang, H. Jiang, “Content Analysis for Audio Classification and Segmentation,” IEEE Trans. Speech & Audio Processing, Vol. 10, No. 7, pp. 504-516, October 2002.
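By way of illustration only, a threshold comparison technique with a dynamically updated threshold can be sketched as follows. This sketch is not taken from any of the cited publications. The function names (frame_energy_db, energy_vad), the parameter values (init_noise_frames, margin_db, alpha), and the assumption that the first few frames contain noise only are all hypothetical choices made for the example.

```python
import math
from typing import List, Sequence


def frame_energy_db(frame: Sequence[float]) -> float:
    """Mean-square energy of one frame in dB, floored to avoid log(0)."""
    power = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(max(power, 1e-12))


def energy_vad(
    frames: Sequence[Sequence[float]],
    init_noise_frames: int = 5,   # assumption: leading frames are noise only
    margin_db: float = 6.0,       # assumption: speech exceeds noise by this margin
    alpha: float = 0.95,          # assumption: smoothing factor for the noise estimate
) -> List[bool]:
    """Return one speech/non-speech decision per frame.

    The threshold is the running noise-floor estimate plus a fixed margin.
    The noise floor is initialised from the leading frames and is updated
    only on frames classified as non-speech.
    """
    energies = [frame_energy_db(f) for f in frames]
    if not energies:
        return []

    n_init = min(init_noise_frames, len(energies))
    noise_db = sum(energies[:n_init]) / n_init

    decisions: List[bool] = []
    for i, energy in enumerate(energies):
        # Frames used for initialisation are treated as non-speech.
        is_speech = i >= n_init and energy > noise_db + margin_db
        if not is_speech:
            # Dynamic threshold update: track slow changes in the noise level.
            noise_db = alpha * noise_db + (1.0 - alpha) * energy
        decisions.append(is_speech)
    return decisions
```

Calling energy_vad on a list of fixed-length sample frames yields a list of booleans, one per frame. Because the decision relies on energy alone, a loud noise burst is marked as speech just as readily as a spoken word.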
Alternatively, model-based VAD techniques have been widely introduced to reliably distinguish speech from other complex environmental sounds. Some approaches can be found in the following publications: J. Ajmera, I. McCowan, “Speech/Music Discrimination Using Entropy and Dynamism Features in a HMM Classification Framework,” IDIAP-RR 01-26, IDIAP, Martigny, Switzerland 2001; and T. Hain, S. Johnson, A. Tuerk, P. Woodland, S. Young, “Segment Generation and Clustering in the HTK Broadcast News Transcription System”, DARPA Broadcast News Transcription and Understanding Workshop, pp. 133-137, 1998. Features such as full-band energy, sub-band energy, linear prediction residual energy or frequency-based features like Mel Frequency Cepstral Coefficients (MFCC) are usually employed in such systems.
VAD techniques based on threshold adaptation and energy features fail to handle the complex acoustic situations encountered in many real-life applications, where the signal energy level is usually highly dynamic and background sounds such as music and non-stationary noise are common. As a consequence, noise events are often recognized as words, causing insertion errors, while speech events corrupted by neighbouring noise events cause substitution errors. Model-based VAD techniques work better in noisy conditions, but their dependency on a single language (since they encode phoneme-level information) reduces their functionality considerably.
The environment type plays an important role in VAD accuracy. For instance, in a car environment, accurate detection is possible when the car is stationary, because high signal-to-noise ratio (SNR) conditions are commonly encountered. Voice activity detection remains a challenging problem when the SNR is very low, in which case it is common to have high-intensity semi-stationary background noise from the car engine and high transient noises such as road bumps, wiper noise and door slams. Voice activity detection is likewise challenging in other situations where the SNR is low and both background noise and high transient noises are present.
It is therefore highly desirable to develop a VAD method or system that performs well in a variety of environments and in which robustness and accuracy are important considerations.