Voice Activity Detection (VAD) is the art of detecting the presence of speech activity in noisy audio signals that are supplied to a microphone of a communication system. VAD systems are used in many signal processing systems for telecommunication. For example, in the Global System for Mobile communication (GSM), traffic handling capacity is increased by having the speech coders employ VAD as part of an implementation of the Discontinuous Transmission (DTX) principle, as described in the GSM specifications (particularly in GSM 06.10: full rate speech transcoding; and in GSM 06.31: Discontinuous Transmission (DTX) for full rate speech traffic channel, May 1994). In noise suppression systems, such as in spectral subtraction based methods, VAD is used for indicating when to start noise estimation (and noise parameter adaptation). In noisy speech recognition, VAD is also used to improve the noise robustness of a speech recognition system by adding the right amount of noise estimate to the reference templates.
Next generation GSM hands-free functions are planned that will integrate a noise reduction algorithm for high quality voice transmission through the GSM network. A crucial component for a successful background noise reduction algorithm is a robust voice activity detection algorithm. The GSM-VAD algorithm has been chosen for use in the next generation hands-free noise suppression algorithms to detect the presence or absence of speech activity in the noisy audio signal coming from the microphone. If one designates s(n) as a pure speech signal, and v(n) as the background noise signal, then the microphone signal samples, x(n), during speech activity will be:

x(n) = s(n) + v(n), (I)

and the microphone signal samples during periods of no speech activity will be:

x(n) = v(n). (II)
The detection of states (I) and (II) described in the above equations is not trivial, especially when the signal-to-noise ratio (SNR) of x(n) is low, as occurs in a car environment while driving on a highway.
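The difficulty at low SNR can be illustrated with a minimal sketch of states (I) and (II). It is not part of the GSM VAD algorithm: the sine wave standing in for s(n), the Gaussian noise standing in for v(n), and the helper names mix_at_snr, frame_energy_db and energy_gap_db are all assumptions made for illustration. The sketch measures how far the frame energy of x(n) = s(n) + v(n) lies above that of x(n) = v(n).

```python
import math
import random

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so speech power / noise power equals snr_db, then add."""
    p_s = sum(s * s for s in speech) / len(speech)
    p_v = sum(v * v for v in noise) / len(noise)
    gain = math.sqrt(p_s / (p_v * 10 ** (snr_db / 10)))
    scaled = [gain * v for v in noise]
    return [s + v for s, v in zip(speech, scaled)], scaled

def frame_energy_db(frame):
    """Mean power of a frame in dB (small offset avoids log of zero)."""
    return 10 * math.log10(sum(x * x for x in frame) / len(frame) + 1e-12)

rng = random.Random(0)
N = 1600  # 200 ms at an assumed 8 kHz sampling rate
speech = [math.sin(2 * math.pi * 200 * i / 8000) for i in range(N)]  # stand-in for s(n)
noise = [rng.gauss(0.0, 1.0) for _ in range(N)]                      # stand-in for v(n)

def energy_gap_db(snr_db):
    """Energy of state (I), x(n) = s(n) + v(n), minus energy of state (II), x(n) = v(n)."""
    x_speech, v = mix_at_snr(speech, noise, snr_db)
    return frame_energy_db(x_speech) - frame_energy_db(v)

for snr in (20, 10, 0, -5):
    print(f"SNR {snr:>3} dB -> energy gap {energy_gap_db(snr):.1f} dB")
```

As the SNR falls, the energy gap between the two states shrinks toward zero, which is why a detector that relies on energy alone becomes unreliable in a noisy car environment.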
The GSM VAD algorithm generates information flags indicating which state the current frame of audio signal is classified in. Detection of the above two states is useful in spectral subtraction algorithms, which estimate characteristics of background noise in order to improve the signal to noise ratio without the speech signal being distorted. See, for example, S. F. Boll, "Suppression of Acoustic Noise in Speech Using Spectral Subtraction", IEEE Trans. on ASSP, pp. 113-120, vol. ASSP-27 (1979); J. Makhoul & R. McAulay, Removal of Noise From Noise-Degraded Speech Signals, National Academy Press, Washington, D.C. (1989); A. Varga, et al., "Compensation Algorithms for HMM Based Speech Recognition Algorithms", Proceedings of ICASSP-88, pp. 481-485, vol. 1 (1988); and P. Handel, "Low Distortion Spectral Subtraction for Speech Enhancement", Proceedings of EUROSPEECH Conf., pp. 1549-1553, ISSN 1018-4074 (1995).
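How a VAD flag can gate noise estimation in a spectral subtraction scheme may be sketched as follows. This is a simplified power-spectrum illustration, not the method of any of the cited references: the function names, the smoothing factor alpha and the spectral floor are hypothetical choices made for this example.

```python
def update_noise_estimate(noise_psd, frame_psd, speech_present, alpha=0.9):
    """Adapt the noise power estimate only in frames flagged as noise-only."""
    if speech_present:
        return list(noise_psd)  # state (I): freeze the estimate
    # state (II): exponential averaging toward the current frame
    return [alpha * n + (1 - alpha) * p for n, p in zip(noise_psd, frame_psd)]

def subtract_noise(frame_psd, noise_psd, floor=0.01):
    """Subtract the noise estimate per frequency bin, keeping a small spectral floor."""
    return [max(p - n, floor * p) for p, n in zip(frame_psd, noise_psd)]

# Usage: three frequency bins, one noise-only frame followed by one speech frame.
noise_psd = [1.0, 1.0, 1.0]
noise_psd = update_noise_estimate(noise_psd, [2.0, 1.0, 0.5], speech_present=False)
cleaned = subtract_noise([10.0, 1.0, 0.5], noise_psd)
```

If the flag is wrong and speech frames are treated as noise, the noise estimate absorbs speech energy and the subtraction distorts the speech, which is why the reliability of the VAD decision matters.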
The GSM VAD algorithm utilizes an autocorrelation function (ACF) and periodicity information obtained from a speech coder for its operation. As a consequence, it is necessary to run the speech coder before any noise suppression can be performed. This situation is illustrated in FIG. 1. The digitized microphone signal samples, x(n), are supplied to a speech coder 101, which in turn generates autocorrelation coefficients (ACF) and long term predictor lag values (pitch information), N_p, as specified by GSM 06.10. The ACF and N_p signals are supplied to a VAD 103. The VAD 103 generates a VAD decision that is supplied to one input of a spectral subtraction-based adaptive noise suppression (ANS) unit 105. A second input of the ANS 105 receives a delayed version of the original microphone signal samples, x(n). The output of the ANS 105 is a noise-reduced signal that is then supplied to a second speech coder 107, or fed back to speech coder 101 for coding and transmission of the speech information.
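The short term autocorrelation values that the speech coder supplies to the VAD can be illustrated with a floating-point sketch. It assumes a 160-sample frame (20 ms at 8 kHz) and lags 0 through 8; the actual GSM 06.10 coder works in fixed-point arithmetic with scaling that is omitted here, and the name short_term_acf is hypothetical.

```python
import math

FRAME_LEN = 160  # assumed: one 20 ms frame at 8 kHz
NUM_LAGS = 9     # assumed: lags 0..8

def short_term_acf(frame, num_lags=NUM_LAGS):
    """Return [r(0), ..., r(num_lags - 1)] with r(k) = sum over i of x(i) * x(i + k)."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + k] for i in range(n - k))
        for k in range(num_lags)
    ]

# Usage: a 400 Hz tone stands in for one frame of microphone samples.
frame = [math.sin(2 * math.pi * 400 * i / 8000) for i in range(FRAME_LEN)]
acf = short_term_acf(frame)
```

The zero-lag value r(0) is the frame energy, and no other lag can exceed it in magnitude, so the remaining values describe the spectral shape of the frame relative to its energy.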
From the above discussion, it is apparent that the GSM VAD algorithm disadvantageously requires the execution of the whole speech coder in order to be able to extract the short term autocorrelation and long term periodicity information that is necessary for making the VAD decision.
The periodicity information in the speech coder is calculated by a long term predictor using cross correlation algorithms. These algorithms are computationally expensive and incur unnecessary delay in the hands-free signal processing. The need for a simple periodicity detector becomes more acute with the next generation coders (such as GSM's next generation Enhanced Full Rate (EFR) coder), which consume a large amount of memory and processing capacity (i.e., the number of instructions that need to be performed per second) and which add a significant computational delay compared to GSM's current Full Rate (FR) coders.
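The cost of a cross correlation lag search can be seen in a simplified sketch. It is not the long term predictor of any GSM coder: the 40-sample segment, the lag range of 40 to 120 samples and the name best_lag are assumptions for illustration. Each segment is compared with every candidate stretch of past signal.

```python
import math

def best_lag(signal, start, seg_len=40, min_lag=40, max_lag=120):
    """Return (lag, score) maximizing normalized cross correlation with past samples."""
    seg = signal[start:start + seg_len]
    best, best_score = min_lag, -1.0
    for lag in range(min_lag, max_lag + 1):
        past = signal[start - lag:start - lag + seg_len]
        num = sum(a * b for a, b in zip(seg, past))
        den = math.sqrt(sum(a * a for a in seg) * sum(b * b for b in past)) + 1e-12
        score = num / den
        if score > best_score:
            best, best_score = lag, score
    return best, best_score

# Usage: a synthetic periodic signal with a period of 70 samples.
signal = [
    math.sin(2 * math.pi * i / 70) + 0.5 * math.sin(4 * math.pi * i / 70)
    for i in range(320)
]
lag, score = best_lag(signal, start=160)
```

With these assumed sizes, the search evaluates 81 candidate lags of 40 multiply-accumulate operations each, or 3240 per segment for the numerator alone, which gives a sense of why a lower complexity periodicity detector is attractive.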
Having the VAD in the noise reduction algorithm rely on periodicity and ACF information from the speech coder 101 is costly with respect to delay, computational requirements and memory requirements. Furthermore, the speech coder has to be run twice before a successful voice transmission is achieved. The extraction of periodicity information from the signal is the most computationally expensive part. Consequently, a low complexity method for extracting the periodicity information in the signal is needed for efficient implementation of the background noise suppression algorithm in the mobile terminals and accessories of the future.
Conventional periodicity detectors are primarily based on analog processing of the signals, and fail to take into account the problems of material fading and slow processing time. They use computationally expensive techniques designed to process input signals that consist only of clean signals with no additive noise.
Other conventional periodicity detectors use the standard GSM type pitch detectors based on linear predictive coding (LPC) modeling of the input signal. These techniques, which suffer from the problems identified above, also fail to adapt the processing to the time varying nature of the signal, but instead use estimation model parameters (like the LPC order, frame length, and the like) that are not time-varying.
It is therefore desirable to provide voice activity detection without the aforementioned disadvantages.
The present invention provides voice activity detection without the aforementioned disadvantageous need for modeling information from speech coders.