1. Field of the Invention
The present invention relates to a voice activity detector, and a process for detecting a voice signal.
2. Description of the Related Art
In a number of speech processing applications it is important to determine the presence or absence of a voice component in a given signal and, in particular, to determine the beginning and end of voice segments. Detection of simple energy thresholds has been used for this purpose; however, satisfactory results tend to be obtained only where the signal has a relatively high signal-to-noise ratio.
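By way of illustration only, simple energy-threshold detection of the kind referred to above might be sketched as follows. The function names, the frame length of 320 samples, the threshold of -30 dB and the test signal are all assumptions made for the example and are not taken from any of the standards discussed here.

```python
import math

def frame_energy_db(frame):
    """Mean-square energy of one frame in dB, floored to avoid log(0)."""
    mean_square = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(max(mean_square, 1e-12))

def energy_vad(samples, frame_len, threshold_db):
    """Return one flag per full frame: True where the energy exceeds the threshold."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        flags.append(frame_energy_db(frame) > threshold_db)
    return flags

# Hypothetical input: one quiet frame followed by one frame holding a 200 Hz tone.
FRAME_LEN = 320          # assumed: 20 ms at 16 kHz
quiet = [0.001] * FRAME_LEN
loud = [0.5 * math.sin(2 * math.pi * 200 * n / 16000) for n in range(FRAME_LEN)]
flags = energy_vad(quiet + loud, frame_len=FRAME_LEN, threshold_db=-30.0)
```

Because the threshold is fixed, a noise floor that rises above it would be flagged as voice, which is the weakness at low signal-to-noise ratios noted above.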
Voice activity detection generally finds applications in speech compression algorithms, karaoke systems and speech enhancement systems. Voice activity detection processes typically dynamically adjust the noise level detected in the signals to facilitate detection of the voice components of the signal.
The International Telecommunication Union (ITU) prescribes the following standards for a voice activity detector (VAD):
1. ITU-T G.723.1 Annex A, Series G: Transmission Systems and Media, “Silence compression scheme”, 1996.
2. ITU-T G.729 Annex B, Series G: Transmission Systems and Media, “A silence compression scheme for G.729 optimized for terminals conforming to recommendation V.70”, 1996.
The European Telecommunication Standards Institute (ETSI) prescribes the following standard for a VAD:
1. ETSI EN 301 708 V7.1.1, Digital cellular telecommunications system (Phase 2+); “Voice Activity Detector (VAD) for adaptive Multi-Rate (AMR) speech traffic channels: general description”, 1999.
The basic function of the ETSI VAD is to indicate whether each 20 ms frame of an input signal sampled at 16 kHz contains data that should be transmitted, i.e., speech, music or information tones. The ETSI VAD sets a flag to indicate that the frame contains data that should be transmitted. A flow diagram of the processing steps of the ETSI VAD is shown in FIG. 1. The ETSI VAD uses parameters of the speech encoder to compute the flag.
The input signal is initially pre-emphasized and windowed into frames of 320 samples. Each windowed frame is then transformed into the frequency domain using a Discrete Fourier Transform (DFT).
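The front-end steps described above (pre-emphasis, windowing and transformation into the frequency domain) might be sketched as follows. This is a minimal sketch under stated assumptions: the pre-emphasis coefficient of 0.97, the Hann window, the zero-padding to a transform length of 512 and all function names are illustrative choices, not values taken from the ETSI standard.

```python
import cmath
import math

def pre_emphasize(frame, coeff=0.97, prev=0.0):
    """First-order pre-emphasis y[n] = x[n] - coeff * x[n-1] (coeff is assumed)."""
    out = []
    for x in frame:
        out.append(x - coeff * prev)
        prev = x
    return out

def hann_window(frame):
    """Apply a Hann window (an assumed window shape) to the frame."""
    n_samples = len(frame)
    return [x * 0.5 * (1.0 - math.cos(2.0 * math.pi * n / (n_samples - 1)))
            for n, x in enumerate(frame)]

def fft(values):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def frame_spectrum(frame, fft_len=512):
    """Pre-emphasize, window, zero-pad to fft_len and transform one frame."""
    shaped = hann_window(pre_emphasize(frame))
    padded = shaped + [0.0] * (fft_len - len(shaped))
    return fft(padded)

# Hypothetical input: a 1 kHz tone in one 20 ms frame sampled at 16 kHz.
FRAME_LEN = 320
frame = [math.sin(2 * math.pi * 1000 * n / 16000) for n in range(FRAME_LEN)]
spectrum = frame_spectrum(frame)
peak_bin = max(range(len(spectrum) // 2), key=lambda k: abs(spectrum[k]))
```

The frame is zero-padded because the radix-2 transform used in the sketch requires a power-of-two length, which 320 is not.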
The channel energy estimate for the current sub-frame is then calculated based on the following:
1. the minimum allowable channel energy;
2. a channel energy smoothing factor;
3. the number of combined channels; and
4. elements of the respective low and high channel combining tables.
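One plausible way in which the four quantities listed above could combine is sketched below: the mean bin power of each channel is smoothed recursively against the previous estimate and then floored at the minimum allowable energy. The exact update rule, the default smoothing factor of 0.45, the floor of 0.0625 and the combining tables are assumptions made for illustration and should be checked against the standard itself.

```python
def update_channel_energy(prev_energy, spectrum, low_table, high_table,
                          smoothing=0.45, min_energy=0.0625):
    """Smooth the mean bin power of each channel and floor it at min_energy.

    low_table[i] and high_table[i] give the first and last DFT bin (inclusive)
    combined into channel i; the number of channels is len(low_table).
    """
    energies = []
    for i, (lo, hi) in enumerate(zip(low_table, high_table)):
        band_power = sum(abs(spectrum[k]) ** 2 for k in range(lo, hi + 1))
        band_mean = band_power / (hi - lo + 1)
        smoothed = smoothing * prev_energy[i] + (1.0 - smoothing) * band_mean
        energies.append(max(min_energy, smoothed))
    return energies

# Hypothetical three-channel layout over a 16-bin spectrum.
low_table = [2, 4, 8]
high_table = [3, 7, 15]
previous = [0.0625, 0.0625, 0.0625]

flat = update_channel_energy(previous, [1 + 0j] * 16, low_table, high_table)
silent = update_channel_energy(previous, [0j] * 16, low_table, high_table)
```

With a silent spectrum the smoothed value falls below the floor, so each channel is held at the minimum allowable energy.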
The channel Signal to Noise Ratio (SNR) vector is used to compute the voice metrics of the input signal. The instantaneous frame SNR and the long-term peak SNR are used to calibrate the responsiveness of the ETSI VAD decision.
The quantized SNR is used to determine the respective voice metric threshold, hangover count and burst count threshold parameters. The ETSI VAD decision can then be made according to the following process:
if ( v(m) > v_th + μ(m) ) {      /* if the voice metric > voice metric threshold */
    VAD(m) = ON
    b(m) = b(m−1) + 1            /* increment burst counter */
    if ( b(m) > b_th ) {         /* compare counter with threshold */
        h(m) = h_cnt             /* set hangover */
    }
}
else {
    b(m) = 0                     /* clear burst counter */
    h(m) = h(m−1) − 1            /* decrement hangover */
    if ( h(m) <= 0 ) {           /* check for expired hangover */
        VAD(m) = OFF
        h(m) = 0
    }
    else {                       /* hangover not yet expired */
        VAD(m) = ON
    }
}
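The decision process above may be expressed as a runnable sketch. The function and parameter names are hypothetical. One assumption is added: where the voice metric exceeds the threshold but the burst counter has not yet exceeded its threshold, the pseudocode does not assign h(m), so the sketch carries the previous hangover value forward.

```python
def vad_decide(voice_metric, threshold, bias, burst_prev, hangover_prev,
               burst_threshold, hangover_count):
    """One frame of the burst/hangover decision.

    Returns (vad_on, burst, hangover) for the current frame.
    """
    if voice_metric > threshold + bias:
        vad_on = True
        burst = burst_prev + 1             # increment burst counter
        hangover = hangover_prev           # assumed: unchanged unless reset below
        if burst > burst_threshold:
            hangover = hangover_count      # set hangover
    else:
        burst = 0                          # clear burst counter
        hangover = hangover_prev - 1       # decrement hangover
        if hangover <= 0:                  # hangover expired
            vad_on = False
            hangover = 0
        else:                              # hangover not yet expired
            vad_on = True
    return vad_on, burst, hangover

# Hypothetical voice metrics: three voiced frames followed by four quiet frames.
metrics = [10, 10, 10, 1, 1, 1, 1]
burst, hangover = 0, 0
decisions = []
for metric in metrics:
    vad_on, burst, hangover = vad_decide(
        metric, threshold=5, bias=0, burst_prev=burst, hangover_prev=hangover,
        burst_threshold=2, hangover_count=2)
    decisions.append(vad_on)
```

In this example the hangover keeps the flag set for one frame after the voice metric drops, which is the intended effect of the hangover count.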
To avoid being over-sensitive to fluctuating, non-stationary, background noise conditions, a bias factor may be used to increase the threshold on which the ETSI VAD decision is based. This bias factor is typically derived from an estimate of the variability of the background noise estimate. The variability estimate is further based on negative values of the instantaneous SNR. It is presumed that a negative SNR can only occur as a result of fluctuating background noise, and not from the presence of voice. Therefore, the bias factor is derived by first calculating the variability factor. The spectral deviation estimator is used as a safeguard against erroneous updates of the background noise estimate. If the spectral deviation of the input signal is too high, then the background noise estimate update may not be permitted.
The ETSI VAD needs at least 4 frames to give a reliable average speech energy with which the speech energy of the current data frame can be compared.
A typical problem faced by a VAD is misclassification of the input signal into voice/silence regions. Some standard algorithms vary the noise threshold dynamically across a number of frames and produce more accurate VAD estimates with time. However, the complexity of these VADs is relatively high. The complexity of the ETSI VAD may be given as follows:
ETSI VAD = {2·O(L) + O(M·log2(M)) + 4·O(Nc)} operations
where
Nc is the number of combined channels;
L is the subframe length; and
M is the DFT length.
Windowing and pre-emphasis each have complexity of the order of O(L). The Discrete Fourier Transform has complexity of the order of O(M·log2(M)). The channel energy estimator, channel SNR estimator, voice metric calculator and long-term peak SNR calculator each have complexity of the order of O(Nc).
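To give a rough sense of scale, the complexity expression can be evaluated for example sizes. The sketch below treats every O() term as having a constant factor of one, which order notation does not guarantee, and the values L = 320, M = 512 and Nc = 16 are assumptions chosen for illustration only.

```python
import math

def etsi_vad_operation_estimate(sub_len, dft_len, num_channels):
    """Evaluate 2*L + M*log2(M) + 4*Nc with unit constant factors (assumed)."""
    front_end = 2 * sub_len                      # windowing and pre-emphasis
    transform = dft_len * math.log2(dft_len)     # frequency-domain transform
    per_channel = 4 * num_channels               # four channel-based stages
    return front_end + transform + per_channel

# Hypothetical sizes: L = 320, M = 512, Nc = 16.
estimate = etsi_vad_operation_estimate(320, 512, 16)
```

Under these assumptions the transform term accounts for most of the total, so the transform length has the greatest influence on the estimate.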
These VADs are typically not efficient for applications that require low-delay, signal-dependent estimation of voice/silence regions of speech. Such applications include pitch detection of speech signals for karaoke. If a noisy signal is determined to be a speech track, the pitch detection algorithm may return an erroneous estimate of the pitch of the signal. As a result, most of the pitch estimates will be lower than expected, as shown in FIG. 2. The ETSI VAD supports a low-delay VAD estimate based on pre-fixed noise thresholds; however, these thresholds are not signal dependent.
An object of the present invention is to overcome or ameliorate one or more of the above-mentioned difficulties, or at least to provide a useful alternative.