For applications in which communication occurs in noisy environments, it may be desirable to separate a desired speech signal from background noise. Noise may be defined as the combination of all signals interfering with or otherwise degrading the desired signal. Background noise may include numerous noise signals generated within the acoustic environment, such as background conversations of other people, as well as reflections and reverberation generated from the desired signal and/or any of the other signals.
Signal activity detectors, such as voice activity detectors (VADs), can be used to minimize the amount of unnecessary processing in an electronic device. A voice activity detector may selectively control one or more signal processing stages following a microphone. For example, a recording device may implement a voice activity detector to minimize processing and recording of noise signals. The voice activity detector may de-energize or otherwise deactivate signal processing and recording during periods of no voice activity. Similarly, a communication device, such as a smart phone, mobile telephone, personal digital assistant (PDA), laptop, or any portable computing device, may implement a voice activity detector in order to reduce the processing power allocated to noise signals and to reduce the noise signals that are transmitted or otherwise communicated to a remote destination device. The voice activity detector may de-energize or deactivate voice processing and transmission during periods of no voice activity.
The ability of the voice activity detector to operate satisfactorily may be impeded by changing noise conditions and noise conditions having significant noise energy. The performance of a voice activity detector may be further complicated when voice activity detection is integrated in a mobile device, which is subject to a dynamic noise environment. A mobile device can operate under relatively noise free environments or can operate under substantial noise conditions, where the noise energy is on the order of the voice energy. The presence of a dynamic noise environment complicates the voice activity decision.
Conventionally, a voice activity detector classifies an input frame as background noise or active speech. The active/inactive classification allows speech coders to exploit pauses between the talk spurts that are often present in a typical telephone conversation. At a high signal-to-noise ratio (SNR), such as an SNR>30 dB, simple energy measures are adequate to accurately detect the voice inactive segments for encoding at minimal bit rates, thereby meeting lower bit rate requirements. However, at low SNRs, the performance of the voice activity detector degrades significantly. For example, at low SNRs, a conservative VAD may produce increased false speech detection, resulting in a higher average encoding rate. An aggressive VAD may miss detecting active speech segments, thereby resulting in loss of speech quality.
Most current VAD techniques use the long-term SNR to estimate a threshold (referred to as VAD_THR) to use in performing the VAD decision of whether the input frame is background noise or active speech. At low SNRs or under fast-varying non-stationary noise, the smoothed long-term SNR will produce an inaccurate VAD_THR, resulting in either increased probability of missed speech or increased probability of false speech detection. Also, some VAD techniques (e.g., Adaptive Multi-Rate Wideband or AMR-WB) work well for stationary type of noises such as car noise but produce a very high voice activity factor (due to extensive false detections) for non-stationary noise at low SNRs (e.g., SNR<15 dB).
Thus, the erroneous indication of voice activity can result in processing and transmission of noise signals. The processing and transmission of noise signals can create a poor user experience, particularly where periods of noise transmission are interspersed with periods of inactivity due to an indication of a lack of voice activity by the voice activity detector. Conversely, poor voice activity detection can result in the loss of substantial portions of voice signals. The loss of initial portions of voice activity can result in a user needing to regularly repeat portions of a conversation, which is an undesirable condition.