Voice is carried over a digital telephone network, whether circuit- or packet-switched, by converting the analog signal to a digital signal. In the case of a packet-switched network, audio samples representing the digital signal are packetized, and the packetized samples sent electronically over the network. The packetized samples are received at the destination node, the samples de-packetized, and the analog signal recreated and provided to the other party.
During a telephone call, there are periods of time when neither party is talking. During such periods, background noise (which may include background voices) may be received by the telephone's microphone. Audio information, such as background noise, that is received during periods when no party to the call is speaking and when there is no audible call signaling, such as a tone, is referred to herein as “silence”.
Silence suppression is a process of not transmitting audio information over the network when one of the parties involved in a telephone call is not speaking, thereby substantially reducing bandwidth usage and assisting the identification of jitter buffer adjustment points. In a Voice over Internet Protocol (“VoIP”) system, Voice Activity Detection (“VAD”) or Speech Activity Detection (“SAD”) is used to dynamically monitor background noise, set appropriate speech detection thresholds and identify jitter buffer adjustment points. VAD detects, in audio signals or samples thereof, the presence or absence of human speech and, using this information, identifies silence periods. When silence suppression is in effect, the audio information received during such silence periods is not transmitted over the network to the other (destination) endpoint(s). Given that typically only one party in a conversation speaks at any one time, silence suppression can achieve overall bandwidth savings on the order of 50% over the duration of a typical telephone call.
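The transmit-side effect of silence suppression can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, the per-frame voice-activity flags are assumed to come from some VAD, and the frame index merely stands in for whatever sequencing the transport provides.

```python
def suppress_silence(frames, active_flags):
    """Return (index, frame) pairs for the frames that would be transmitted.

    frames       -- list of audio frames (any payload type)
    active_flags -- list of booleans from a VAD, True meaning voice-active
    """
    if len(frames) != len(active_flags):
        raise ValueError("one VAD flag is required per frame")
    # Frames flagged as silence are simply not sent.
    return [(i, f) for i, (f, a) in enumerate(zip(frames, active_flags)) if a]


def bandwidth_saving(active_flags):
    """Fraction of frames that are withheld from the network."""
    if not active_flags:
        return 0.0
    return 1.0 - sum(active_flags) / len(active_flags)
```

If a party speaks during half of the frames, half of the frames are withheld, which corresponds to the savings figure cited above for a typical call.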
Distinguishing between voiced speech and background noise can be difficult. Moreover, VAD or SAD must occur very quickly to avoid clipping. To address these issues, a number of algorithms of differing degrees of complexity have been used. Examples include those based on energy thresholds (e.g., using the Signal-to-Noise Ratio or SNR), pitch detection, spectrum or spectral shape analysis, zero-crossing rate (e.g., determining how frequently the signal amplitude changes from positive to negative), periodicity measure, higher order statistics in the Linear Predictive Coding or LPC residual domain (e.g., the energy of the predictive coding error or the residual increases when there is a mismatch between the shapes of the background and input signal), and combinations thereof.
In one common silence suppression scheme, the power of the signal is used as the criterion for classifying a signal into voice and silence segments. It is assumed that the power of the total signal in the presence of speech is sufficiently larger than that of background noise. A threshold value is used to mark the minimum SNR for a segment to be classified as voice-active. This threshold is known as the noise floor and is dynamically recalculated using the power of the signal. If the SNR of the signal meets or exceeds the threshold, the signal is considered to be voice-active. Otherwise, it is regarded as background noise. This behavior can be seen from FIG. 2, in which the amplitude waveform 200 of the received audio signal, the power waveform 204 of the received audio signal and the noise floor power waveform 208 are depicted. The value of the noise floor is a smoothed representation of the signal waveform 200. The figure further shows the detected voice-active and silence segments 212 and 216, respectively. As can be seen from FIG. 2, the noise floor waveform 208 trends upward when the signal includes speech segments 220 and 224 because of the large increase in signal power, and downward immediately after the segments because of the large decrease in signal power. At the heart of this algorithm is its ability to adapt to changing background noise through its implementation of a time-varying noise floor.
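A power-based detector with a time-varying noise floor might look like the sketch below. It is an illustration under stated assumptions, not the scheme of FIG. 2 itself: the exponential smoothing, the 6 dB threshold, the 0.9 smoothing factor and the function names are all hypothetical choices.

```python
import math


def frame_power(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def noise_floor_vad(frames, snr_threshold_db=6.0, smoothing=0.9):
    """Classify each frame as voice-active (True) or silence (False).

    The noise floor is an exponentially smoothed copy of the frame power
    (an assumed smoothing rule), so it rises during loud segments and
    decays afterwards.
    """
    decisions = []
    floor = None
    for frame in frames:
        power = max(frame_power(frame), 1e-12)  # avoid log10(0)
        if floor is None:
            floor = power  # seed the floor with the first frame
        snr_db = 10.0 * math.log10(power / floor)
        decisions.append(snr_db >= snr_threshold_db)
        # Recalculate the floor from the signal power after each frame.
        floor = smoothing * floor + (1.0 - smoothing) * power
    return decisions
```

With these assumed parameters, a short loud burst surrounded by quiet frames is flagged as voice-active, while the quiet frames before and after it are flagged as silence.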
The above VAD schemes can have difficulty detecting signals of substantially constant power, such as progress tones (e.g., intercept tones, ringback tones, busy tones, dial tones, reorder tones, and the like). Such schemes often identify these tones as background noise, which is not transmitted to the other endpoint. The problems with detecting a progress tone are shown by FIGS. 3A and 3B. FIG. 3A shows the progress tone as a sinusoidal waveform 300. FIG. 3B shows the tone expressed as a waveform 304 having a substantially constant power level. Because the noise floor is based on the power of the signal, when the signal has a substantially constant power the noise floor waveform 308 will approach the waveform 304. Using the VAD scheme noted above, the interval 312 would be properly diagnosed as being voice-active and therefore to be transmitted to the other endpoint, while the interval 316 would be misdiagnosed as silence and therefore not to be transmitted to the other endpoint. At best, the other party would thus hear only part of the tone, which could cause him or her to believe that the telephone had malfunctioned. The misdiagnosis could further cause misadjustment of the jitter buffer (which could cause clicks and pops to be heard by the other person).
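The failure mode can be reproduced with a small simulation. The sketch below is illustrative only: the 400 Hz tone, 8 kHz sampling rate, 80-sample frames, Gaussian background noise, 6 dB threshold and exponentially smoothed noise floor are all assumptions chosen for the demonstration, not values taken from FIGS. 3A and 3B.

```python
import math
import random

FRAME = 80    # samples per frame (assumed: 10 ms at 8 kHz)
RATE = 8000   # sampling rate in Hz (assumed)


def tone_frames(n_frames, freq_hz=400.0, amplitude=0.5):
    """Frames of a constant-power sinusoid standing in for a progress tone."""
    return [
        [amplitude * math.sin(2 * math.pi * freq_hz * (k * FRAME + n) / RATE)
         for n in range(FRAME)]
        for k in range(n_frames)
    ]


def noise_frames(n_frames, sigma=0.001, seed=0):
    """Frames of low-level Gaussian background noise (seeded for repeatability)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, sigma) for _ in range(FRAME)] for _ in range(n_frames)]


def classify(frames, snr_threshold_db=6.0, smoothing=0.9):
    """Power-based VAD with an exponentially smoothed noise floor."""
    floor, out = None, []
    for frame in frames:
        power = max(sum(s * s for s in frame) / len(frame), 1e-12)
        floor = power if floor is None else floor
        out.append(10 * math.log10(power / floor) >= snr_threshold_db)
        floor = smoothing * floor + (1 - smoothing) * power
    return out


# 20 frames of background noise followed by 30 frames of tone.
tone_decisions = classify(noise_frames(20) + tone_frames(30))[20:]
print(["active" if d else "silence" for d in tone_decisions])
```

Under these assumptions only the first few tone frames are classified as active. As the noise floor climbs toward the constant tone power, the remaining tone frames fall below the threshold and are treated as silence, which corresponds to the behavior described for intervals 312 and 316.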
Fixed power signals can be reliably detected by more elaborate approaches, such as by analyzing the frequency spectrum of the signals using complex techniques like the Fast Fourier Transform (FFT) and Cepstral Analysis. However, the processing and memory cost of transforming the signal to the frequency domain is too high, and the processing time too long, for such algorithms to be practical in a real-time application. Some of the techniques, such as FFT, introduce delay due to the need to build buffers (blocking) of input samples and/or require larger amounts of Random Access Memory (RAM) to store them. A feasible solution must necessarily be time-based.
Threshold VADs are the most commonly used solution. Under the Energy Threshold method, the energy of the total signal in the presence of speech (which includes progress tones) is assumed to be larger than a preset threshold. A signal having an amplitude more than the threshold is deemed to be voice active regardless of the VAD conclusion. This approach, though preserving much progress tone information, makes assumptions that do not hold in some applications, resulting in poor accuracy rates. Statistical analysis of the signals has also been used, such as using Amplitude Probability Distribution as a means to ascertain noise level. But again, these methods are computationally expensive and not suitable for a VoIP gateway setting.
One algorithm that has been partially successful has been used in Avaya Inc.'s Crossfire™ gateway. The gateway uses the zero crossing rate method and exploits the time-based periodicity of a fixed power signal. Noise signals are assumed to be random by nature. The zero crossing rates for each frame are monitored. A constant zero crossing rate implies periodicity and thus a voice-active segment. In other words, the periodicity of the various zero crossing points is determined, and pattern matching techniques are used to identify zero crossing behavior characteristic of a fixed power signal.
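The general idea of monitoring per-frame zero crossing rates can be sketched as follows. This is not the gateway's actual algorithm, whose pattern matching is not described here; the function names and the tolerance of one crossing are hypothetical.

```python
import math


def zero_crossings(frame):
    """Count sign changes between consecutive samples in one frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def has_constant_crossing_rate(frames, tolerance=1):
    """True when every frame has (nearly) the same zero crossing count.

    A constant count is taken as a hint of periodicity, e.g. a tone.
    """
    counts = [zero_crossings(f) for f in frames]
    return bool(counts) and max(counts) - min(counts) <= tolerance


# Usage: a 400 Hz tone sampled at 8 kHz (assumed values), 80 samples per frame.
# The half-sample phase offset keeps samples away from exact zeros.
tone = [
    [math.sin(2 * math.pi * (k * 80 + n + 0.5) / 20) for n in range(80)]
    for k in range(5)
]
print([zero_crossings(f) for f in tone], has_constant_crossing_rate(tone))
```

Each tone frame here spans four full cycles, so every frame yields the same crossing count and the sequence is flagged as periodic.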
A similar zero-crossing algorithm is used in the G.729B extension for the G.729 speech coder standardized by ITU-T. Under the extension, decisions are made every 10 milliseconds on speech frames consisting of 80 audio samples. Parameters extracted from the speech frames include full band energy, low band energy, Line Spectral Frequency (“LSF”) coefficients, and zero crossing rate. Differences between the four parameters extracted from the current frame and running averages of the noise are calculated for every frame. The differences represent noise characteristics. Large differences imply that the current frame is voice, while small differences imply that there is no voice present. The decision made by the VAD is based on a complex multi-boundary algorithm.
The problem with these methods is that a constant zero crossing rate does not always correspond to a periodic signal. A noise signal may cross a fixed line at a constant rate by chance. Since each segment constitutes only 80 audio samples, the accuracy of this method is limited by the small sample space. Errors in identifying zero crossing points can still cause a constant power signal to be misdiagnosed as background noise. To address this problem, such schemes may be enhanced by the use of an additional fixed threshold to ensure that high amplitude signals are always determined to be an active signal. However, the use of such a threshold can cause low amplitude, fixed-power signals to now falsely be detected as silence.
Yet another VAD scheme is proposed by R. Tucker in his paper “Voice Activity Detection Using a Periodicity Measure”, published August 1992. He describes a VAD that can operate reliably in SNRs down to 0 dB and detect most speech at −5 dB. The detector applies a least-squares periodicity estimator to the input signal and triggers when a significant amount of periodicity is found. However, it does not aim to find the exact talkspurt boundaries and, consequently, is most suited to speech logging applications, where it is easy to include a small margin to allow for any missed speech. As will be appreciated, a “talkspurt” boundary refers to the boundary between speech and nonspeech audio information (e.g., the boundary between a period of “silence” and a period of voiced speech). The solution is unsuitable for a VoIP system, where detection of exact talkspurt boundaries is vital.