1. Field of the Invention
The present invention relates to telecommunications systems and more particularly to a mechanism for detecting voice activity in a communications signal and for distinguishing voice activity from noise, quiescence or silence.
2. Description of Related Art
In telecommunications systems, a need often exists to determine whether a communications signal contains voice or other meaningful audio activity (hereafter referred to as "voice activity" for convenience) and to distinguish such voice activity from mere noise and/or silence. The ability to efficiently draw this distinction is useful in many contexts.
As an example, a digital telephone answering device (TAD) will typically have a fixed amount of memory space for storing voice messages. Ideally, this memory space should be used for storing only voice activity, and periods of silence should be stored as tokens rather than as silence over time. Unfortunately, however, noise often exists in communications signals. For instance, a signal may be plagued with low level cross-talk (e.g., inductive coupling of conversations from adjacent lines), pops and clicks (e.g., from bad lines), various background noise and/or other interference. Since noise is not silent, a problem exists: in attempting to identify silence to store as a token, the TAD may interpret the line noise as speech and may therefore store the noise notwithstanding the absence of voice activity. As a result, the TAD may waste valuable memory space.
As another example, in many telecommunications systems, voice signals are encoded before being transmitted from one location to another. The process of encoding serves many purposes, not the least of which is compressing the signal in order to conserve bandwidth and to therefore increase the speed of communication. One method of compressing a voice signal is to encode periods of silence or background noise with a token. Similar to the example described above, however, noise can unfortunately be interpreted as a voice signal, in which case it would not be encoded with a token. Hence, the voice signal may not be compressed as much as possible, resulting in a waste of bandwidth and slower (and potentially lower quality) communication.
As still another example, numerous applications now employ voice recognition technology. Such applications include, for example, telephones with voice activated dialing, voice activated recording devices, and various electronic device actuators such as remote controls and data entry systems. By definition, such applications require a mechanism for detecting voice and distinguishing voice from other noises. Therefore, such mechanisms can suffer from the same flaw identified above, namely an inability to sufficiently distinguish and detect voice activity.
A variety of speech detection systems currently exist. One type of system, for instance, relies on a spectral comparison of the communications signal with a spectral model of common noise or speech harmonics. An example of one such system is provided by the GSM 6.32 standard for mobile (cellular) communications promulgated by the Global System for Mobile Communications. According to GSM 6.32, the communications signal is passed through a multi-pole filter to remove typical noise frequency components from the signal. The coefficients of the multi-pole filter are adaptively established by reference to the signal during long periods of noise, where such periods are identified by spectral analysis of the signal in search of fairly static frequency content representative of noise rather than speech. Over each of a sequence of frames, the energy output from the multi-pole filter is then compared to a threshold level that is also adaptively established by reference to the background noise, and a determination is made whether the energy level is sufficient to represent voice.
Unfortunately, such spectral-based voice activity detectors necessitate complex signal processing and delays in order to establish the filter coefficients necessary to remove noise frequencies from the communication signal. For instance, with such systems it becomes necessary to establish the average pole placement over a number of sequential frames and to ensure that those poles do not change substantially over time. For this reason, the GSM standard looks for relatively constant periodicity in the signal before establishing a set of filter coefficients.
Further, any system that is based on a presumption as to the harmonic character of noise and speech is unlikely to be able to distinguish speech from certain types of noise. For instance, low level cross-talk may contain spectral content akin to voice and may therefore give rise to false voice detection. Further, a spectral analysis of a signal containing low level cross-talk could cause the GSM system to conclude that there is an absence of constant noise. Therefore, the filter coefficients established by the GSM system may not properly reflect the noise, and the adaptive filter may fail to eliminate noise harmonics as planned. Similarly, pops and clicks and other non-stationary components of noise may not fit neatly into an average noise spectrum and may therefore pass through the adaptive filter of the GSM system as voice and contribute to a false detection of voice.
Another type of voice detection system, for instance, relies on a combined comparison of the energy and zero crossings of the input signal with the energy and zero crossings believed to be typical in background noise. As described in Lawrence R. Rabiner & Ronald W. Schafer, Digital Processing of Speech Signals 130-135 (Prentice Hall 1978), this procedure may involve taking the number of zero crossings in an input signal over a 10 ms time frame and the average signal amplitude over a 10 ms window, at a rate of 100 times/second. If over the first 100 ms, it is assumed that the signal contains no speech, then the mean and standard deviation of the average magnitude and zero crossing rate for this interval should give a statistical characterization of the background noise. This statistical characterization may then be used to compute a zero-crossing rate threshold and an energy threshold. In turn, average magnitude profile zero-crossing rate profiles of the signal can be compared to the threshold to give an indication of where the speech begins and ends.
Unfortunately, however, this system of voice detection relies on a comparison of signal magnitude to expected or assumed threshold levels. These threshold levels are often inaccurate and can give rise to difficulty in identifying speech that begins or ends with weak frickatives (e.g., "f", "th", and "h" sounds) or plosive bursts (e.g., "p", "t" or "k" sounds), as well as distinguishing speech from noise such as pops and clicks. Further, while an analysis of energy and zero crossings may work to detect speech in a static sound recording, the analysis is likely to be too slow and inefficient to detect voice activity in real-time media streams.
In view of the deficiencies in these and other systems, a need exists for an improved mechanism for detecting voice activity and distinguishing voice from noise or silence.