1. Field of the Invention
The present invention relates generally to voice activity detection. More particularly, the present invention relates to a tone detection algorithm for a voice activity detector.
2. Related Art
In 1996, the Telecommunication Sector of the International Telecommunication Union (ITU-T) adopted a toll quality speech coding algorithm known as the G.729 Recommendation, entitled “Coding of Speech Signals at 8 kbit/s using Conjugate-Structure Algebraic-Code-Excited Linear-Prediction (CS-ACELP).” Shortly thereafter, the ITU-T also adopted a silence compression algorithm known as the ITU-T Recommendation G.729 Annex B, entitled “A Silence Compression Scheme for Use with G.729 Optimized for V.70 Digital Simultaneous Voice and Data Applications.” The ITU-T G.729 and G.729 Annex B specifications are hereby incorporated by reference into the present application in their entirety.
Although initially designed for DSVD (Digital Simultaneous Voice and Data) applications, the ITU-T Recommendation G.729 Annex B (G.729B) has been heavily used in VoIP (Voice over Internet Protocol) applications, and will continue to serve the industry in the future. To save bandwidth, G.729B allows G.729 (and its annexes) to operate in two transmission modes, voice and silence/background noise, which are classified using a Voice Activity Detector (VAD).
A considerable portion of normal speech is made up of silence/background noise, which may be up to an average of 60 percent of a two-way conversation. During silence, the speech input device, such as a microphone, picks up environmental noise. The noise level and characteristics can vary considerably, from a quiet room to a noisy street or a fast-moving car. However, most of the noise sources carry less information than the speech; hence, a higher compression ratio is achievable during inactive periods. As a result, many practical applications use silence detection and comfort noise injection for higher coding efficiency.
In G.729B, this concept of silence detection and comfort noise injection leads to a dual-mode speech coding technique, where the different modes of input signal, denoted as active voice for speech and inactive voice for silence or background noise, are determined by a VAD. The VAD can operate externally or internally to the speech encoder. The full-rate speech coder is operational during active voice speech, but a different coding scheme is employed for the inactive voice signal, using fewer bits and resulting in a higher overall average compression ratio. The output of the VAD may be called a voice activity decision. The voice activity decision is either 1 or 0 (on or off), indicating the presence or absence of voice activity, respectively. The VAD algorithm and the inactive voice coder, as well as the G.729 or G.729A speech coders, operate on frames of digitized speech.
FIG. 1 illustrates conventional speech coding system 100, including encoder 101, communication channel 125 and decoder 102. As shown, encoder 101 includes VAD 120, active voice encoder 115 and inactive voice encoder 110. VAD 120 determines whether input signal 105 is a voice signal. If VAD 120 determines that input signal 105 is a voice signal, VAD output signal 122 causes input signal 105 to be routed to active voice encoder 115 and then routed to the output of active voice encoder 115 for transmission over communication channel 125. On the other hand, If VAD 120 determines that input signal 105 is not a voice signal, VAD output signal 122 causes input signal 105 to be routed to inactive voice encoder 110 and then routed to the output of inactive voice encoder 110 for transmission over communication channel 125. Further, VAD output signal 122 is also transmitted over communication channel 125 and received by decoder 102 as coding mode 127, such that at the other end, coding mode 127 controls whether the coded signal should be decoded using inactive voice decoder 130 or active voice decoder 135 to produce output signal 140.
When active voice encoder 115 is operational, an active voice bitstream is sent to active voice decoder 135 for each frame. However, during inactive periods, inactive voice encoder 110 can choose to send an information update called a silence insertion descriptor (SID) to the inactive decoder, or to send nothing. This technique is named discontinuous transmission (DTX). When an inactive voice is declared by VAD 120, completely muting the output during inactive voice segments creates sudden drops of the signal energy level which are perceptually unpleasant. Therefore, in order to fill these inactive voice segments, a description of the background noise is sent from inactive voice encoder 110 to inactive voice decoder 130. Such a description is known as a silence insertion description. Using the SID, inactive voice decoder 130 generates output signal 140, which is perceptually equivalent to the background noise in the encoder. Such a signal is commonly called comfort noise, which is generated by a comfort noise generator (CNG) within inactive voice decoder 130.
Due to an increase in deployment and use of VoIP applications, certain deficiencies of speech coding algorithms and, in particular, existing VAD algorithms have surfaced. For example, it has been experienced that the VAD erroneously may go off (indicative of inactive voice) at the tail end of a voice signal, although the voice signal is still present. As a result, the tail end of the voice signal is cut off by the VAD. FIG. 2 is an illustration of this first problem, where VAD 120 goes off at point 210, where voice signal still continues, and thus VAD 120 cuts off the tail end of voice signal 212. In other words, the CNG matches the energy of the tail end of the voice signal (i.e. energy of the signal after VAD goes off) for generating the comfort noise. Because the matched energy is not that of a silence or background noise signal, but the matched energy is that of the tail end of a voice signal, the comfort noise that is generated by the CNG sounds like an annoying breathe-like noise.
In a further problem, it has been determined that existing VADs occasionally misinterpret a high-level tone signal as an inactive voice or background noise, which results in the CNG generating a comfort noise by matching the energy of the high-level tone signal.
Other VAD problems may also be caused due to untimely or improper initialization or update of the noise state during the VAD operation. It is known that the background noise can change considerably during a conversation, for example, by moving from a quiet room to a noisy street, a fast-moving car, etc. Therefore, the initial parameters indicative of the varying characteristics of background noise (or the noise state) must be updated for adaptation to the changing environment. However, when the background noise parameters are not timely or properly updated or initialized, various problems may occur, including (a) undesirable performance for input signals that start below a certain level, such as around 15 dB, (b) undesirable performance in noisy environments, (c) waste of bandwidth by excessive use of SID frames, and (d) incorrect initialization of noise characteristics when noise is missing at the beginning of the speech. As an example, when the incoming signal starts with silence followed by a sudden change in the level of noise signal, existing VADs do not initialize the noise state correctly, which can lead to the noise signal following the silence erroneously being considered as the active voice by the VAD. As a result of this improper initialization of the noise state, the VAD may go on during background noise periods causing an active voice mode selection, where the bandwidth is wasted for coding of the background noise.
Therefore, there is an intense need for a robust VAD algorithm that can overcome the existing problems and deficiencies in the art.