The International Telecommunication Union (ITU) Recommendation G.729 Annex B describes a compression scheme for communicating information about the background noise received in an incoming signal when no voice is detected in the signal. This compression scheme is optimized for terminals conforming to Recommendation V.70. The teachings of ITU-T G.729 and Annex B of the Recommendation are hereby incorporated into this application by reference.
Conventional speech decoders use synthesized comfort noise to simulate the background noise of a communication link during periods when voice is not detected in the incoming signal. By synthesizing the background noise, little or no information about the actual background noise need be conveyed through the communication channel of the link. However, if the background noise is not statistically stationary (i.e., the distribution function varies with time), the simulated comfort noise does not provide the naturalness of the original background noise. Therefore it is desirable to occasionally send some information about the background noise to improve the quality of the synthesized noise when no speech is detected in the incoming signal.
An adequate representation of the background noise, in a digitized frame (i.e., a 10 ms portion) of the incoming signal, can be achieved with as few as fifteen bits, substantially fewer than the number needed to adequately represent a voice signal.
The G.729 recommendation provides voice activity detection (VAD), discontinuous transmission (DTX), and Comfort Noise Generator (CNG) algorithms. The output of the VAD module is either 1 or 0, indicating the presence or absence of voice activity respectively. If the VAD output is 1, the G.729 speech codec is invoked to encode the active voice frames. However, if the VAD output is 0, the DTX/CNG algorithms described herein are used to encode the non-active voice frames. Traditional speech coders and decoders use comfort noise to simulate the background noise in the non-active voice frames. If the background noise is not stationary, a mere comfort noise insertion does not provide the naturalness of the original background noise. Therefore it is desirable to intermittently send some information about the background noise in order to obtain a better quality when non-active voice frames are detected. Efficient coding of the non-active voice frames can be achieved by coding the energy of the frame and its spectrum with as few as fifteen bits. These bits are not automatically transmitted whenever a non-active voice frame is detected. Rather, the bits are transmitted only when an appreciable change has been detected with respect to the last transmitted non-active voice frame. At the decoder side, the received bit stream is decoded. If the VAD output is 1, the G.729 decoder is invoked to synthesize the reconstructed active voice frames. If the VAD output is 0, the CNG module is called to reproduce the non-active voice frames.
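The encoder-side selection described above can be illustrated with a minimal sketch. The names `NoiseParams` and `encode_frame`, the frame-type labels, and the 2 dB threshold are hypothetical and are not taken from the Recommendation; the sketch also compares frame energy only and omits the spectrum comparison.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class NoiseParams:
    """Hypothetical container for the noise description of one frame."""
    energy_db: float  # frame energy in dB (spectrum omitted for brevity)


def encode_frame(
    vad: int,
    params: NoiseParams,
    last_sent: Optional[NoiseParams],
    threshold_db: float = 2.0,  # assumed value for an "appreciable change"
) -> Tuple[str, Optional[NoiseParams]]:
    """Return (frame_type, noise description last transmitted)."""
    if vad == 1:
        # Active voice: the speech encoder handles the frame.
        return "SPEECH", last_sent
    # Non-active voice: send a noise update only if nothing has been sent
    # yet or the energy moved appreciably since the last transmitted update.
    if last_sent is None or abs(params.energy_db - last_sent.energy_db) > threshold_db:
        return "SID", params
    # Otherwise nothing is transmitted for this frame.
    return "NO_TX", last_sent
```

In this sketch a run of similar noise frames produces one update followed by untransmitted frames, which is the behavior that saves channel capacity.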
FIG. 1 illustrates a half-duplex communication link conforming to Recommendation G.729 Annex B. At the transmitting side of the link, a VAD module 12 generates a digital output to indicate the detection of noise or voice in the incoming signal. An output value of one indicates the detected presence of voice and a value of zero indicates its absence. If the VAD 1 detects voice, a G.729 speech encoder 6 is invoked to encode the digital representation of the detected voice signal. However, if the VAD 12 does not detect voice, a Discontinuous Transmission/Comfort Noise Generator (noise) encoder 14 is used to code the digital representation of the detected background noise signal. The digital representations of these voice and background noise signals 7 are formatted into data frames containing the information from samples of the incoming signal taken during consecutive time periods (e.g., frames can be formatted into 10 ms frame sizes). The noise encoder and the voice encoder place their frames into a bit stream, and the bit stream carries the frames over a communication channel.
At the decoder side, the received bit stream for each frame is examined to determine whether to invoke either the voice or noise decoder. The examination process for each frame includes an evaluation of the protocol and codec, frame or packet type, and length of a packet. If no packet arrives in the bit stream during a noise session, then a comfort noise packet is generated based on the most recent SID packet that arrived at the decoder side.
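The decoder-side examination can be sketched as follows. This is only an illustration under stated assumptions: it classifies a frame by payload length alone, assuming a 10-octet speech frame and a 2-octet SID frame, and it does not model the protocol or codec checks or the tracking of a noise session. The names `NoiseDecoderState` and `handle_frame` and the returned labels are hypothetical.

```python
from typing import Optional

SPEECH_FRAME_BYTES = 10  # assumption: 80-bit speech frame
SID_FRAME_BYTES = 2      # assumption: 15-bit SID frame padded to 2 octets


class NoiseDecoderState:
    """Remembers the most recent SID payload seen by the decoder."""

    def __init__(self) -> None:
        self.last_sid: Optional[bytes] = None


def handle_frame(state: NoiseDecoderState, payload: Optional[bytes]) -> str:
    """Decide which decoder path a frame period should take."""
    if payload is None or len(payload) == 0:
        # Nothing arrived: regenerate comfort noise from the last SID, if any.
        if state.last_sid is not None:
            return "comfort_noise_from_last_sid"
        return "no_reference"
    if len(payload) == SID_FRAME_BYTES:
        state.last_sid = payload  # keep it for later untransmitted frames
        return "comfort_noise_from_new_sid"
    if len(payload) == SPEECH_FRAME_BYTES:
        return "speech_decoder"
    return "unknown"
```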
To determine whether a frame contains voice or noise, the VAD 1 extracts and analyzes four parametric characteristics of the information within the frame. These characteristics are the full- and low-band energies, the set of Line Spectral Frequencies (LSF), and the zero crossing (ZC) rate. A set of difference measures between the extracted characteristics of the current frame and the running averages of the background noise characteristics is calculated for each frame. The difference between the current frame and the running average represents the characteristics of the noise. Where small differences in characteristics are detected, the characteristics of the current frame are highly correlated to those of the running averages for the background noise and the current frame is more likely to contain background noise than voice. Where large differences are detected, the current frame is more likely to contain a signal of a different type, such as a voice signal.
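The four difference measures can be illustrated with a short sketch. The forms used here (a sum of squared LSF differences for the spectral measure, and running average minus current value for the other three) are assumptions for illustration, and the names `FrameFeatures` and `difference_measures` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FrameFeatures:
    """Hypothetical container for the four characteristics of one frame."""
    lsf: List[float]         # line spectral frequencies
    full_band_energy: float  # full-band energy
    low_band_energy: float   # low-band energy
    zc: float                # zero crossing rate


def difference_measures(frame: FrameFeatures, avg: FrameFeatures) -> Dict[str, float]:
    """Compare the current frame with the running background-noise averages."""
    spectral = sum((f - a) ** 2 for f, a in zip(frame.lsf, avg.lsf))
    return {
        "spectral": spectral,
        "full_band_energy": avg.full_band_energy - frame.full_band_energy,
        "low_band_energy": avg.low_band_energy - frame.low_band_energy,
        "zc": avg.zc - frame.zc,
    }
```

Measures near zero indicate a frame that resembles the tracked background noise; large magnitudes suggest a different kind of signal, such as voice.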
An initial VAD decision regarding the content of the incoming frame is made using multi-boundary decision regions in the space of the four differential measures, as described in ITU G.729 Annex B. Thereafter, a final VAD decision is made based on the relationship between the detected energy of the current frame and that of neighboring past frames. This final decision step tends to reduce the number of state transitions.
The running averages have to be updated only in the presence of background noise, and not in the presence of speech. An adaptive update is as follows:
\[ \text{if } \big( (E_f < \bar{E}_f + 3\ \text{dB} \ \text{and}\ RC(1) < 0.75) \ \text{or}\ SD < 0.0637 \big) \ \text{then update} \]
where \(\bar{E}_f\) is the average full band noise energy, \(RC(1)\) is the first reflection coefficient, and \(SD\) is the spectral distance. Let \(C_n\) be the total number of frames where the update condition was satisfied. \(\bar{E}_f\) and \(C_n\) are further updated according to:
\[ \text{if } (\text{frame count} > N_0) \ \text{and}\ (\bar{E}_f < E_{\min}) \quad \left\{ \begin{array}{l} \bar{E}_f = E_{\min} \\ C_n = 0 \end{array} \right\} \]
As recited in ITU Recommendation G.729B, the normalized zero crossing rate is given by equation (B.3), as recited below:
\[ ZC = \frac{1}{2M} \times \sum_{i=0}^{M-1} \left[ \, \left| \operatorname{sgn}(x(i)) - \operatorname{sgn}(x(i-1)) \right| \, \right], \]
where \(x(i)\) is the pre-processed input signal.
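The update condition, the minimum-energy reset, and the zero crossing rate of equation (B.3) can be sketched as follows. This is an illustration and not the reference implementation: the function names are hypothetical, the value of N0 is left as a parameter because it is not given here, and the convention that sgn(0) is +1 is an assumption.

```python
from typing import Sequence, Tuple


def sgn(v: float) -> int:
    # Assumption: a zero-amplitude sample is treated as positive.
    return 1 if v >= 0 else -1


def zero_crossing_rate(x_prev: float, frame: Sequence[float]) -> float:
    """Equation (B.3) over one frame of M samples; x_prev stands for x(-1)."""
    m = len(frame)
    total = 0
    prev = x_prev
    for cur in frame:
        total += abs(sgn(cur) - sgn(prev))  # 2 at a sign change, else 0
        prev = cur
    return total / (2 * m)


def should_update(ef_db: float, ef_avg_db: float, rc1: float, sd: float) -> bool:
    """Adaptive update condition for the running noise averages."""
    return (ef_db < ef_avg_db + 3.0 and rc1 < 0.75) or sd < 0.0637


def reset_noise_energy(
    frame_count: int, n0: int, ef_avg: float, e_min: float, cn: int
) -> Tuple[float, int]:
    """Apply the further update of the average energy and the counter Cn."""
    if frame_count > n0 and ef_avg < e_min:
        return e_min, 0
    return ef_avg, cn
```

For example, a four-sample frame that alternates in sign after a positive previous sample, `zero_crossing_rate(1, [1, -1, 1, -1])`, contains three sign changes and yields 0.75.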
G.729B recommends using the first thirty-two frames to initialize the running averages of the line spectral frequencies (LSF), full band energy, low band energy, and zero crossing rate. The average spectral parameters of the background noise, denoted by {LSFavg}, are initialized as an average of the line spectral frequencies of the frames; the average background noise zero crossing rate, denoted by ZCavg, is initialized as an average of the zero crossing rate, ZC; and the average full and low band energies are initialized as averages of the full and low band energies of the frames. If the frames contain voice or tone packets during the initialization instead of noise, the G.729B VAD recommended solution can fail to detect any noise during voice or tone signal transmissions due to problems associated with measuring the samples at the zero crossing, resulting in poor performance of the voice activity detector. The G.729B recommended standard calculates the zero crossing rate based upon the multiplication of consecutive signal samples. If a sample point falls at a zero crossing point, the calculation cannot count the point as a zero crossing because the sample has zero amplitude, so the product is zero rather than negative; a tone signal can then be detected as noise, causing errors in the voice activity detector. Therefore, whenever a sample in the signal has zero amplitude, the same problem arises and the recommended calculations cannot measure the signal at the zero crossing point.
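The problem can be illustrated with a short sketch. The product-based counter below is an assumed form of a calculation based on the multiplication of consecutive samples, and the sign-based counter is shown only for comparison; neither is presented as the reference code or as a corrected method. Both function names are hypothetical.

```python
from typing import Sequence


def count_by_product(samples: Sequence[float]) -> int:
    """Count a crossing only when consecutive samples multiply to a negative."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0)


def count_by_sign(samples: Sequence[float]) -> int:
    """Count a crossing when the sign changes, treating zero as positive."""
    def sign(v: float) -> int:
        return 1 if v >= 0 else -1
    return sum(1 for a, b in zip(samples, samples[1:]) if sign(a) != sign(b))


# A tone sampled so that every other sample lands exactly on zero.
tone = [0, 1, 0, -1, 0, 1, 0, -1]
# A signal whose samples never land on zero.
other = [1, -2, 3, -1]

print(count_by_product(tone), count_by_sign(tone))    # 0 3
print(count_by_product(other), count_by_sign(other))  # 3 3
```

Every product in the tone involves a zero-amplitude sample, so the product-based counter reports no crossings even though the signal clearly alternates in sign.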
Without some modification to the recommendation in G.729B, when the recommended algorithm counts samples for the zero crossing rate, it will not count a sample whose amplitude is zero, resulting in an inaccurate zero crossing rate calculation. Therefore, what is needed is a method for correcting the errors associated with calculating a zero crossing rate for a voice activity detector and a method to detect tone signals based upon the correct zero crossing rate.