The International Telecommunication Union (ITU) Recommendation G.729 Annex B describes a compression scheme for communicating information about the background noise received in an incoming signal when no voice activity is detected in the signal. This compression scheme is optimized for terminals conforming to Recommendation V.70. The teachings of ITU-T G.729 and Annex B of this document are hereby incorporated into this application by reference.
Traditional speech encoders/decoders (codecs) use synthesized comfort noise to simulate the background noise of a communication link during periods when voice activity is not detected in the incoming signal. By synthesizing the background noise, little or no information about the actual background noise need be conveyed through the communication channel of the link. However, if the background noise is not statistically stationary (i.e., the distribution function varies with time), the simulated comfort noise does not provide the naturalness of the original background noise. Therefore it is desirable to occasionally send some information about the background noise to improve the quality of the synthesized noise when no speech is detected in the incoming signal. An adequate representation of the background noise, in a digitized frame (i.e., a 10 ms portion) of the incoming signal, can be achieved with as few as fifteen digital bits, substantially fewer than the number needed to adequately represent a voice signal. Recommendation G.729 Annex B suggests communicating a representation of the background noise frame only when an appreciable change has been detected with respect to the previously transmitted characterization of the background noise frame, rather than automatically transmitting this information whenever voice activity is not detected in the incoming signal. Because little or no information is communicated over the channel when there is no voice activity in the incoming signal, a substantial amount of channel bandwidth is conserved by the compression scheme.
FIG. 1 illustrates a half-duplex communication link conforming to Recommendation G.729 Annex B. At the transmitting side of the link, a VAD module 1 generates a digital output to indicate the detection of noise or voice energy in the incoming signal. An output value of one indicates the detected presence of voice activity and a value of zero indicates its absence. If the VAD 1 detects voice activity, a G.729 speech encoder 3 is invoked to encode the digital representation of the detected voice signal. However, if the VAD 1 does not detect voice activity, a Discontinuous Transmission/Comfort Noise Generator (noise) encoder 2 is used to code the digital representation of the detected background noise signal. The digital representations of these voice and background noise signals 7 are formatted into data frames containing the information from samples of the incoming analog signal taken during consecutive 10 ms periods.
At the decoder side, the received bit stream for each frame is examined. If the VAD field for the frame contains a value of one, a voice decoder 6 is invoked to reconstruct the analog signal for the frame using the information contained in the digital representation. If the VAD field for the frame contains a value of zero, a noise decoder 5 is invoked to synthesize the background noise using the information provided by the associated encoder.
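The per-frame routing described above can be sketched as follows. This is only an illustration: `route_frame`, `voice_path`, and `noise_path` are hypothetical names standing in for the voice and noise coders, and the frame payload is treated as opaque bytes.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def route_frame(
    vad_flag: int,
    frame: bytes,
    voice_path: Callable[[bytes], T],
    noise_path: Callable[[bytes], T],
) -> T:
    """Dispatch one 10 ms frame according to its VAD flag.

    vad_flag == 1 selects the voice path, vad_flag == 0 the noise path.
    The callables are hypothetical stand-ins for the actual coders.
    """
    if vad_flag == 1:
        return voice_path(frame)
    if vad_flag == 0:
        return noise_path(frame)
    raise ValueError("VAD flag must be 0 or 1")
```

The same dispatch applies on both sides of the link: the encoder routes on the VAD output, and the decoder routes on the VAD field of the received frame.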
To determine whether a frame contains voice or noise activity, the VAD 1 extracts and analyzes four parametric characteristics of the information within the frame. These characteristics are the full-band and low-band noise energies, the set of Line Spectral Frequencies (LSF), and the zero crossing rate. A difference measure between the extracted characteristics of the current frame and the running averages of the background noise characteristics is calculated for each frame. Where small differences are detected, the characteristics of the current frame are highly correlated to those of the running averages for the background noise and the current frame is more likely to contain background noise than voice activity. Where large differences are detected, the current frame is more likely to contain a signal of a different type, such as a voice signal.
An initial VAD decision regarding the content of the incoming frame is made using multi-boundary decision regions in the space of the four differential measures, as described in ITU G.729 Annex B. Thereafter, a final VAD decision is made based on the relationship between the detected energy of the current frame and that of neighboring past frames. This final decision step tends to reduce the number of state transitions.
The running averages of the background noise characteristics are updated only in the presence of background noise and not in the presence of speech. Therefore, an update occurs only when the VAD 1 has identified an incoming frame containing noise activity alone. The characteristics of the incoming frame are compared to an adaptive threshold and an update takes place only if the following three conditions are met:
    1) Ef<Ef,avg+3 dB;
    2) RC(1)<0.75; and
    3) ΔSD<0.0637;
where,
Ef=the full-band noise energy of the current frame and is calculated using the equation:
$$E_f = 10 \times \log_{10}\left[\frac{1}{240} \times R(0)\right],$$
where R(0) is the first autocorrelation coefficient;
        Ef,avg=the average full-band noise energy;
        RC(1)=the first reflection coefficient; and
        ΔSD=the difference between the measured spectral distance for the current frame and the running average value of the spectral distance, with a ΔSD of 0.0637 corresponding to 254.6 Hz.
The average full-band noise energy Ef,avg is further updated, as is a counter, Cn, of noise frames, according to the following conditions:
        Ef,avg=Emin; and
        Cn=0,
when
        Cn>128; and
        Ef,avg<Emin.
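The three update conditions and the counter rule can be sketched as follows. This is a minimal illustration, not the reference implementation of the Recommendation: `NoiseState`, `should_update`, and `apply_floor_reset` are hypothetical names, energies are assumed to be in dB, and the counter rule is read as resetting Cn to zero.

```python
from dataclasses import dataclass


@dataclass
class NoiseState:
    ef_avg: float     # running average full-band energy Ef,avg (dB assumed)
    e_min: float      # long-term minimum energy Emin (dB assumed)
    noise_count: int  # counter Cn of noise frames


def should_update(ef: float, rc1: float, delta_sd: float, state: NoiseState) -> bool:
    """Return True only when all three update conditions hold."""
    return (
        ef < state.ef_avg + 3.0   # 1) Ef < Ef,avg + 3 dB
        and rc1 < 0.75            # 2) RC(1) < 0.75
        and delta_sd < 0.0637     # 3) delta SD < 0.0637
    )


def apply_floor_reset(state: NoiseState) -> NoiseState:
    """When Cn > 128 and Ef,avg < Emin, set Ef,avg = Emin and reset Cn (assumed to zero)."""
    if state.noise_count > 128 and state.ef_avg < state.e_min:
        state.ef_avg = state.e_min
        state.noise_count = 0
    return state
```

Note that the first condition compares the current frame against Ef,avg itself, so the decision to update depends on the quantity being updated.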
When a frame of noise is detected, the running averages of the background noise characteristics are updated to reflect the contribution of the current frame using a first order Auto-Regressive (AR) scheme. Different AR coefficients are used for different parameters, and different sets of coefficients are used at the beginning of the communication or when a large change of the noise characteristics is detected. The running averages of the background noise characteristics are initialized by averaging the characteristics for the first thirty-two frames (i.e., the first 320 ms) of an established link. Frames having a full-band noise energy Ef of less than −70 dBm are not included in the count of thirty-two frames and are not used to generate the initial running averages.
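A first order AR update and the thirty-two frame initialization might look like the sketch below. The coefficient `beta` is a placeholder, since the actual AR coefficients are not given here, and `ar_update`, `ar_update_vector`, and `initial_average` are hypothetical helper names. Frames at exactly −70 dBm are kept, which is an assumption about the boundary case.

```python
from typing import Iterable, List, Optional, Tuple


def ar_update(avg: float, current: float, beta: float) -> float:
    """First order AR update: new_avg = beta * avg + (1 - beta) * current.

    beta is a placeholder coefficient in [0, 1]; the real values differ per
    parameter and per operating condition and are not reproduced here.
    """
    return beta * avg + (1.0 - beta) * current


def ar_update_vector(avg: List[float], current: List[float], beta: float) -> List[float]:
    """Apply the same AR update element-wise, e.g. to an LSF vector."""
    return [ar_update(a, c, beta) for a, c in zip(avg, current)]


def initial_average(
    frames: Iterable[Tuple[float, float]],
    needed: int = 32,
    floor_dbm: float = -70.0,
) -> Optional[float]:
    """Average `value` over the first `needed` frames whose energy is not below the floor.

    `frames` yields (ef_dbm, value) pairs. Frames below the floor are skipped
    and do not count toward `needed`. Returns None if too few frames qualify.
    """
    kept: List[float] = []
    for ef_dbm, value in frames:
        if ef_dbm < floor_dbm:
            continue  # too quiet: excluded from the count and from the average
        kept.append(value)
        if len(kept) == needed:
            return sum(kept) / needed
    return None
```

A value of `beta` close to one makes the running average track slowly; a smaller value lets it follow the current frame more quickly.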
Based on the conditions established by G.729 Annex B, described above, for updating the running averages of the background noise characteristics, there are common circumstances that cause the running averages to substantially diverge from the background noise characteristics of the current and future frames. These circumstances occur because the conditions for determining when to update the running averages are dependent upon the values of the running averages. Substantial variations of the background noise characteristics, occurring in a brief period of time, decrease the correlation between the current background noise characteristics and the expected background noise characteristics, as represented by the running averages of these characteristics. As the correlation diverges, the VAD 1 has increasing difficulty distinguishing frames of background noise from those containing voice activity. When the divergence reaches a critical point, the VAD 1 can no longer accurately distinguish the background noise from voice activity and, therefore, will no longer update the running averages of the background noise characteristics. Additionally, the VAD 1 will interpret all subsequent incoming signals as voice signals, thereby eliminating the bandwidth savings obtained by discriminating the voice and noise activity.
Without some modification to the algorithm described in Recommendation G.729 Annex B, once the running averages of the background noise characteristics and the actual characteristics become critically diverged, the VAD 1 will not perform as intended through the remaining duration of the established link. Critical divergence occurs in real-world applications when:
    1. The VAD receives a very low-level signal at the onset of the channel link and for more than 320 ms;
    2. The VAD receives a signal that is not representative of the subsequent signals at the onset of the channel link and for more than 320 ms; and
    3. The characteristic features of the background noise change rapidly.
In the first instance, the vector containing the running average of the background noise characteristics is initialized with all zeros. In the second instance, the vector contains values far removed from the real background noise characteristics. And in the third instance, the spectral distance differential, ΔSD, will never be less than 0.0637. As the VAD 1 increasingly allocates resources to the conveyance of noise through the communication channel 4, it proportionately decreases the efficiency of the channel 4. An inefficient communication channel is an expensive one. The present invention overcomes these deficiencies.
For completeness, the parameters used to characterize the background noise are described below. Let the set of autocorrelation coefficients extracted from a frame of information representing a 10 ms portion of an incoming signal be designated by:
$$\{R(i)\}_{i=0}^{12}$$
A set of line spectral frequencies is derived from the autocorrelation coefficients, in accordance with Recommendation G.729, and is designated by:
$$\{\mathrm{LSF}_i\}_{i=1}^{10}$$
As stated previously, the full-band energy Ef is obtained through the equation:
$$E_f = 10 \times \log_{10}\left[\frac{1}{240} \times R(0)\right],$$
where R(0) is the first autocorrelation coefficient.
The low-band energy El, measured over the frequency band from zero to some upper frequency limit, Fl, is obtained through the equation:
$$E_l = 10 \times \log_{10}\left[\frac{1}{240} \times h^{T} \times R \times h\right],$$
where h is the impulse response of an FIR filter with a cutoff frequency at Fl Hz and R is the Toeplitz autocorrelation matrix with the autocorrelation coefficients on each diagonal.
The normalized zero crossing rate is given by the equation:
$$ZC = \frac{1}{160} \times \sum \left[\,\left|\mathrm{sgn}(x(i)) - \mathrm{sgn}(x(i-1))\right|\,\right],$$
where x(i) is the pre-processed input signal.
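The three scalar parameters can be computed as in the following sketch. It is illustrative only: the function names are hypothetical, the filter `h` is supplied by the caller because its coefficients are not given here, and the convention sgn(0) = +1 is an assumption.

```python
import math
from typing import Sequence


def full_band_energy(r: Sequence[float]) -> float:
    """Ef = 10 * log10(R(0) / 240), with r[0] the first autocorrelation coefficient."""
    return 10.0 * math.log10(r[0] / 240.0)


def low_band_energy(r: Sequence[float], h: Sequence[float]) -> float:
    """El = 10 * log10(h^T R h / 240), where R is Toeplitz with R[i][j] = r[|i - j|].

    len(h) must not exceed len(r); h is a caller-supplied FIR impulse response.
    """
    n = len(h)
    quad = sum(h[i] * r[abs(i - j)] * h[j] for i in range(n) for j in range(n))
    return 10.0 * math.log10(quad / 240.0)


def sgn(v: float) -> int:
    """Sign function; treating zero as positive is an assumption."""
    return 1 if v >= 0 else -1


def zero_crossing_rate(x: Sequence[float], prev_sample: float) -> float:
    """ZC = (1/160) * sum |sgn(x(i)) - sgn(x(i-1))| over the samples of one frame.

    prev_sample is the last sample of the preceding frame, i.e. x(-1).
    """
    total = 0
    last = prev_sample
    for sample in x:
        total += abs(sgn(sample) - sgn(last))
        last = sample
    return total / 160.0
```

Each sign change contributes 2 to the sum, so an 80-sample frame that changes sign at every sample yields a normalized rate of 1.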
For the first thirty-two frames, the average spectral parameters of the background noise, denoted by {LSFavg}, are initialized as an average of the line spectral frequencies of the frames, and the average of the background noise zero crossing rate, denoted by ZCavg, is initialized as an average of the zero crossing rate, ZC, of the frames. The running averages of the full-band background noise energy, denoted by Ef,avg, and the background noise low-band energy, denoted by El,avg, are initialized as follows. First, the initialization procedure sets En,avg to the average of the frame energy, Ef, over the first thirty-two frames. The three parameters, {LSFavg}, ZCavg, and En,avg, are computed using only the frames that have an energy, Ef, greater than −70 dBm. Thereafter, the initialization procedure sets the parameters as follows:
If En,avg≤T1, then
        Ef,avg=En,avg
        El,avg=En,avg−53,687,091
else if T1<En,avg<T2, then
        Ef,avg=En,avg−67,108,864
        El,avg=En,avg−93,952,410
else
        Ef,avg=En,avg−134,217,728
        El,avg=En,avg−161,061,274
A long-term minimum energy parameter, Emin, is calculated as the minimum value of Ef over the previous 128 frames.
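The piecewise initialization and the Emin tracking can be sketched as below. The thresholds T1 and T2 are not given in this description, so they are passed in as parameters; the offsets are copied as opaque integers because their units and scaling are not stated. `init_energy_averages` and `MinEnergyTracker` are hypothetical names.

```python
from collections import deque
from typing import Deque, Tuple


def init_energy_averages(en_avg: int, t1: int, t2: int) -> Tuple[int, int]:
    """Return (Ef,avg, El,avg) from En,avg using the three-branch rule.

    t1 and t2 are the thresholds T1 and T2, whose values are not given here.
    """
    if en_avg <= t1:
        return en_avg, en_avg - 53_687_091
    if en_avg < t2:  # T1 < En,avg < T2
        return en_avg - 67_108_864, en_avg - 93_952_410
    return en_avg - 134_217_728, en_avg - 161_061_274


class MinEnergyTracker:
    """Track Emin as the minimum Ef over the most recent `window` frames."""

    def __init__(self, window: int = 128) -> None:
        self._history: Deque[float] = deque(maxlen=window)

    def push(self, ef: float) -> float:
        """Record the energy of a new frame and return the current Emin."""
        self._history.append(ef)
        return min(self._history)
```

The bounded deque discards the oldest energy automatically, so a low value stops influencing Emin once 128 newer frames have arrived.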
Four differential values are generated from the differences between the current frame parameters and the running averages of the background noise parameters. The spectral distortion differential value is generated as the sum of squares of the differences between the current frame vector $\{\mathrm{LSF}_i\}_{i=1}^{10}$ and the running averages of the spectral parameters {LSFavg}, and may be expressed by the equation:
$$\Delta S = \sum_{i=1}^{10} \left(\mathrm{LSF}_i - \mathrm{LSF}_{i,\mathrm{avg}}\right)^2$$
The full-band energy differential value may be expressed as:
        ΔEf=Ef,avg−Ef, where Ef is the full-band energy of the current frame.
The low-band energy differential value may be expressed as:
        ΔEl=El,avg−El, where El is the low-band energy of the current frame.
Lastly, the zero crossing rate differential value may be expressed as:
        ΔZC=ZCavg−ZC, where ZC is the zero crossing rate of the current frame.
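The four differential values can be gathered in one step, as in this minimal sketch. `FrameParams` and `differentials` are hypothetical names, and the same container type is assumed for both the current frame parameters and the running averages.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FrameParams:
    lsf: List[float]  # ten line spectral frequencies
    ef: float         # full-band energy
    el: float         # low-band energy
    zc: float         # normalized zero crossing rate


def differentials(cur: FrameParams, avg: FrameParams) -> Tuple[float, float, float, float]:
    """Return (delta S, delta Ef, delta El, delta ZC) for one frame.

    delta S is the sum of squared LSF differences; the other three are the
    running average minus the current frame value.
    """
    delta_s = sum((c - a) ** 2 for c, a in zip(cur.lsf, avg.lsf))
    return delta_s, avg.ef - cur.ef, avg.el - cur.el, avg.zc - cur.zc
```

With this sign convention, a frame louder than the running average produces negative energy differentials.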