The International Telecommunication Union (ITU) Recommendation G.729 Annex B describes a compression scheme for communicating information about the background noise received in an incoming signal when no voice is detected in the signal. This compression scheme is optimized for terminals conforming to Recommendation V.70. The teachings of ITU-T G.729 and Annex B of the Recommendation are hereby incorporated into this application by reference.
Traditional speech encoders/decoders (codecs) use synthesized comfort noise to simulate the background noise of a communication link during periods when voice is not detected in the incoming signal. By synthesizing the background noise, little or no information about the actual background noise need be conveyed through the communication channel of the link. However, if the background noise is not statistically stationary (i.e., the distribution function varies with time), the simulated comfort noise does not provide the naturalness of the original background noise. Therefore it is desirable to occasionally send some information about the background noise to improve the quality of the synthesized noise when no speech is detected in the incoming signal. An adequate representation of the background noise, in a digitized frame (i.e., a 10 ms portion) of the incoming signal, can be achieved with as few as fifteen digital bits, substantially fewer than the number needed to adequately represent a voice signal. Recommendation G.729 Annex B suggests communicating a representation of the background noise frame only when an appreciable change has been detected with respect to the previously transmitted characterization of the background noise frame, rather than automatically transmitting this information whenever voice is not detected in the incoming signal. Because little or no information is communicated over the channel when there is no voice in the incoming signal, a substantial amount of channel bandwidth is conserved by the compression scheme.
FIG. 1 illustrates a half-duplex communication link conforming to Recommendation G.729 Annex B. At the transmitting side of the link, a VAD module 1 generates a digital output to indicate the detection of noise or voice in the incoming signal. An output value of one indicates the detected presence of voice and a value of zero indicates its absence. If the VAD 1 detects voice, a G.729 speech encoder 3 is invoked to encode the digital representation of the detected voice signal. However, if the VAD 1 does not detect voice, a Discontinuous Transmission/Comfort Noise Generator (noise) encoder 2 is used to code the digital representation of the detected background noise signal. The digital representations of these voice and background noise signals 7 are formatted into data frames containing the information from samples of the incoming signal taken during consecutive 10 ms periods.
At the decoder side, the received bit stream for each frame is examined. If the VAD field for the frame contains a value of one, a voice decoder 6 is invoked to reconstruct the signal for the frame using the information contained in the digital representation. If the VAD field for the frame contains a value of zero, a noise decoder 5 is invoked to synthesize the background noise using the information provided by the associated encoder.
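The routing described above can be sketched in a few lines. This is only an illustration of the one-bit VAD dispatch, not the codec itself: the `Frame` container and the four callables (`speech_encode`, `noise_encode`, `speech_decode`, `noise_decode`) are hypothetical stand-ins for encoder 3, noise encoder 2, voice decoder 6, and noise decoder 5.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    """One 10 ms frame as sent over the channel (hypothetical container)."""
    vad: int        # 1 = voice detected, 0 = no voice detected
    payload: bytes  # encoded parameters for this frame


def encode_frame(samples, vad_flag, speech_encode, noise_encode):
    """Route one frame to the speech encoder or the noise encoder."""
    if vad_flag == 1:
        return Frame(vad=1, payload=speech_encode(samples))
    return Frame(vad=0, payload=noise_encode(samples))


def decode_frame(frame, speech_decode, noise_decode):
    """Pick the decoder from the VAD field carried with the frame."""
    if frame.vad == 1:
        return speech_decode(frame.payload)
    return noise_decode(frame.payload)
```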
To determine whether a frame contains voice or noise, the VAD 1 extracts and analyzes four parametric characteristics of the information within the frame. These characteristics are the full-band energy, the low-band energy, the set of Line Spectral Frequencies (LSF), and the zero crossing rate. A difference measure between the extracted characteristics of the current frame and the running averages of the background noise characteristics is calculated for each frame. Where small differences are detected, the characteristics of the current frame are highly correlated with those of the running averages for the background noise, and the current frame is more likely to contain background noise than voice. Where large differences are detected, the current frame is more likely to contain a signal of a different type, such as a voice signal.
An initial VAD decision regarding the content of the incoming frame is made using multi-boundary decision regions in the space of the four differential measures, as described in ITU G.729 Annex B. Thereafter, a final VAD decision is made based on the relationship between the detected energy of the current frame and that of neighboring past frames. This final decision step tends to reduce the number of state transitions.
The running averages of the background noise characteristics are updated only in the presence of background noise and not in the presence of speech. The characteristics of the incoming frame are compared to an adaptive threshold and an update takes place only if certain conditions are met, as described in Recommendation G.729 Annex B.
When the specified conditions are met, the running averages of the background noise characteristics are updated to reflect the contribution of the current frame using a first order Auto-Regressive (AR) scheme. Different AR coefficients are used for different parameters, and different sets of coefficients are used at the beginning of the communication or when a large change of the noise characteristics is detected. These AR coefficients are related to the running averages of the four background noise characteristics, $\{\overline{LSF}_i\}_{i=1}^{10}$, $\bar{E}_f$, $\bar{E}_l$, and $\overline{ZC}$, in the following way.
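Let $\beta_{E_f}$ identify the AR coefficient for the update of $\bar{E}_f$, $\beta_{E_l}$ identify the AR coefficient for the update of $\bar{E}_l$, $\beta_{ZC}$ identify the AR coefficient for the update of $\overline{ZC}$, and $\beta_{LSF}$ identify the AR coefficient for the update of $\{\overline{LSF}_i\}_{i=1}^{p}$. The AR update is done according to the equations:

$$\bar{E}_f = \beta_{E_f} \cdot \bar{E}_f + (1 - \beta_{E_f}) \cdot E_f; \tag{1}$$

$$\bar{E}_l = \beta_{E_l} \cdot \bar{E}_l + (1 - \beta_{E_l}) \cdot E_l; \tag{2}$$

$$\overline{ZC} = \beta_{ZC} \cdot \overline{ZC} + (1 - \beta_{ZC}) \cdot ZC; \text{ and} \tag{3}$$

$$\overline{LSF}_i = \beta_{LSF} \cdot \overline{LSF}_i + (1 - \beta_{LSF}) \cdot LSF_i. \tag{4}$$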
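As a minimal sketch of equations (1) through (4), the update can be written as one helper applied to each characteristic. The dictionary layout and the coefficient values used in any call are assumptions made for illustration; the actual coefficients are those specified by Recommendation G.729 Annex B and are not reproduced here.

```python
def ar_update(avg, current, beta):
    """First-order AR update: beta * avg + (1 - beta) * current."""
    return beta * avg + (1.0 - beta) * current


def update_noise_averages(avg, frame, betas):
    """Apply equations (1)-(4) and return the new running averages.

    avg, frame: dicts with float entries 'Ef', 'El', 'ZC' and a list 'LSF'
                (hypothetical layout chosen for this sketch).
    betas:      dict with one AR coefficient per key 'Ef', 'El', 'ZC', 'LSF'.
    """
    return {
        "Ef": ar_update(avg["Ef"], frame["Ef"], betas["Ef"]),   # eq. (1)
        "El": ar_update(avg["El"], frame["El"], betas["El"]),   # eq. (2)
        "ZC": ar_update(avg["ZC"], frame["ZC"], betas["ZC"]),   # eq. (3)
        "LSF": [ar_update(a, c, betas["LSF"])                   # eq. (4)
                for a, c in zip(avg["LSF"], frame["LSF"])],
    }
```

A coefficient close to one makes the running average follow the current frame slowly, while a smaller coefficient lets it track changes in the noise more quickly.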
The running averages of the background noise characteristics are initialized by averaging the characteristics for the first thirty-two frames (i.e., the first 320 ms) of an established link. If all of the first thirty-two frames have full-band energies $E_f$ of less than 15 dB, then the four background noise characteristics, $\{\overline{LSF}_i\}_{i=1}^{10}$, $\bar{E}_f$, $\bar{E}_l$, and $\overline{ZC}$, are initialized to zero.
Based on the conditions established by G.729 Annex B, described above, for updating the running averages of the background noise characteristics, there are common circumstances that cause the running averages to substantially diverge from the background noise characteristics of the current and future frames. These circumstances occur because the conditions for determining when to update the running averages are dependent upon the values of the running averages. Substantial variations of the background noise characteristics, occurring in a brief period of time, decrease the correlation between the current background noise characteristics and the expected background noise characteristics, as represented by the running averages of these characteristics. As the correlation diverges, the VAD 1 has increasing difficulty distinguishing frames of background noise from those containing voice. When the divergence reaches a critical point, the VAD 1 can no longer accurately distinguish the background noise from voice and, therefore, will no longer update the running averages of the background noise characteristics. Additionally, the VAD 1 will interpret all subsequent incoming signals as voice signals, thereby eliminating the bandwidth savings obtained by discriminating the voice and noise.
Without some modification to the algorithm described in Recommendation G.729 Annex B, once the running averages of the background noise characteristics and the actual characteristics become critically diverged, the VAD 1 will not perform as intended through the remaining duration of the established link. Critical divergence occurs in real-world applications when:
1. The VAD receives a very low-level signal at the onset of the channel link and for more than 320 ms;
2. The VAD receives a signal that is not representative of the background noise at the onset of the channel link and for more than 320 ms; and
3. The characteristic features of the background noise change rapidly.
In the first instance, the beginning of the vector containing the running average of the background noise characteristics is initialized with all zeros. In the second instance, the vector contains values far different from the real background noise characteristics. And in the third instance, the spectral distortion, ΔS, will never be less than 83, as is required to cause an update. As the VAD 1 increasingly allocates resources to the conveyance of noise through the communication channel 4, it proportionately decreases the efficiency of the channel 4. An inefficient communication channel is an expensive one. The present invention overcomes these deficiencies.
For completeness, the four parameters used to characterize the background noise are described below. Let the set of autocorrelation coefficients extracted from a frame of information representing a 10 ms portion of an incoming signal be designated by:

$$\{R(i)\}_{i=0}^{12}$$

A set of line spectral frequencies is derived from the autocorrelation coefficients, in accordance with Recommendation G.729, and is designated by:

$$\{LSF_i\}_{i=1}^{10}$$

As stated previously, the full-band energy $E_f$ is obtained through the equation:

$$E_f = 10 \times \log_{10}\left[\frac{1}{240} \times R(0)\right],$$

where $R(0)$ is the first autocorrelation coefficient.

The low-band energy, measured over the frequency band from zero to some upper frequency limit, $F_1$, is obtained through the equation:

$$E_l = 10 \times \log_{10}\left[\frac{1}{240} \times h^{T} \times R \times h\right],$$

where $h$ is the impulse response of an FIR filter with a cutoff frequency at $F_1$ Hz and $R$ is the Toeplitz autocorrelation matrix with the autocorrelation coefficients on each diagonal.

The normalized zero crossing rate is given by the equation:

$$ZC = \frac{1}{160} \times \sum_{i} \left| \mathrm{sgn}(x(i)) - \mathrm{sgn}(x(i-1)) \right|,$$

where $x(i)$ is the pre-processed input signal.
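The three scalar parameters defined above can be computed as in the following sketch. It takes the autocorrelation coefficients and the filter taps as given; how they are obtained (windowing, filter design) is outside this example. The convention sgn(0) = +1, the `prev_sample` argument, and all numeric inputs in any call are assumptions made for illustration.

```python
import math


def full_band_energy(r0):
    """E_f = 10 * log10(R(0) / 240), with R(0) the first autocorrelation value."""
    return 10.0 * math.log10(r0 / 240.0)


def low_band_energy(r, h):
    """E_l = 10 * log10(h^T R h / 240), where R is the Toeplitz matrix
    R[i][j] = r[|i - j|]. Requires len(r) >= len(h)."""
    n = len(h)
    quad = sum(h[i] * r[abs(i - j)] * h[j]
               for i in range(n) for j in range(n))
    return 10.0 * math.log10(quad / 240.0)


def sign(v):
    """Sign function; sgn(0) = +1 is an assumption of this sketch."""
    return 1 if v >= 0 else -1


def zero_crossing_rate(x, prev_sample=0.0):
    """ZC = (1/160) * sum |sgn(x(i)) - sgn(x(i-1))| over the frame samples.

    prev_sample stands in for x(-1), the last sample of the previous frame.
    """
    total = 0
    last = prev_sample
    for s in x:
        total += abs(sign(s) - sign(last))
        last = s
    return total / 160.0
```

Each sign change contributes 2 to the sum, so with 80 samples per frame the 1/160 factor keeps the result between zero and one.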
For the first thirty-two frames, the average spectral parameters of the background noise, denoted by $\{\overline{LSF}_i\}_{i=1}^{10}$, are initialized as an average of the line spectral frequencies of the frames, and the average of the background noise zero crossing rate, denoted by $\overline{ZC}$, is initialized as an average of the zero crossing rate, $ZC$, of the frames. The running averages of the full-band background noise energy, denoted by $\bar{E}_f$, and the background noise low-band energy, denoted by $\bar{E}_l$, are initialized as follows. First, the initialization procedure calculates $\bar{E}_n$, which is the average frame energy, $E_f$, over the first thirty-two frames. Note that the three parameters, $\{\overline{LSF}_i\}_{i=1}^{10}$, $\overline{ZC}$, and $\bar{E}_n$, are averaged only over the frames that have an energy, $E_f$, greater than 15 dB. Thereafter, the initialization procedure sets the parameters as follows:
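If $\bar{E}_n \le 671{,}088{,}640$, then
    $\bar{E}_f = \bar{E}_n$
    $\bar{E}_l = \bar{E}_n - 53{,}687{,}091$
else if $671{,}088{,}640 < \bar{E}_n < 738{,}197{,}504$, then
    $\bar{E}_f = \bar{E}_n - 67{,}108{,}864$
    $\bar{E}_l = \bar{E}_n - 93{,}952{,}410$
else
    $\bar{E}_f = \bar{E}_n - 134{,}217{,}728$
    $\bar{E}_l = \bar{E}_n - 161{,}061{,}274$

A long-term minimum energy parameter, $E_{min}$, is calculated as the minimum value of $E_f$ over the previous 128 frames.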
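The three-way initialization and the long-term minimum can be sketched as follows. The integer constants are copied verbatim from the text and treated as opaque values, since their scaling is not stated here. The `MinEnergyTracker` class is a hypothetical helper, and counting the newest frame as part of the 128-frame window is an assumption.

```python
from collections import deque

T_LOW = 671_088_640    # first threshold on the average energy, from the text
T_HIGH = 738_197_504   # second threshold on the average energy, from the text


def init_energies(e_n):
    """Return (Ef_avg, El_avg) initialized from the average frame energy e_n."""
    if e_n <= T_LOW:
        return e_n, e_n - 53_687_091
    elif e_n < T_HIGH:  # here e_n > T_LOW already holds
        return e_n - 67_108_864, e_n - 93_952_410
    else:
        return e_n - 134_217_728, e_n - 161_061_274


class MinEnergyTracker:
    """Tracks E_min, the minimum full-band energy over a sliding window."""

    def __init__(self, window=128):
        # A bounded deque discards the oldest energy automatically.
        self._history = deque(maxlen=window)

    def push(self, e_f):
        """Record the energy of the newest frame and return the current E_min."""
        self._history.append(e_f)
        return min(self._history)
```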
Four differential values are generated from the differences between the current frame parameters and the running averages of the background noise parameters. The spectral distortion differential value is generated as the sum of squares of the differences between the current frame $\{LSF_i\}_{i=1}^{10}$ vector and the running averages of the background noise spectral parameters, $\{\overline{LSF}_i\}_{i=1}^{10}$, and may be expressed by the equation:

$$\Delta S = \sum_{i=1}^{10} \left(LSF_i - \overline{LSF}_i\right)^2$$

The full-band energy differential value may be expressed as:
$\Delta E_f = \bar{E}_f - E_f$, where $E_f$ is the full-band energy of the current frame.
The low-band energy differential value may be expressed as:
$\Delta E_l = \bar{E}_l - E_l$, where $E_l$ is the low-band energy of the current frame.
Lastly, the zero crossing rate differential value may be expressed as:
$\Delta ZC = \overline{ZC} - ZC$, where $ZC$ is the zero crossing rate of the current frame.
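The four differential values can be computed together, as in this short sketch. The dictionary layout for the frame parameters and the running averages is an assumption chosen for illustration; the formulas follow the expressions given above.

```python
def differential_values(frame, avg):
    """Return (dS, dEf, dEl, dZC) for the current frame.

    frame, avg: dicts with float entries 'Ef', 'El', 'ZC' and a list 'LSF'
                (hypothetical layout chosen for this sketch).
    """
    # Spectral distortion: sum of squared LSF differences.
    d_s = sum((lsf - lsf_avg) ** 2
              for lsf, lsf_avg in zip(frame["LSF"], avg["LSF"]))
    # Energy and zero-crossing differences: running average minus current value.
    d_ef = avg["Ef"] - frame["Ef"]
    d_el = avg["El"] - frame["El"]
    d_zc = avg["ZC"] - frame["ZC"]
    return d_s, d_ef, d_el, d_zc
```

Small magnitudes in all four values indicate a frame that resembles the tracked background noise, while large values suggest a different type of signal, such as voice.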