In speech communications, speech processing is performed mainly by speech codecs. Because a speech signal is approximately stationary over short intervals, speech codecs generally process it in frames of 10 to 30 ms each. The earliest speech codecs were fixed-rate, that is, each codec had only one coding rate. For example, the coding rate of a G.729 speech codec is 8 kbit/s, and the coding rate of a G.728 speech codec is 16 kbit/s. Among these traditional fixed-rate codecs, those with a higher coding rate guarantee coding quality more easily but occupy more communication channel resources, while those with a lower coding rate occupy fewer channel resources but guarantee coding quality less easily.
The speech signal includes both a voice signal generated by human speaking and a silent signal generated by gaps in human speaking. The coding rate of the voice signal is referred to as the speech coding rate (here, speech refers specifically to the signal of human speaking), and the coding rate of the background noise is referred to as the noise coding rate. In speech communications, only the useful voice signal is of interest; transmitting the silent signal is undesirable, and omitting it reduces the transmission bandwidth. However, if only the voice signal is coded and transmitted and the silent signal is not, the background noise becomes discontinuous. A listener at the receiving end finds this uncomfortable, and the effect is more noticeable when the background noise is stronger, so that sometimes the speech is difficult to understand. To solve this problem, the silent signal needs to be coded and transmitted even when no one is speaking. Silence compression technology is therefore introduced into speech codecs. In silence compression, the background noise signal is coded at a lower coding rate to reduce communications bandwidth efficiently, while the voice signal generated by human speaking is coded at a higher coding rate to guarantee communications quality.
At present, an approach for generating an excitation signal for background noise for a G.729B speech codec adds a Discontinuous Transmission (DTX)/Comfort Noise Generator (CNG) system, i.e., a system for processing background noise, to the prototype of the G.729B speech codec. The system processes narrowband signals sampled at 8 kHz, with a frame length of 10 ms. According to the CNG algorithm, a level-controllable pseudo white noise excites an interpolated Linear Predictive Coding (LPC) synthesis filter to obtain comfortable background noise, where the level of the excitation signal and the coefficients of the LPC filter are obtained from the previous Silence Insertion Descriptor (SID) frame.
The excitation signal is a pseudo white noise excitation ex(n), which is a mixture of a speech excitation ex1(n) and a Gaussian white noise excitation ex2(n). The gain of ex1(n) is relatively small, and ex1(n) serves to make the transition from speech to non-speech (such as noise) more natural. Once the pseudo white noise excitation ex(n) is obtained, it is used to excite the synthesis filter to obtain comfortable background noise.
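As a minimal sketch of how an excitation drives an LPC synthesis filter, the following all-pole filter may help. It is illustrative only: the function name `lpc_synthesize`, the sign convention A(z) = 1 + Σ a_k z^(-k), and the handling of filter memory are assumptions made here, not details taken from the codec.

```python
def lpc_synthesize(excitation, lpc_coeffs, memory=None):
    """Filter an excitation through the all-pole filter 1/A(z).

    Assumed convention (not from the source text):
    A(z) = 1 + a_1*z^-1 + ... + a_p*z^-p, so that
    s(n) = ex(n) - sum_k a_k * s(n - k).
    """
    order = len(lpc_coeffs)
    # history[0] holds s(n-1), history[1] holds s(n-2), and so on.
    history = list(memory) if memory is not None else [0.0] * order
    output = []
    for sample in excitation:
        value = sample - sum(a * s for a, s in zip(lpc_coeffs, history))
        output.append(value)
        history = ([value] + history)[:order]
    # The final history can be passed back in as memory for the next frame.
    return output, history


# Hypothetical one-pole example: A(z) = 1 - 0.5*z^-1 applied to an impulse.
noise, state = lpc_synthesize([1.0, 0.0, 0.0], [-0.5])
```

Returning the filter state alongside the output lets consecutive frames be synthesized without a discontinuity at the frame boundary.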
The process for generating the excitation signal is as follows.
First, a target excitation gain $\tilde{G}_t$ is defined as the square root of the average energy of the current frame excitation. $\tilde{G}_t$ is obtained with the following smoothing algorithm:
$$\tilde{G}_t = \begin{cases} \tilde{G}_{sid} & \text{if } Vad_{t-1} = 1 \\ \dfrac{7}{8}\,\tilde{G}_{t-1} + \dfrac{1}{8}\,\tilde{G}_{sid} & \text{otherwise} \end{cases}$$
where $\tilde{G}_{sid}$ is the gain of a decoded SID frame.
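The smoothing rule can be sketched in a few lines. The function and parameter names below are hypothetical and chosen only for illustration; the arithmetic follows the two cases of the rule above.

```python
def smooth_target_gain(prev_gain, sid_gain, prev_vad):
    """Return the target excitation gain for the current frame.

    prev_gain: target gain of the previous frame
    sid_gain:  gain of the decoded SID frame
    prev_vad:  VAD decision of the previous frame (1 = speech)
    """
    if prev_vad == 1:
        # First noise frame after speech: take the SID gain directly.
        return sid_gain
    # Otherwise move one eighth of the way toward the SID gain.
    return 0.875 * prev_gain + 0.125 * sid_gain


# Hypothetical values: the gain jumps to the SID gain right after speech,
# then decays smoothly toward a new, lower SID gain.
gain = smooth_target_gain(prev_gain=0.0, sid_gain=80.0, prev_vad=1)
for _ in range(3):
    gain = smooth_target_gain(gain, sid_gain=40.0, prev_vad=0)
```

Each smoothing step closes one eighth of the remaining distance to the SID gain, so the level of the comfort noise changes gradually rather than abruptly.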
For each of the two sub-frames into which the 80 sampling points of a frame are divided, the excitation signal of the CNG module may be synthesized by:
(1) randomly selecting a pitch lag in the range [40, 103];
(2) randomly selecting positions and signs of non-zero pulses in the fixed codebook vector of the sub-frame (the structure of the positions and signs of the non-zero pulses is the same as that of the G.729 speech codec); and
(3) selecting a self-adaptive codebook excitation signal with a gain, labeling the self-adaptive codebook excitation signal as $e_a(n)$, $n = 0 \ldots 39$, labeling the selected fixed codebook excitation signal as $e_f(n)$, $n = 0 \ldots 39$, and then calculating a self-adaptive codebook gain $G_a$ and a fixed codebook gain $G_f$ based on the energy of the sub-frame:
$$\frac{1}{40}\sum_{n=0}^{39}\bigl(G_a \times e_a(n) + G_f \times e_f(n)\bigr)^2 = \tilde{G}_t^2$$
where $G_f$ may be selected as a negative value.
It is defined that
$$E_a = \sum_{n=0}^{39} e_a(n)^2, \qquad I = \sum_{n=0}^{39} e_a(n)\,e_f(n).$$
According to the excitation structure of Algebraic Code-Excited Linear Prediction (ACELP), it is known that
$$\sum_{n=0}^{39} e_f(n)^2 = 4.$$
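If the self-adaptive codebook gain $G_a$ is fixed, the energy equation for $\tilde{G}_t$ becomes a second-order equation in $G_f$:
$$G_f^2 + \frac{G_a \times I}{2}\,G_f + \frac{E_a \times G_a^2 - K}{4} = 0$$
where $K = 40\,\tilde{G}_t^2$, the value for which this equation agrees with the energy equation.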
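The step from the energy constraint to this quadratic can be made explicit. The expansion below is a reading aid; it assumes that $K$ stands for $40\,\tilde{G}_t^2$, which is the value that makes the two equations agree, and that the cross term $I$ is taken over the 40 samples of the sub-frame.

$$
\begin{aligned}
\frac{1}{40}\sum_{n=0}^{39}\bigl(G_a e_a(n) + G_f e_f(n)\bigr)^2 &= \tilde{G}_t^2 \\
G_a^2 \sum_{n=0}^{39} e_a(n)^2 + 2 G_a G_f \sum_{n=0}^{39} e_a(n)\,e_f(n) + G_f^2 \sum_{n=0}^{39} e_f(n)^2 &= 40\,\tilde{G}_t^2 \\
G_a^2 E_a + 2 G_a I\, G_f + 4\,G_f^2 &= K \\
G_f^2 + \frac{G_a I}{2}\,G_f + \frac{E_a G_a^2 - K}{4} &= 0
\end{aligned}
$$

The third line uses the ACELP property that the fixed codebook vector has energy 4, and the last line divides by 4.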
The value of Ga may be defined to ensure that the above equation has solutions. Further, the application of some self-adaptive codebook gains with large values may be restricted. Thus, the self-adaptive codebook gain Ga may be randomly selected in the following range:
$$\left[\,0,\ \mathrm{Max}\left\{0.5,\ \sqrt{\frac{K}{A}}\right\}\right], \quad \text{with } A = E_a - I^2/4$$
where the root with the smallest absolute value among the roots of the equation
$$\frac{1}{40}\sum_{n=0}^{39}\bigl(G_a \times e_a(n) + G_f \times e_f(n)\bigr)^2 = \tilde{G}_t^2$$
is used as the value of $G_f$.
Finally, the excitation signal for the G.729 speech codec may be constructed with the following equation:
$$ex_1(n) = G_a \times e_a(n) + G_f \times e_f(n), \quad n = 0 \ldots 39$$
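The gain calculation for one sub-frame can be sketched as follows. This is an illustration under stated assumptions, not the codec's reference code: the function names are hypothetical, the adaptive gain is passed in rather than drawn at random, and the fixed codebook energy is computed from the vector instead of being hard-coded to 4.

```python
import math


def solve_fixed_gain(ea, ef, ga, target_gain):
    """Solve the sub-frame energy constraint for the fixed codebook gain.

    Returns the real root with the smallest absolute value, or None
    when the quadratic has no real root for the given adaptive gain.
    """
    n = len(ea)
    energy_a = sum(x * x for x in ea)            # E_a
    energy_f = sum(x * x for x in ef)            # 4 for the ACELP structure
    cross = sum(x * y for x, y in zip(ea, ef))   # I
    # energy_f*gf^2 + 2*ga*cross*gf + ga^2*energy_a - n*target^2 = 0
    a = energy_f
    b = 2.0 * ga * cross
    c = ga * ga * energy_a - n * target_gain ** 2
    disc = b * b - 4.0 * a * c
    if a == 0 or disc < 0:
        return None
    root = math.sqrt(disc)
    candidates = ((-b + root) / (2.0 * a), (-b - root) / (2.0 * a))
    return min(candidates, key=abs)


def build_speech_excitation(ea, ef, ga, gf):
    """Form ex1(n) = ga*ea(n) + gf*ef(n) for one sub-frame."""
    return [ga * x + gf * y for x, y in zip(ea, ef)]


# Hypothetical sub-frame: flat adaptive excitation and four signed pulses.
ea = [1.0] * 40
ef = [0.0] * 40
for pos, sign in ((0, 1.0), (10, -1.0), (20, 1.0), (30, -1.0)):
    ef[pos] = sign
gf = solve_fixed_gain(ea, ef, ga=0.5, target_gain=1.0)
ex1 = build_speech_excitation(ea, ef, 0.5, gf)
```

Returning `None` when no real root exists makes the constraint on the adaptive gain visible: a caller would then pick a smaller adaptive gain and try again.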
The excitation ex(n) may be synthesized in the following manner.
It is assumed that $E_1$ is the energy of $ex_1(n)$, $E_2$ is the energy of $ex_2(n)$, and $E_3$ is the dot product of $ex_1(n)$ and $ex_2(n)$:
$$E_1 = \sum ex_1^2(n), \qquad E_2 = \sum ex_2^2(n), \qquad E_3 = \sum ex_1(n) \cdot ex_2(n)$$
where the calculated number of dots exceeds the value of themselves.
It is assumed that $\alpha$ and $\beta$ are the proportional coefficients of $ex_1(n)$ and $ex_2(n)$, respectively, in the mixed excitation, where $\alpha$ is set to 0.6 and $\beta$ is determined from the following quadratic equation:
$$\beta^2 E_2 + 2\alpha\beta E_3 + (\alpha^2 - 1)E_1 = 0, \quad \text{with } \beta > 0.$$
If there is no solution for $\beta$, $\beta$ is set to 0 and $\alpha$ is set to 1. The final excitation $ex(n)$ for the CNG module becomes:
$$ex(n) = \alpha\,ex_1(n) + \beta\,ex_2(n)$$
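The mixing step can be sketched as below. Note that the quadratic is equivalent to requiring that the mixed excitation have the same energy as ex1(n), since the energy of the mixture is α²E1 + 2αβE3 + β²E2. The function name and the return convention are assumptions made for illustration.

```python
import math


def mix_excitations(ex1, ex2, alpha=0.6):
    """Mix the speech-like excitation ex1 with the Gaussian excitation ex2.

    Solves beta^2*E2 + 2*alpha*beta*E3 + (alpha^2 - 1)*E1 = 0 for a
    positive beta. If none exists, falls back to alpha = 1, beta = 0.
    Returns (mixed_excitation, alpha, beta).
    """
    e1 = sum(x * x for x in ex1)
    e2 = sum(x * x for x in ex2)
    e3 = sum(x * y for x, y in zip(ex1, ex2))
    beta = 0.0
    if e2 > 0:
        # Quarter discriminant of the quadratic in beta.
        disc = (alpha * e3) ** 2 - e2 * (alpha * alpha - 1.0) * e1
        if disc >= 0:
            # The larger root is the only one that can be positive
            # when alpha < 1.
            beta = (-alpha * e3 + math.sqrt(disc)) / e2
    if beta <= 0:
        alpha, beta = 1.0, 0.0
    mixed = [alpha * x + beta * y for x, y in zip(ex1, ex2)]
    return mixed, alpha, beta


# Hypothetical short vectors that happen to be orthogonal (E3 = 0).
ex, a_used, b_used = mix_excitations([1.0, -1.0, 1.0, -1.0],
                                     [1.0, 1.0, -1.0, -1.0])
```

With orthogonal inputs of equal energy, the positive root reduces to the square root of 1 − α², so the mixture keeps the energy of ex1(n).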
The above discussion illustrates the principle of generating an excitation signal for background noise for the CNG module of the G.729B speech codec.
According to the implementation process described above, a speech excitation ex1(n) is added when generating an excitation signal for background noise for the G.729B speech codec. However, the speech excitation ex1(n) is added in form only: its actual contents, such as the lag of the self-adaptive codebook and the positions and signs of the fixed codebook pulses, are all generated randomly, which results in strong randomness. Therefore, the correlation between the excitation signal for background noise and the excitation signal of the previous speech frame is poor, so that the transition from a synthesized speech signal to a synthesized background noise signal is unnatural and uncomfortable for listeners.