During voice transmission, speech coding techniques are generally used to compress voice messages so that the capacity of a communication system can be improved.
During voice communication, speech occupies only about 40% of the time, with the remainder occupied by silence or background noise. The parties to a call are generally concerned only with the content of the speech, not with the periods that contain only silence or background noise. Therefore, when voice messages are compressed, different methods are used to encode and transmit speech, silence, and background noise so as to further improve the capacity of the communication system. Discontinuous Transmission System/Comfortable Noise Generation (DTX/CNG) is one such technique.
A frame obtained by encoding the background noise with the DTX/CNG technology is generally referred to as a Silence Insertion Descriptor (SID) frame. An ordinary speech frame contains a spectral parameter, a signal energy gain parameter, and parameters associated with a fixed codebook and an adaptive codebook. Upon receiving a speech frame, the decoder can recover the original speech data from this information. An SID frame, by contrast, generally contains only a spectral parameter and a signal energy gain parameter, because users generally do not care what information the background noise contains. An SID frame therefore needs to deliver only a small amount of reference information, namely the spectral parameter and the signal energy gain parameter. Based on this reference information, the decoder can recover the background noise so that the user has a general sense of the environment the other party is in, and the listening quality experienced by the user is not noticeably degraded. During voice transmission, an SID frame is sent once every several frames. A frame in which no coded parameter is sent, or in which no parameter is coded at all, is generally referred to as a NO_DATA frame.
The DTX/CNG technology is widely applied in recent speech coding standards developed by various organizations and institutions.
The DTX/CNG technology is adopted in the Adaptive Multi-Rate (AMR) speech coding standard developed by the Third Generation Partnership Project (3GPP). SID frames are sent at a fixed interval of 8 frames. The parameters necessary for noise synthesis are estimated by linear interpolation between the parameters (that is, the signal energy gain parameter and the spectral parameter) decoded from two consecutively received SID frames, which may be given by:
$$P_{n+k} = \frac{8-k}{8}\,P_{sid}(n-1) + \frac{k}{8}\,P_{sid}(n), \qquad k = 1, \ldots, 8$$
where $P_{n+k}$ represents the estimated value of the CNG parameter for the $k$th frame subsequent to the $n$th SID frame, $P_{sid}(n-1)$ represents the parameter for the $(n-1)$th SID frame received by the decoder, and $P_{sid}(n)$ represents the parameter for the $n$th SID frame received by the decoder. When $n = 0$, $P_{sid}(-1)$ represents the average value of the spectral parameters and signal energy gain parameters for the 8 speech frames in the tail period.
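The interpolation above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the AMR reference implementation: the function name `interpolate_cng_params` and the representation of a parameter set as a flat list of floats are hypothetical and chosen only for the example.

```python
def interpolate_cng_params(p_sid_prev, p_sid_new, interval=8):
    """Linearly interpolate CNG parameters between two SID frames.

    p_sid_prev: parameters of the (n-1)th SID frame (list of floats, assumed layout)
    p_sid_new:  parameters of the nth SID frame (same layout)
    interval:   number of frames between SID frames (8 in the text above)

    Returns a list of `interval` parameter vectors, one for each k = 1..interval.
    """
    if len(p_sid_prev) != len(p_sid_new):
        raise ValueError("parameter vectors must have the same length")

    frames = []
    for k in range(1, interval + 1):
        w_prev = (interval - k) / interval  # weight (8 - k) / 8
        w_new = k / interval                # weight k / 8
        frames.append(
            [w_prev * a + w_new * b for a, b in zip(p_sid_prev, p_sid_new)]
        )
    return frames


# Example with made-up two-element parameter vectors.
estimates = interpolate_cng_params([0.0, 8.0], [8.0, 0.0])
print(estimates[0])   # k = 1: mostly the previous SID frame
print(estimates[-1])  # k = 8: equal to the new SID frame
```

Because the weights are tied to a fixed interval of 8 frames, the sketch only makes sense when the decoder knows in advance how many frames separate two SID frames.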
The DTX/CNG technology is also adopted in the silence compression scheme defined for the conjugate-structure algebraic-code-excited linear prediction speech codec, a speech coding standard developed by the International Telecommunication Union (ITU). The encoder adaptively determines whether to send an SID frame based on changes in the noise parameter. The interval between two consecutive SID frames is at least 20 ms and has no upper limit. The CNG algorithm used at the decoder may be given as follows.
For reconstruction of the signal energy gain parameter:
$$\tilde{G}_t = \begin{cases} \tilde{G}_{sid\_new}, & \text{if the previous frame is a speech frame;} \\ \dfrac{7}{8}\,\tilde{G}_{t-1} + \dfrac{1}{8}\,\tilde{G}_{sid\_new}, & \text{if the previous frame is not a speech frame.} \end{cases}$$
For reconstruction of the spectral parameter:
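$$\mathrm{LSF}_{t,sub\_1} = \begin{cases} \dfrac{1}{2}\left(\mathrm{LSF}_{sid\_last} + \mathrm{LSF}_{sid\_new}\right), & \text{if the previous frame is a speech frame;} \\ \mathrm{LSF}_{sid\_new}, & \text{if the previous frame is not a speech frame} \end{cases}$$

$$\mathrm{LSF}_{t,sub\_2} = \mathrm{LSF}_{sid\_new}$$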
where $\tilde{G}_{sid\_new}$ represents the signal energy gain parameter decoded from the SID frame newly received at the decoder, $\mathrm{LSF}_{sid\_last}$ represents the spectral parameter decoded from the SID frame previously received at the decoder, and $\mathrm{LSF}_{sid\_new}$ represents the spectral parameter decoded from the SID frame newly received at the decoder.
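The two reconstruction rules above can be sketched in Python as follows. This is an illustrative sketch, not the ITU reference code: the function names `update_gain` and `subframe_lsfs`, the boolean flag `prev_was_speech`, and the use of plain lists for the spectral parameter are assumptions made for the example. The assignment of cases to conditions follows the formulas as stated above.

```python
def update_gain(g_prev, g_sid_new, prev_was_speech):
    """Reconstruct the signal energy gain for the current frame.

    g_prev:          gain used for the previous frame (G~_{t-1})
    g_sid_new:       gain decoded from the newly received SID frame
    prev_was_speech: True if the previous frame was a speech frame
    """
    if prev_was_speech:
        # First noise frame after speech: take the SID gain directly.
        return g_sid_new
    # Otherwise smooth toward the SID gain with weights 7/8 and 1/8.
    return 7.0 / 8.0 * g_prev + 1.0 / 8.0 * g_sid_new


def subframe_lsfs(lsf_sid_last, lsf_sid_new, prev_was_speech):
    """Return the spectral parameters for sub-frame 1 and sub-frame 2."""
    if prev_was_speech:
        # Sub-frame 1 is the average of the previous and the new SID values.
        sub1 = [0.5 * (a + b) for a, b in zip(lsf_sid_last, lsf_sid_new)]
    else:
        sub1 = list(lsf_sid_new)
    # Sub-frame 2 always uses the newly decoded SID values.
    sub2 = list(lsf_sid_new)
    return sub1, sub2


# Example with made-up values.
print(update_gain(8.0, 16.0, prev_was_speech=False))
print(subframe_lsfs([1.0, 2.0], [3.0, 4.0], prev_was_speech=True))
```

Note that the gain is smoothed recursively from frame to frame, whereas the spectral parameter is at most averaged once, within a single frame.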
The following problems appear in the conventional art.
In the DTX/CNG technology used in AMR, the 3GPP speech coding standard, the encoder can send SID frames only at fixed intervals. If the encoder sends SID frames at adaptive intervals, the system cannot operate correctly.
In the DTX/CNG technology used in the silence compression scheme defined for the conjugate-structure algebraic-code-excited linear prediction speech codec, the ITU speech coding standard, when the current frame is an SID frame, the spectral parameter for the first sub-frame of the current frame is generated by averaging the spectral parameter decoded from the current frame and the spectral parameter of the previous SID frame, and the decoded spectral parameter is used directly as the spectral parameter for the second sub-frame. For each NO_DATA frame before the arrival of the next SID frame, the decoded spectral parameter of the latest SID frame is used directly for noise reconstruction. When the next SID frame arrives and its decoded spectral parameter differs from that of the previous SID frame, a discontinuity may occur. Furthermore, since the spectral parameter changes constantly, two consecutive spectral parameters generally differ, so the spectrum of the reconstructed comfort noise tends to be discontinuous. This degrades the listening quality, especially when the difference between two consecutive spectral parameters is large.
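The discontinuity described above can be illustrated with a small Python sketch. It is a simplified model built only from the description in the preceding paragraph: the spectral parameter is reduced to a single float, a frame is represented as a float (SID frame) or `None` (NO_DATA frame), and the function names `hold_last_sid` and `largest_step` are hypothetical.

```python
def hold_last_sid(frames):
    """Reconstruct per-sub-frame spectral values with the hold scheme.

    frames: list where each entry is a float (value decoded from an SID
            frame) or None (NO_DATA frame).
    Returns a list of (sub1, sub2) tuples, one per frame.
    """
    track = []
    last = None  # value of the most recent SID frame
    for value in frames:
        if value is not None:
            # SID frame: sub-frame 1 averages old and new, sub-frame 2 is new.
            sub1 = value if last is None else 0.5 * (last + value)
            sub2 = value
            last = value
        else:
            # NO_DATA frame: reuse the latest SID value unchanged.
            if last is None:
                raise ValueError("NO_DATA frame before any SID frame")
            sub1 = sub2 = last
        track.append((sub1, sub2))
    return track


def largest_step(track):
    """Largest change between consecutive sub-frames in the track."""
    flat = [v for pair in track for v in pair]
    return max((abs(b - a) for a, b in zip(flat, flat[1:])), default=0.0)


# Made-up sequence: SID, two NO_DATA frames, a second SID, one NO_DATA frame.
track = hold_last_sid([1.0, None, None, 3.0, None])
print(track)
print(largest_step(track))
```

In this model the value stays flat across the NO_DATA frames, and the entire difference between the two SID values is absorbed within the single frame in which the second SID frame arrives.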