In coding systems used for conversational speech it is common to use discontinuous transmission (DTX) to increase the efficiency of the encoding. This is motivated by the large number of pauses in conversational speech, e.g. while one person is talking the other one is listening. With DTX the speech encoder needs to be active only about 50 percent of the time on average. Examples of codecs that have this feature are the 3GPP Adaptive Multi-Rate Narrowband (AMR NB) codec and the ITU-T G.718 codec.
In DTX operation active frames are coded in the normal codec modes, while inactive signal periods between active regions are represented with comfort noise. Signal describing parameters are extracted and encoded in the encoder and transmitted to the decoder in silence insertion description (SID) frames. The SID frames are transmitted at a reduced frame rate and a lower bit rate than used for the active speech coding mode(s). Between the SID frames no information about the signal characteristics is transmitted. Due to the low SID rate the comfort noise can only represent relatively stationary properties compared to the active signal frame coding. In the decoder the received parameters are decoded and used to characterize the comfort noise.
For high quality DTX operation, i.e. without degraded speech quality, it is important to detect the periods of speech in the input signal. This is done by using a voice activity detector (VAD) or a sound activity detector (SAD). FIG. 1 shows a block diagram of a generalized VAD, which analyses the input signal in data frames (of 5-30 ms depending on the implementation), and produces an activity decision for each frame.
A preliminary activity decision (Primary VAD Decision) is made in a primary voice detector 12 by comparing features for the current frame, estimated by a feature extractor 10, with background features estimated from previous input frames by a background estimation block 14. A difference larger than a specified threshold results in an active primary decision. In a hangover addition block 16 the primary decision is extended on the basis of past primary decisions to form the final activity decision (Final VAD Decision). The main reason for using hangover is to reduce the risk of mid-speech and back-end clipping in speech segments.
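The primary decision and hangover logic can be illustrated with a minimal sketch. It assumes a single feature (frame energy in dB), a fixed background estimate instead of one updated from previous frames, and hypothetical names and default values (`final_vad_decisions`, `threshold_db`, `hangover_frames`); none of these come from a specific codec.

```python
def final_vad_decisions(frame_energies_db, background_db,
                        threshold_db=6.0, hangover_frames=3):
    """Return one final activity decision (True = active) per frame.

    Assumption: the background estimate is a fixed value here; a real
    VAD would update it from previous input frames.
    """
    decisions = []
    hangover_left = 0
    for energy_db in frame_energies_db:
        # Primary decision: the feature exceeds the background by more
        # than the threshold.
        primary = (energy_db - background_db) > threshold_db
        if primary:
            hangover_left = hangover_frames  # re-arm the hangover counter
            decisions.append(True)
        elif hangover_left > 0:
            # Hangover: extend the active decision after the primary
            # decision has gone inactive.
            hangover_left -= 1
            decisions.append(True)
        else:
            decisions.append(False)
    return decisions
```

With a hangover of two frames, a single frame that triggers the primary decision keeps the final decision active for the two frames that follow it, which is what protects low-energy speech endings from being clipped.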
For speech codecs based on linear prediction (LP), e.g. G.718, it is reasonable to model the envelope and frame energy using a similar representation as for the active frames. This is beneficial since the memory requirements and complexity for the codec can be reduced by common functionality between the different modes in DTX operation.
For such codecs the comfort noise can be represented by its LP coefficients (also known as autoregressive (AR) coefficients) and the energy of the LP residual, i.e. the signal that, when used as input to the LP model, yields the reference audio segment. In the decoder, a residual signal is generated in the excitation generator as random noise, which is shaped by the CN parameters to form the comfort noise.
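The decoder-side generation can be sketched as follows. This is an illustration under stated assumptions, not the procedure of any particular codec: the excitation is Gaussian noise, it is normalised so that its mean-square value equals the received residual energy, and it is shaped by the all-pole filter 1/A[z] with A[z] = 1 + Σ a_k z^(-k). The function name `generate_comfort_noise` and the `seed` parameter are hypothetical.

```python
import math
import random


def generate_comfort_noise(a, residual_energy, n_samples, seed=0):
    """Shape random excitation with the LP synthesis filter 1/A[z].

    a: LP coefficients [a_1, ..., a_P] for A[z] = 1 + sum_k a_k z^-k.
    residual_energy: target mean-square value of the excitation.
    """
    rng = random.Random(seed)
    excitation = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

    # Assumption: scale the excitation so its energy matches the target.
    current = sum(e * e for e in excitation) / n_samples
    gain = math.sqrt(residual_energy / current)
    excitation = [gain * e for e in excitation]

    # Synthesis filtering: y[n] = e[n] - sum_k a_k * y[n-k],
    # with zero filter state before the first sample.
    output = []
    for n in range(n_samples):
        y = excitation[n]
        for k, a_k in enumerate(a, start=1):
            if n - k >= 0:
                y -= a_k * output[n - k]
        output.append(y)
    return output
```

A real decoder would carry the filter state across frames instead of starting from zero in each call.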
The LP coefficients are typically obtained by computing the autocorrelations r[k] of the windowed audio segments x[n], n=0, . . . , N−1 in accordance with:
$$r[k] = \sum_{n=k}^{N-1} x[n]\, x[n-k], \qquad k = 0, \ldots, P \tag{1}$$
where P is the pre-defined model order. Then the LP coefficients $a_k$ are obtained from the autocorrelation sequence using e.g. the Levinson-Durbin algorithm.
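A minimal sketch of equation (1) followed by a textbook Levinson-Durbin recursion is given below. The sign convention is chosen so that the returned coefficients fit a prediction-error filter of the form A[z] = 1 + Σ a_k z^(-k); the function names are hypothetical, and windowing of x[n] is assumed to have been applied already.

```python
def autocorrelation(x, order):
    """r[k] = sum_{n=k}^{N-1} x[n] * x[n-k] for k = 0..order."""
    n_samples = len(x)
    return [sum(x[n] * x[n - k] for n in range(k, n_samples))
            for k in range(order + 1)]


def levinson_durbin(r, order):
    """Return ([a_1, ..., a_P], prediction_error) from autocorrelations r."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate (e.g. all-zero) input: stop the recursion
        acc = r[m] + sum(a[k] * r[m - k] for k in range(1, m))
        reflection = -acc / err
        prev = a[:]
        # Update the lower-order coefficients, then append the new one.
        for k in range(1, m):
            a[k] = prev[k] + reflection * prev[m - k]
        a[m] = reflection
        err *= 1.0 - reflection * reflection
    return a[1:], err
```

For an autocorrelation sequence that decays as 0.5^k, the recursion yields a_1 = -0.5 and a_2 = 0, i.e. a first-order model is sufficient.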
In a communication system where such a codec is utilized, the LP coefficients should be efficiently transmitted from the encoder to the decoder. For this reason more compact representations that may be less sensitive to quantization noise are commonly used. For example, the LP coefficients can be transformed into line spectral pairs (LSP). In alternative implementations the LP coefficients may instead be converted to the immittance spectrum pairs (ISP), line spectrum frequencies (LSF) or immittance spectrum frequencies (ISF) domains.
The LP residual is obtained by filtering the reference signal through an inverse LP synthesis filter A[z] defined by:
$$A[z] = 1 + \sum_{k=1}^{P} a_k z^{-k} \tag{2}$$
The filtered residual signal s[n] is consequently given by:
$$s[n] = x[n] + \sum_{k=1}^{P} a_k\, x[n-k], \qquad n = 0, \ldots, N-1 \tag{3}$$
for which the energy is defined as:
$$E = \frac{1}{N} \sum_{n=0}^{N-1} s[n]^2 \tag{4}$$
Due to the low transmission rate of SID frames, the CN parameters should evolve slowly in order to not change the noise characteristics rapidly. For example, the G.718 codec limits the energy change between SID frames and interpolates the LSP coefficients to handle this.
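Equations (3) and (4) can be sketched directly in code. The example assumes that samples before the start of the segment are zero (x[n] = 0 for n < 0), which the text does not specify, and uses hypothetical function names.

```python
def lp_residual(x, a):
    """s[n] = x[n] + sum_{k=1}^{P} a_k * x[n-k], as in equation (3).

    Assumption: x[n] is treated as zero for n < 0.
    """
    residual = []
    for n in range(len(x)):
        s = x[n]
        for k, a_k in enumerate(a, start=1):
            if n - k >= 0:
                s += a_k * x[n - k]
        residual.append(s)
    return residual


def residual_energy(residual):
    """E = (1/N) * sum_n s[n]^2, as in equation (4)."""
    return sum(s * s for s in residual) / len(residual)
```

For the decaying segment x = [1, 0.5, 0.25, 0.125] and a_1 = -0.5, the predictor removes everything except the first sample, so the residual is [1, 0, 0, 0] and E = 0.25.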
To find representative CN parameters at the SID frames, LSP coefficients and residual energy are computed for every frame, including no data frames (thus, for no data frames the mentioned parameters are determined but not transmitted). At the SID frame the median LSP coefficients and mean residual energy are computed, encoded and transmitted to the decoder. In order for the comfort noise to not be unnaturally static, random variations may be added to the comfort noise parameters, e.g. a variation of the residual energy. This technique is for example used in the G.718 codec.
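The per-SID statistics can be sketched as below. The sketch assumes the median is taken independently for each LSP coefficient across the frames collected since the previous SID, which is one plausible reading of "median LSP coefficients" rather than a confirmed detail of G.718; the function name is hypothetical.

```python
from statistics import mean, median


def sid_parameters(lsp_vectors, residual_energies):
    """Summarise the frames collected since the previous SID frame.

    lsp_vectors: one LSP vector per frame (current SID and no data frames).
    residual_energies: one residual energy per frame.
    Assumption: the median is computed per coefficient.
    """
    median_lsp = [median(coeffs) for coeffs in zip(*lsp_vectors)]
    mean_energy = mean(residual_energies)
    return median_lsp, mean_energy
```

The median makes the transmitted LSP vector robust to a single outlier frame, while the energy is simply averaged.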
In addition, the comfort noise characteristics are not always well matched to the reference background noise, and slight attenuation of the comfort noise may reduce the listener's attention to this. The perceived audio quality can consequently become higher. In addition, the coded noise in active signal frames might have lower energy than the uncoded reference noise. Therefore attenuation may also be desirable for better energy matching of the noise representation in active and inactive frames. The attenuation is typically in the range 0-5 dB, and can be fixed or dependent on the active coding mode(s) bitrates.
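As a small arithmetic illustration, an attenuation given in dB can be applied either to an energy value or to the samples themselves. The function names are hypothetical, and whether a codec applies the attenuation to the residual energy or to the output signal is not stated in the text.

```python
def attenuate_energy(energy, attenuation_db):
    """Scale an energy (power-like quantity) down by attenuation_db."""
    return energy * 10.0 ** (-attenuation_db / 10.0)


def attenuate_samples(samples, attenuation_db):
    """Scale amplitudes down by attenuation_db (note the divisor of 20)."""
    gain = 10.0 ** (-attenuation_db / 20.0)
    return [gain * s for s in samples]
```

An attenuation of 3 dB roughly halves the energy, and 5 dB reduces it to about 32 percent of its original value.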
In highly efficient DTX systems a more aggressive VAD might be used, and high energy parts of the signal (relative to the background noise level) can accordingly be represented by comfort noise. In that case, limiting the energy change between the SID frames would cause perceptual degradation. To better handle the high energy segments, the system may allow larger instant changes of CN parameters for these circumstances.
Low-pass filtering or interpolation of the CN parameters is performed at the inactive frames in order to get natural smooth comfort noise dynamics. For the first SID frame following one or several active frames (from now on just denoted the “first SID”), the best basis for LSP interpolation and energy smoothing would be the CN parameters from previous inactive frames, i.e. prior to the active signal segment.
For each inactive frame, SID or no data, the LSP vector $q_i$ can be interpolated from previous LSP coefficients according to:
$$q_i = \alpha\, \tilde{q}_{SID} + (1-\alpha)\, q_{i-1} \tag{5}$$
where i is the frame number of inactive frames, $\alpha \in [0,1]$ is the smoothing factor and $\tilde{q}_{SID}$ are the median LSP coefficients computed with parameters from current SID and all no data frames since the previous SID frame. For the G.718 codec a smoothing factor α=0.1 is used.
The residual energy $E_i$ is similarly interpolated at the SID or no data frames according to:
$$E_i = \beta\, \bar{E}_{SID} + (1-\beta)\, E_{i-1} \tag{6}$$
where $\beta \in [0,1]$ is the smoothing factor and $\bar{E}_{SID}$ is the averaged energy for current SID and no data frames since the previous SID frame. For the G.718 codec a smoothing factor β=0.3 is used.
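The two interpolation rules, equations (5) and (6), can be sketched as one-step updates. The default smoothing factors are the G.718 values quoted above; the function names are hypothetical.

```python
def interpolate_lsp(prev_lsp, sid_lsp, alpha=0.1):
    """q_i = alpha * q_SID + (1 - alpha) * q_{i-1}, per coefficient."""
    return [alpha * q_sid + (1.0 - alpha) * q_prev
            for q_sid, q_prev in zip(sid_lsp, prev_lsp)]


def interpolate_energy(prev_energy, sid_energy, beta=0.3):
    """E_i = beta * E_SID + (1 - beta) * E_{i-1}."""
    return beta * sid_energy + (1.0 - beta) * prev_energy
```

Each call advances the interpolation memory by one inactive frame, so the returned value is passed back in as `prev_lsp` or `prev_energy` for the next frame.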
An issue with the described interpolation is that for the first SID the interpolation memories (E_{i-1} and q_{i-1}) may relate to previous high energy frames, e.g. unvoiced speech frames, which were classified as inactive by the VAD. In that case the first SID interpolation would start from noise characteristics that are not representative of the coded noise in the nearby active mode hangover frames. The same issue occurs if the characteristics of the background noise change during active signal segments, e.g. segments of a speech signal.
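How long a stale interpolation memory lingers can be estimated by unrolling the energy recursion. This is a worked illustration under a simplifying assumption that is not made in the text: the averaged SID energy stays constant at a value $\bar{E}$ over the following inactive frames, and $E_0$ denotes the stale memory at the first SID.

$$
\begin{aligned}
E_i - \bar{E} &= \beta \bar{E} + (1-\beta) E_{i-1} - \bar{E} \\
              &= (1-\beta)\,(E_{i-1} - \bar{E}) \\
              &= (1-\beta)^{i}\,(E_0 - \bar{E})
\end{aligned}
$$

With β=0.3 the initial energy mismatch is scaled by 0.7^10 ≈ 0.028 after 10 inactive frames. The same unrolling applied to the LSP interpolation with α=0.1 gives 0.9^10 ≈ 0.35, so roughly a third of the initial spectral mismatch would still remain after 10 inactive frames.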
An example of the problems related to prior art technologies is shown in FIG. 2. The spectrogram of a noisy speech signal encoded in DTX operation shows two segments of comfort noise before and after a segment of active coded audio (such as speech). It can be seen that when the noise characteristics from the first CN segment are used for the interpolation in the first SID, there is an abrupt change of the noise characteristics. After some time the comfort noise matches the end of the active coded audio better, but the bad transition causes a clear degradation of the perceived audio quality.
Using higher smoothing factors α and β would focus the CN parameters to the characteristics of the current SID, but this could still cause problems. Since the parameters in the first SID cannot be averaged during a period of noise, as following SID frames can, the CN parameters are only based on the signal properties in the current frame. Those parameters might represent the background noise at the current frame better than the long term characteristic in the interpolation memories. It is however possible that these SID parameters are outliers, and do not represent the long term noise characteristics. That would for example result in rapid unnatural changes of the noise characteristics, and a lower perceived audio quality.