The present invention relates to audio signal encoding, processing and decoding, and, in particular, to an apparatus and method for improved signal fade out for switched audio coding systems during error concealment.
In the following, the state of the art is described regarding speech and audio codecs fade out during packet loss concealment (PLC). The explanations regarding the state of the art start with the ITU-T codecs of the G-series (G.718, G.719, G.722, G.722.1, G.729, G.729.1), are followed by the 3GPP codecs (AMR, AMR-WB, AMR-WB+) and one IETF codec (OPUS), and conclude with two MPEG codecs (HE-AAC, HILN) (ITU=International Telecommunication Union; 3GPP=3rd Generation Partnership Project; AMR=Adaptive Multi-Rate; WB=Wideband; IETF=Internet Engineering Task Force). Subsequently, the state of the art regarding tracing the background noise level is analysed, followed by a summary which provides an overview.
At first, G.718 is considered. G.718 is a narrow-band and wideband speech codec that supports DTX/CNG (DTX=Discontinuous Transmission; CNG=Comfort Noise Generation). As embodiments particularly relate to low delay code, the low delay version mode will be described in more detail here.
Considering ACELP (Layer 1) (ACELP=Algebraic Code Excited Linear Prediction), the ITU-T recommends for G.718 [ITU08a, section 7.11] an adaptive fade out in the linear predictive domain to control the fading speed. Generally, the concealment follows this principle:
According to G.718, in case of frame erasures, the concealment strategy can be summarized as a convergence of the signal energy and the spectral envelope to the estimated parameters of the background noise. The periodicity of the signal is converged to zero. The speed of the convergence depends on the parameters of the last correctly received frame and the number of consecutive erased frames, and is controlled by an attenuation factor α. The attenuation factor α is further dependent on the stability θ of the LP filter (LP=Linear Prediction) for UNVOICED frames. In general, the convergence is slow if the last good received frame is in a stable segment and is rapid if the frame is in a transition segment.
The attenuation factor α depends on the speech signal class, which is derived by signal classification described in [ITU08a, section 6.8.1.3.1 and 7.11.1.1]. The stability factor θ is computed based on a distance measure between the adjacent ISF (Immittance Spectral Frequency) filters [ITU08a, section 7.1.2.4.2].
Table 1 shows the calculation scheme of α:
TABLE 1
Values of the attenuation factor α. The value θ is a stability factor computed from a distance measure between the adjacent LP filters [ITU08a, section 7.1.2.4.2].

Last good received frame    Number of successive erased frames    α
ARTIFICIAL ONSET                                                  0.6
ONSET, VOICED               ≤3                                    1.0
                            >3                                    0.4
VOICED TRANSITION                                                 0.4
UNVOICED TRANSITION                                               0.8
UNVOICED                    =1                                    0.2 · θ + 0.8
                            =2                                    0.6
                            >2                                    0.4
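The selection rule of Table 1 can be sketched as a small lookup function. This is only an illustrative sketch, not the G.718 reference code: the enum `FrameClass`, the function name `g718_alpha` and its parameter names are hypothetical and chosen here for readability.

```c
#include <assert.h>

/* Hypothetical class labels mirroring the rows of Table 1. */
typedef enum {
    CLASS_ARTIFICIAL_ONSET,
    CLASS_ONSET,
    CLASS_VOICED,
    CLASS_VOICED_TRANSITION,
    CLASS_UNVOICED_TRANSITION,
    CLASS_UNVOICED
} FrameClass;

/* Returns the attenuation factor alpha according to Table 1.
 * last_good: class of the last good received frame
 * n_erased:  number of successive erased frames (>= 1)
 * theta:     stability factor of the LP filter (only used for UNVOICED) */
float g718_alpha(FrameClass last_good, int n_erased, float theta)
{
    assert(n_erased >= 1);
    switch (last_good) {
    case CLASS_ARTIFICIAL_ONSET:
        return 0.6f;
    case CLASS_ONSET:
    case CLASS_VOICED:
        return (n_erased <= 3) ? 1.0f : 0.4f;
    case CLASS_VOICED_TRANSITION:
        return 0.4f;
    case CLASS_UNVOICED_TRANSITION:
        return 0.8f;
    case CLASS_UNVOICED:
        if (n_erased == 1) return 0.2f * theta + 0.8f;
        if (n_erased == 2) return 0.6f;
        return 0.4f;
    }
    return 0.4f; /* not reached for valid enum values */
}
```

For example, `g718_alpha(CLASS_VOICED, 4, theta)` yields 0.4 for any `theta`, reflecting the faster fade after more than three lost frames.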
Moreover, G.718 provides a fading method in order to modify the spectral envelope. The general idea is to converge the last ISF parameters towards an adaptive ISF mean vector. At first, an average ISF vector is calculated from the last 3 known ISF vectors. Then the average ISF vector is again averaged with an offline trained long term ISF vector (which is a constant vector) [ITU08a, section 7.11.1.2].
Moreover, G.718 provides a fading method to control the long term behavior and thus the interaction with the background noise, where the pitch excitation energy (and thus the excitation periodicity) is converging to 0, while the random excitation energy is converging to the CNG excitation energy [ITU08a, section 7.11.1.6]. The innovation gain attenuation is calculated as
$$g_s^{[1]} = \alpha\, g_s^{[0]} + (1 - \alpha)\, g_n \tag{1}$$
where gs[1] is the innovative gain at the beginning of the next frame, gs[0] is the innovative gain at the beginning of the current frame, gn is the gain of the excitation used during the comfort noise generation, and α is the attenuation factor.
Similarly to the periodic excitation attenuation, the gain is attenuated linearly throughout the frame on a sample-by-sample basis, starting with gs[0] and reaching gs[1] at the beginning of the next frame.
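The gain update of equation (1) and the sample-by-sample linear attenuation can be sketched as follows. The function names and the buffer layout are assumptions made for illustration and are not taken from the G.718 reference implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Equation (1): innovative gain at the beginning of the next frame.
 * gs0:   innovative gain at the beginning of the current frame
 * gn:    gain of the excitation used during comfort noise generation
 * alpha: attenuation factor */
float next_innovation_gain(float gs0, float gn, float alpha)
{
    return alpha * gs0 + (1.0f - alpha) * gn;
}

/* Fills gain[0..frame_len-1] with a linear ramp that starts at gs0 and
 * would reach gs1 at sample index frame_len, i.e. at the beginning of
 * the next frame. */
void innovation_gain_ramp(float gs0, float gs1, float *gain, size_t frame_len)
{
    assert(gain != NULL && frame_len > 0);
    const float step = (gs1 - gs0) / (float)frame_len;
    for (size_t n = 0; n < frame_len; n++) {
        gain[n] = gs0 + step * (float)n;
    }
}
```

Applying `next_innovation_gain` repeatedly with α < 1 moves the gain towards gn, which matches the convergence to the comfort noise level described above.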
FIG. 2 outlines the decoder structure of G.718. In particular, FIG. 2 illustrates a high level G.718 decoder structure for PLC, featuring a high pass filter.
By the above-described approach of G.718, the innovative gain gs converges to the gain used during comfort noise generation gn for long bursts of packet losses. As described in [ITU08a, section 6.12.3], the comfort noise gain gn is given as the square root of the energy Ẽ. The conditions of the update of Ẽ are not described in detail. Following the reference implementation (floating point C-code, stat_noise_uv_mod.c), Ẽ is derived as follows:

    if (unvoiced_vad == 0) {
        if (unv_cnt > 20) {
            ftmp = lp_gainc * lp_gainc;
            lp_ener = 0.7f * lp_ener + 0.3f * ftmp;
        } else {
            unv_cnt++;
        }
    } else {
        unv_cnt = 0;
    }

wherein unvoiced_vad holds the voice activity detection, wherein unv_cnt holds the number of unvoiced frames in a row, wherein lp_gainc holds the low passed gains of the fixed codebook, and wherein lp_ener holds the low passed CNG energy estimate Ẽ, which is initialized with 0.
Furthermore, G.718 provides a high pass filter, introduced into the signal path of the unvoiced excitation, if the signal of the last good frame was classified differently from UNVOICED, see FIG. 2, also see [ITU08a, section 7.11.1.6]. This filter has a low shelf characteristic with a frequency response at DC being around 5 dB lower than at Nyquist frequency.
Moreover, G.718 proposes a decoupled LTP feedback loop (LTP=Long-Term Prediction): While during normal operation the feedback loop for the adaptive codebook is updated subframe-wise ([ITU08a, section 7.1.2.1.4]) based on the full excitation, during concealment this feedback loop is updated frame-wise (see [ITU08a, sections 7.11.1.4, 7.11.2.4, 7.11.1.6, 7.11.2.6; dec_GV_exc@dec_gen_voic.c and syn_bfi_post@syn_bfi_pre_post.c]) based on the voiced excitation only. With this approach, the adaptive codebook is not “polluted” with noise originating from the randomly chosen innovation excitation.
Regarding the transform coded enhancement layers (3-5) of G.718, during concealment, the high layer decoding behaves similarly to normal operation, except that the MDCT spectrum is set to zero. No special fade-out behavior is applied during concealment.
With respect to CNG, in G.718, the CNG synthesis is done in the following order. At first, parameters of a comfort noise frame are decoded. Then, a comfort noise frame is synthesized. Afterwards the pitch buffer is reset. Then, the synthesis for the FER (Frame Error Recovery) classification is saved. Afterwards, spectrum deemphasis is conducted. Then low frequency post-filtering is conducted. Then, the CNG variables are updated.
In the case of concealment, exactly the same is performed, except the CNG parameters are not decoded from the bitstream. This means that the parameters are not updated during the frame loss, but the decoded parameters from the last good SID (Silence Insertion Descriptor) frame are used.
Now, G.719 is considered. G.719, which is based on Siren 22, is a transform based full-band audio codec. The ITU-T recommends for G.719 a fade-out with frame repetition in the spectral domain [ITU08b, section 8.6]. According to G.719, a frame erasure concealment mechanism is incorporated into the decoder. When a frame is correctly received, the reconstructed transform coefficients are stored in a buffer. If the decoder is informed that a frame has been lost or that a frame is corrupted, the transform coefficients reconstructed in the most recently received frame are decreasingly scaled with a factor 0.5 and then used as the reconstructed transform coefficients for the current frame. The decoder proceeds by transforming them to the time domain and performing the windowing-overlap-add operation.
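The frame repetition with scaling described for G.719 can be sketched as below. This is a simplified illustration under stated assumptions: the function name and buffer handling are hypothetical, and the sketch assumes that the scaled coefficients overwrite the stored ones, so that each further consecutive loss attenuates by another factor of 0.5.

```c
#include <assert.h>
#include <stddef.h>

/* Produces the transform coefficients for the current frame.
 * frame_ok: nonzero if the frame was correctly received
 * decoded:  reconstructed coefficients of a good frame (unused if lost)
 * stored:   buffer holding the coefficients of the previous frame
 * out:      coefficients passed on to the inverse transform
 * n:        number of coefficients */
void g719_conceal(int frame_ok, const float *decoded,
                  float *stored, float *out, size_t n)
{
    assert(stored != NULL && out != NULL);
    assert(!frame_ok || decoded != NULL);
    for (size_t k = 0; k < n; k++) {
        if (frame_ok) {
            stored[k] = decoded[k];   /* keep a copy for later concealment */
        } else {
            stored[k] *= 0.5f;        /* assumption: losses compound */
        }
        out[k] = stored[k];
    }
}
```

The output buffer would then be transformed to the time domain and overlap-added as in normal decoding.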
In the following, G.722 is described. G.722 is a 50 to 7000 Hz coding system which uses subband adaptive differential pulse code modulation (SB-ADPCM) within a bitrate up to 64 kbit/s. The signal is split into a higher and a lower subband, using a QMF analysis (QMF=Quadrature Mirror Filter). The resulting two bands are ADPCM-coded (ADPCM=Adaptive Differential Pulse Code Modulation).
For G.722, a high-complexity algorithm for packet loss concealment is specified in Appendix III [ITU06a] and a low-complexity algorithm for packet loss concealment is specified in Appendix IV [ITU07]. G.722—Appendix III ([ITU06a, section III.5]) proposes a gradually performed muting, starting after 20 ms of frame-loss, being completed after 60 ms of frame-loss. Moreover, G.722—Appendix IV proposes a fade-out technique which applies “to each sample a gain factor that is computed and adapted sample by sample” [ITU07, section IV.6.1.2.7].
In G.722, the muting process takes place in the subband domain just before the QMF synthesis and as the last step of the PLC module. The calculation of the muting factor is performed using class information from the signal classifier which also is part of the PLC module. The distinction is made between classes TRANSIENT, UV_TRANSITION and others. Furthermore, distinction is made between single losses of 10-ms frames and other cases (multiple losses of 10-ms frames and single/multiple losses of 20-ms frames).
This is illustrated by FIG. 3. In particular, FIG. 3 depicts a scenario where the fade-out factor of G.722 depends on class information and wherein 80 samples are equivalent to 10 ms.
According to G.722, the PLC module creates the signal for the missing frame and some additional signal (10 ms) which is supposed to be cross-faded with the next good frame. The muting for this additional signal follows the same rules. In highband concealment of G.722, cross-fading does not take place.
In the following, G.722.1 is considered. G.722.1, which is based on Siren 7, is a transform based wide band audio codec with a super wide band extension mode, referred to as G.722.1C. G.722.1C itself is based on Siren 14. The ITU-T recommends for G.722.1 a frame-repetition with subsequent muting [ITU05, section 4.7]. If the decoder is informed, by means of an external signaling mechanism not defined in this recommendation, that a frame has been lost or corrupted, it repeats the previous frame's decoded MLT (Modulated Lapped Transform) coefficients. It proceeds by transforming them to the time domain, and performing the overlap and add operation with the previous and next frame's decoded information. If the previous frame was also lost or corrupted, then the decoder sets all the current frame's MLT coefficients to zero.
Now, G.729 is considered. G.729 is an audio data compression algorithm for voice that compresses digital voice in packets of 10 milliseconds duration. It is officially described as Coding of speech at 8 kbit/s using conjugate-structure algebraic-code-excited linear prediction (CS-ACELP) [ITU12].
As outlined in [CPK08], G.729 recommends a fade-out in the LP domain. The PLC algorithm employed in the G.729 standard reconstructs the speech signal for the current frame based on previously-received speech information. In other words, the PLC algorithm replaces the missing excitation with an equivalent characteristic of a previously received frame, though the excitation energy gradually decays. Finally, the gains of the adaptive and fixed codebooks are attenuated by a constant factor.
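The repeat-then-mute rule of G.722.1 can be sketched as follows. The function name and the way the loss flags are passed in are assumptions for illustration; the recommendation itself leaves the signaling mechanism undefined.

```c
#include <assert.h>
#include <stddef.h>

/* Selects the MLT coefficients for the current frame.
 * frame_ok:      nonzero if the current frame was received correctly
 * prev_frame_ok: nonzero if the previous frame was received correctly
 * decoded:       MLT coefficients of the current frame (unused if lost)
 * prev_mlt:      decoded MLT coefficients of the previous frame
 * out:           coefficients passed to the inverse transform
 * n:             number of coefficients */
void g7221_conceal(int frame_ok, int prev_frame_ok,
                   const float *decoded, const float *prev_mlt,
                   float *out, size_t n)
{
    assert(out != NULL);
    for (size_t k = 0; k < n; k++) {
        if (frame_ok) {
            out[k] = decoded[k];     /* normal decoding */
        } else if (prev_frame_ok) {
            out[k] = prev_mlt[k];    /* first loss: repeat previous frame */
        } else {
            out[k] = 0.0f;           /* consecutive loss: mute */
        }
    }
}
```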
The attenuated fixed-codebook gain is given by:
$$g_c^{(m)} = 0.98 \cdot g_c^{(m-1)}$$
where m is the subframe index.
The adaptive-codebook gain is based on an attenuated version of the previous adaptive-codebook gain:
$$g_p^{(m)} = 0.9 \cdot g_p^{(m-1)}, \quad \text{bounded by } g_p^{(m)} < 0.9$$
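A minimal sketch of this per-subframe gain attenuation is given below. The struct and function names are hypothetical, and the bound on the adaptive-codebook gain is implemented as a clamp to 0.9, which is an assumption about how the bound is enforced.

```c
#include <assert.h>

/* Hypothetical container for the two codebook gains of one subframe. */
typedef struct {
    float gp; /* adaptive-codebook (pitch) gain */
    float gc; /* fixed-codebook gain */
} G729Gains;

/* Derives the gains of subframe m from those of subframe m-1. */
G729Gains g729_attenuate(G729Gains prev)
{
    G729Gains cur;
    cur.gc = 0.98f * prev.gc;
    cur.gp = 0.9f * prev.gp;
    if (cur.gp > 0.9f) {
        cur.gp = 0.9f; /* assumption: bound enforced by clamping */
    }
    return cur;
}
```

Calling `g729_attenuate` once per concealed subframe lets both gains decay geometrically, the fixed-codebook gain more slowly than the adaptive-codebook gain.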
Nam In Park et al. suggest, for G.729, a signal amplitude control using prediction by means of linear regression [CPK08, PKJ+11]. It addresses burst packet loss and uses linear regression as its core technique. Linear regression is based on the linear model
$$g'_i = a + b\,i \tag{2}$$
where g′i is the newly predicted current amplitude, a and b are coefficients of the first order linear function, and i is the index of the frame. In order to find the optimized coefficients a* and b*, the summation of the squared prediction error is minimized:
$$\epsilon = \sum_{j=i-4}^{i-1} \left( g_j - g'_j \right)^2 \tag{3}$$
where ϵ is the squared error and gj is the original past j-th amplitude. To minimize this error, the derivatives with respect to a and b are set to zero. By using the optimized parameters a* and b*, an estimate of each g*i is denoted by
$$g^*_i = a^* + b^*\,i \tag{4}$$
FIG. 4 shows the amplitude prediction, in particular, the prediction of the amplitude g*i, by using linear regression.
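The least-squares fit behind equations (2) to (4) can be sketched with the usual closed-form solution for a straight line through the four most recent amplitudes. The function name, the use of `double`, and the array layout are assumptions for illustration and do not come from [CPK08].

```c
#include <assert.h>

#define HISTORY 4 /* number of past amplitudes, j = i-4 .. i-1 */

/* Predicts the amplitude g*_i of frame i.
 * past[k] holds the amplitude of frame i - HISTORY + k. */
double predict_amplitude(const double past[HISTORY], int i)
{
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (int k = 0; k < HISTORY; k++) {
        const double x = (double)(i - HISTORY + k); /* frame index j */
        sx  += x;
        sy  += past[k];
        sxx += x * x;
        sxy += x * past[k];
    }
    /* Setting the derivatives of the squared error with respect to
     * a and b to zero yields the normal equations solved here. The
     * denominator is nonzero because the indices are distinct. */
    const double denom = HISTORY * sxx - sx * sx;
    const double b = (HISTORY * sxy - sx * sy) / denom;
    const double a = (sy - b * sx) / HISTORY;
    return a + b * (double)i;
}
```

For amplitudes that fall by one unit per frame, the sketch extrapolates the same slope to the lost frame; for constant amplitudes it returns that constant.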
To obtain the amplitude A′i of the lost packet i, a ratio σi
$$\sigma_i = \frac{g^*_i}{g_{i-1}} \tag{5}$$
is multiplied with a scale factor Si:
$$A'_i = S_i \cdot \sigma_i \tag{6}$$
wherein the scale factor Si depends on the number of consecutive concealed frames l(i):
$$S_i = \begin{cases} 1.0, & \text{if } l(i) = 1, 2 \\ 0.9, & \text{if } l(i) = 3, 4 \\ 0.8, & \text{if } l(i) = 5, 6 \\ 0, & \text{otherwise} \end{cases} \tag{7}$$
In [PKJ+11], a slightly different scaling is proposed.
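Equations (5) to (7) can be sketched as two small helper functions. The names are hypothetical, and the guard against a zero previous amplitude is an assumption added for robustness; it is not described in [CPK08].

```c
#include <assert.h>

/* Equation (7): scale factor as a function of the number of
 * consecutive concealed frames l. */
double scale_factor(int l)
{
    if (l == 1 || l == 2) return 1.0;
    if (l == 3 || l == 4) return 0.9;
    if (l == 5 || l == 6) return 0.8;
    return 0.0;
}

/* Equations (5) and (6): amplitude A'_i of the lost packet.
 * g_pred: predicted amplitude g*_i
 * g_prev: amplitude g_{i-1} of the previous frame
 * l:      number of consecutive concealed frames */
double concealed_amplitude(double g_pred, double g_prev, int l)
{
    if (g_prev == 0.0) {
        return 0.0; /* assumption: avoid division by zero */
    }
    const double sigma = g_pred / g_prev;
    return scale_factor(l) * sigma;
}
```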
According to G.729, A′i is afterwards smoothed to prevent discrete attenuation at frame borders. The final, smoothed amplitude Ai(n) is multiplied with the excitation obtained from the previous PLC components.
In the following, G.729.1 is considered. G.729.1 is a G.729-based embedded variable bit-rate coder: An 8-32 kbit/s scalable wideband coder bitstream inter-operable with G.729 [ITU06b].
According to G.729.1, as in G.718 (see above), an adaptive fade out is proposed, which depends on the stability of the signal characteristics ([ITU06b, section 7.6.1]). During concealment, the signal is usually attenuated based on an attenuation factor α which depends on the parameters of the last good received frame class and the number of consecutive erased frames. The attenuation factor α is further dependent on the stability of the LP filter for UNVOICED frames. In general, the attenuation is slow if the last good received frame is in a stable segment and is rapid if the frame is in a transition segment.
Furthermore, the attenuation factor α depends on the average pitch gain per subframe ḡp ([ITU06b, eq. 163, 164]):
$$\bar{g}_p = 0.1\, g_p^{(0)} + 0.2\, g_p^{(1)} + 0.3\, g_p^{(2)} + 0.4\, g_p^{(3)} \tag{8}$$
where gp(i) is the pitch gain in subframe i.
Table 2 shows the calculation scheme of α, where
$$\beta = \sqrt{\bar{g}_p} \quad \text{with } 0.85 \le \beta \le 0.98 \tag{9}$$
TABLE 2
Values of the attenuation factor α. The value θ is a stability factor computed from a distance measure between the adjacent LP filters [ITU06b, section 7.6.1].

Last good received frame    Number of successive erased frames    α
VOICED                      1                                     β
                            2, 3                                  ḡp
                            >3                                    0.4
ONSET                       1                                     0.8 β
                            2, 3                                  ḡp
                            >3                                    0.4
ARTIFICIAL ONSET            1                                     0.6 β
                            2, 3                                  ḡp
                            >3                                    0.4
VOICED TRANSITION           ≤2                                    0.8
                            >2                                    0.2
UNVOICED TRANSITION                                               0.88
UNVOICED                    1                                     0.95
                            2, 3                                  0.6 θ + 0.4
                            >3                                    0.4

During the concealment process, α is used in the following concealment tools:
According to G.729.1, regarding glottal pulse resynchronization, as the last pulse of the excitation of the previous frame is used for the construction of the periodic part, its gain is approximately correct at the beginning of the concealed frame and can be set to 1. The gain is then attenuated linearly throughout the frame on a sample-by-sample basis to achieve the value of α at the end of the frame. The energy evolution of voiced segments is extrapolated by using the pitch excitation gain values of each subframe of the last good frame. In general, if these gains are greater than 1, the signal energy is increasing; if they are lower than 1, the energy is decreasing. α is thus set to β, the square root of the average pitch gain ḡp, as described above, see [ITU06b, eq. 163, 164]. The value of β is clipped between 0.98 and 0.85 to avoid strong energy increases and decreases, see [ITU06b, section 7.6.4].
Regarding the construction of the random part of the excitation, according to G.729.1, at the beginning of an erased block, the innovation gain gs is initialized by using the innovation excitation gains of each subframe of the last good frame:
$$g_s = 0.1\, g^{(0)} + 0.2\, g^{(1)} + 0.3\, g^{(2)} + 0.4\, g^{(3)}$$
wherein g(0), g(1), g(2) and g(3) are the fixed codebook, or innovation, gains of the four subframes of the last correctly received frame. The innovation gain attenuation is done as:
$$g_s^{(1)} = \alpha \cdot g_s^{(0)}$$
wherein gs(1) is the innovation gain at the beginning of the next frame, gs(0) is the innovation gain at the beginning of the current frame, and α is as defined in Table 2 above. Similarly to the periodic excitation attenuation, the gain is thus linearly attenuated throughout the frame on a sample-by-sample basis, starting with gs(0) and going to the value of gs(1) that would be achieved at the beginning of the next frame.
According to G.729.1, if the last good frame is UNVOICED, only the innovation excitation is used and it is further attenuated by a factor of 0.8. In this case, the past excitation buffer is updated with the innovation excitation as no periodic part of the excitation is available, see [ITU06b, section 7.6.6].
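The weighted subframe averages and the selection of α according to Table 2 can be sketched as follows. All identifiers are hypothetical. The sketch assumes that β has already been computed as the square root of the average pitch gain and clipped to its range before being passed in.

```c
#include <assert.h>

/* Hypothetical class labels mirroring the rows of Table 2. */
typedef enum {
    G7291_VOICED,
    G7291_ONSET,
    G7291_ARTIFICIAL_ONSET,
    G7291_VOICED_TRANSITION,
    G7291_UNVOICED_TRANSITION,
    G7291_UNVOICED
} G7291Class;

/* Weighted average over the four subframe gains of the last good
 * frame, used both for the average pitch gain and for initializing
 * the innovation gain. */
float g7291_weighted_gain(const float g[4])
{
    return 0.1f * g[0] + 0.2f * g[1] + 0.3f * g[2] + 0.4f * g[3];
}

/* Returns alpha according to Table 2.
 * beta:   clipped square root of the average pitch gain (precomputed)
 * gp_avg: average pitch gain per subframe
 * theta:  stability factor of the LP filter */
float g7291_alpha(G7291Class last_good, int n_erased,
                  float beta, float gp_avg, float theta)
{
    assert(n_erased >= 1);
    switch (last_good) {
    case G7291_VOICED:
        if (n_erased == 1) return beta;
        return (n_erased <= 3) ? gp_avg : 0.4f;
    case G7291_ONSET:
        if (n_erased == 1) return 0.8f * beta;
        return (n_erased <= 3) ? gp_avg : 0.4f;
    case G7291_ARTIFICIAL_ONSET:
        if (n_erased == 1) return 0.6f * beta;
        return (n_erased <= 3) ? gp_avg : 0.4f;
    case G7291_VOICED_TRANSITION:
        return (n_erased <= 2) ? 0.8f : 0.2f;
    case G7291_UNVOICED_TRANSITION:
        return 0.88f;
    case G7291_UNVOICED:
        if (n_erased == 1) return 0.95f;
        return (n_erased <= 3) ? 0.6f * theta + 0.4f : 0.4f;
    }
    return 0.4f; /* not reached for valid enum values */
}
```

The innovation gain at the beginning of the next frame would then follow as α times the current innovation gain, with a linear ramp in between.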
In the following, AMR is considered. 3GPP AMR [3GP12b] is a speech codec utilizing the ACELP algorithm. AMR is able to code speech with a sampling rate of 8000 samples/s and a bitrate between 4.75 and 12.2 kbit/s and supports signaling silence descriptor frames (DTX/CNG).
In AMR, error concealment (see [3GP12a]) distinguishes between frames which are error prone (bit errors) and frames that are completely lost (no data at all).
For ACELP concealment, AMR introduces a state machine which estimates the quality of the channel: The larger the value of the state counter, the worse the channel quality is. The system starts in state 0. Each time a bad frame is detected, the state counter is incremented by one and is saturated when it reaches 6. Each time a good speech frame is detected, the state counter is reset to zero, except when the state is 6, where the state counter is set to 5. The control flow of the state machine can be described by the following C code (BFI is a bad frame indicator, State is a state variable):
    if (BFI != 0) {
        State = State + 1;
    } else if (State == 6) {
        State = 5;
    } else {
        State = 0;
    }
    if (State > 6) {
        State = 6;
    }
In addition to this state machine, in AMR, the bad frame flags from the current and the previous frames are checked (prevBFI).
Three different combinations are possible:
The first one of the three combinations is BFI=0, prevBFI=0, State=0: No error is detected in the received or in the previous received speech frame. The received speech parameters are used in the normal way in the speech synthesis. The current frame of speech parameters is saved.
The second one of the three combinations is BFI=0, prevBFI=1, State=0 or 5: No error is detected in the received speech frame, but the previous received speech frame was bad. The LTP gain and fixed codebook gain are limited below the values used for the last received good subframe:
$$g_p = \begin{cases} g_p, & g_p \le g_p(-1) \\ g_p(-1), & g_p > g_p(-1) \end{cases} \tag{10}$$
where gp = current decoded LTP gain, gp(−1) = LTP gain used for the last good subframe (BFI=0), and
$$g_c = \begin{cases} g_c, & g_c \le g_c(-1) \\ g_c(-1), & g_c > g_c(-1) \end{cases} \tag{11}$$
where gc = current decoded fixed codebook gain, and gc(−1) = fixed codebook gain used for the last good subframe (BFI=0).
The rest of the received speech parameters are used normally in the speech synthesis. The current frame of speech parameters is saved.
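The gain limiting of equations (10) and (11) amounts to taking the smaller of the decoded gain and the gain of the last good subframe. A minimal sketch, with hypothetical struct and function names, could look like this:

```c
#include <assert.h>

/* Hypothetical container for the two gains of one subframe. */
typedef struct {
    float ltp_gain;      /* g_p */
    float fixed_cb_gain; /* g_c */
} AmrGains;

/* Limits the decoded gains of the first good frame after a bad frame
 * to the values used for the last good subframe. */
AmrGains amr_limit_gains(AmrGains decoded, AmrGains last_good)
{
    AmrGains out = decoded;
    if (out.ltp_gain > last_good.ltp_gain) {
        out.ltp_gain = last_good.ltp_gain;           /* equation (10) */
    }
    if (out.fixed_cb_gain > last_good.fixed_cb_gain) {
        out.fixed_cb_gain = last_good.fixed_cb_gain; /* equation (11) */
    }
    return out;
}
```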
The third one of the three combinations is BFI=1, prevBFI=0 or 1, State=1 . . . 6: An error is detected in the received speech frame and the substitution and muting procedure is started. The LTP gain and fixed codebook gain are replaced by attenuated values from the previous subframes:
$$g_p = \begin{cases} P(\mathrm{state}) \cdot g_p(-1), & g_p(-1) \le \mathrm{median5}\big(g_p(-1), \ldots, g_p(-5)\big) \\ P(\mathrm{state}) \cdot \mathrm{median5}\big(g_p(-1), \ldots, g_p(-5)\big), & g_p(-1) > \mathrm{median5}\big(g_p(-1), \ldots, g_p(-5)\big) \end{cases} \tag{12}$$
where gp indicates the current decoded LTP gain and gp(−1), . . . , gp(−n) indicate the LTP gains used for the last n subframes, median5( ) indicates a 5-point median operation, P(state) is the attenuation factor, with P(1)=0.98, P(2)=0.98, P(3)=0.8, P(4)=0.3, P(5)=0.2, P(6)=0.2, and state is the state number, and
                    ⁢          (      13      )                  g      c        =          {                                                                                    C                  ⁡                                      (                    state                    )                                                  ·                                                      g                    c                                    ⁡                                      (                                          -                      1                                        )                                                              ,                                                                                            g                  c                                ⁡                                  (                                      -                    1                                    )                                            ≤                              median                ⁢                                                                  ⁢                5                ⁢                                  (                                                                                    g                        c                                            ⁡                                              (                                                  -                          1                                                )                                                              ,                    …                    ⁢                                                                                  ,                                                                  g                        c                                            ⁡                                              (                                                  -                          5                                                )      
                                                                        )                                                                                                                                                                                    C                      ⁡                                              (                        state                        )                                                              ·                                                                                                                    median                    ⁢                                                                                  ⁢                    5                    ⁢                                          (                                                                                                    g                            c                                                    ⁡                                                      (                                                          -                              1                                                        )                                                                          ,                        …                        ⁢                                                                                                  ,                                                                              g                            c                                                    ⁡                                                      (                                                          -                              5                                                        )                                                                                              )                                                                                                                                                                g                  c 
                               ⁡                                  (                                      -                    1                                    )                                            >                              median                ⁢                                                                  ⁢                5                ⁢                                  (                                                                                    g                        c                                            ⁡                                              (                                                  -                          1                                                )                                                              ,                    …                    ⁢                                                                                  ,                                                                  g                        c                                            ⁡                                              (                                                  -                          5                                                )                                                                              )                                                                        where gc indicates the current decoded fixed codebook gain and gc(−1), . . . gc (−n) indicate the fixed codebook gains used for the last n subframes and median5( ) indicates a 5-point median operation and C(state)=attenuation factor, where (C(1)=0.98, C(2)=0.98, C(3)=0.98, C(4)=0.98, C(5)=0.98, C(6)=0.7) and state=state number.
In AMR, the LTP-lag values (LTP=Long-Term Prediction) are replaced by the past value from the 4th subframe of the previous frame (12.2 mode) or slightly modified values based on the last correctly received value (all other modes).
According to AMR, the received fixed codebook innovation pulses from the erroneous frame are used in the state in which they were received when corrupted data are received. In the case when no data were received random fixed codebook indices should be employed.
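The gain rules of equations (12) and (13) above can be illustrated with a minimal C sketch. The names median5, conceal_gain, P_ATT and C_ATT and the array layout (hist[0] holding the gain of subframe −1) are assumptions made for this illustration only and are not taken from the AMR reference code.

```c
#include <assert.h>
#include <stddef.h>

/* Attenuation tables indexed by state 1..6 (index 0 is unused), as listed above. */
const float P_ATT[7] = {0.0f, 0.98f, 0.98f, 0.8f, 0.3f, 0.2f, 0.2f};
const float C_ATT[7] = {0.0f, 0.98f, 0.98f, 0.98f, 0.98f, 0.98f, 0.7f};

/* 5-point median, computed by insertion sort on a local copy. */
float median5(const float g[5])
{
    float s[5];
    for (size_t i = 0; i < 5; i++) {
        float v = g[i];
        size_t j = i;
        while (j > 0 && s[j - 1] > v) {
            s[j] = s[j - 1];
            j--;
        }
        s[j] = v;
    }
    return s[2];
}

/* hist[0] is g(-1), ..., hist[4] is g(-5); att is P_ATT or C_ATT. */
float conceal_gain(const float hist[5], const float att[7], int state)
{
    assert(state >= 1 && state <= 6);
    float med = median5(hist);
    /* Both cases of the rule reduce to attenuating min(g(-1), median). */
    float base = (hist[0] <= med) ? hist[0] : med;
    return att[state] * base;
}
```

Both branches of each equation attenuate the smaller of the last gain and the median of the last five gains, which is why a single function with a table argument may serve for the LTP gain and for the fixed codebook gain.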
Regarding CNG in AMR, according to [3GP12a, section 6.4], each first lost SID frame is substituted by using the SID information from earlier received valid SID frames and the procedure for valid SID frames is applied. For subsequent lost SID frames, an attenuation technique is applied to the comfort noise that will gradually decrease the output level. To this end, it is checked whether the last SID update was more than 50 frames (=1 s) ago; if so, the output is muted (level attenuation by −6/8 dB per frame [3GP12d, dtx_dec{ }@sp_dec.c], which yields 37.5 dB per second). Note that the fade-out applied to CNG is performed in the LP domain.
In the following, AMR-WB is considered. Adaptive Multirate-WB [ITU03, 3GP09c] is an ACELP speech codec based on AMR (see section 1.8). It uses parametric bandwidth extension and also supports DTX/CNG. In the description of the standard [3GP12g], concealment example solutions are given which are the same as for AMR [3GP12a] with minor deviations. Therefore, only the differences to AMR are described here. For the standard description, see the description above.
Regarding ACELP, in AMR-WB, the ACELP fade-out is performed based on the reference source code [3GP12c] by modifying the pitch gain gp (for AMR above referred to as LTP gain) and by modifying the code gain gc.
In case of a lost frame, the pitch gain gp for the first subframe is the same as in the last good frame, except that it is limited between 0.95 and 0.5. For the second, the third and the following subframes, the pitch gain gp is decreased by a factor of 0.95 and again limited.
AMR-WB proposes that in a concealed frame, gc is based on the last gc:
$$g_{c,current} = g_{c,past} \cdot (1.4 - g_{p,past}) \tag{14}$$
$$g_c = g_{c,current} \cdot g_{c_{inov}} \tag{15}$$
$$g_{c_{inov}} = \frac{1.0}{\sqrt{\dfrac{ener_{inov}}{subframe\_size}}} \tag{16}$$
$$ener_{inov} = \sum_{i=0}^{subframe\_size-1} code[i] \tag{17}$$
For concealing the LTP-lags, in AMR-WB, the history of the five last good LTP-lags and LTP-gains is used for finding the best method to update in case of a frame loss. If the frame is received with bit errors, a prediction is made as to whether the received LTP lag is usable or not [3GP12g].
Regarding CNG, in AMR-WB, if the last correctly received frame was a SID frame and a frame is classified as lost, it shall be substituted by the last valid SID frame information and the procedure for valid SID frames should be applied.
For subsequent lost SID frames, AMR-WB proposes to apply an attenuation technique to the comfort noise that will gradually decrease the output level. To this end, it is checked whether the last SID update was more than 50 frames (=1 s) ago; if so, the output is muted (level attenuation by −3/8 dB per frame [3GP12f, dtx_dec{ }@dtx.c], which yields 18.75 dB per second). Note that the fade-out applied to CNG is performed in the LP domain.
Now, AMR-WB+ is considered. Adaptive Multirate-WB+ [3GP09a] is a switched codec using ACELP and TCX (TCX=Transform Coded Excitation) as core codecs. It uses parametric bandwidth extension and also supports DTX/CNG.
In AMR-WB+, a mode extrapolation logic is applied to extrapolate the modes of the lost frames within a distorted superframe. This mode extrapolation is based on the fact that there exists redundancy in the definition of mode indicators. The decision logic (given in [3GP09a, FIG. 18]) proposed by AMR-WB+ is as follows:
    A vector mode, (m_{−1}, m_0, m_1, m_2, m_3), is defined, where m_{−1} indicates the mode of the last frame of the previous superframe and m_0, m_1, m_2, m_3 indicate the modes of the frames in the current superframe (decoded from the bitstream), where m_k=−1, 0, 1, 2 or 3 (−1: lost, 0: ACELP, 1: TCX20, 2: TCX40, 3: TCX80), and where the number of lost frames nloss may be between 0 and 4.
    If m_{−1}=3 and two of the mode indicators of the frames 0-3 are equal to three, all indicators will be set to three, because it is then certain that one TCX80 frame was indicated within the superframe.
    If only one indicator of the frames 0-3 is three (and the number of lost frames nloss is three), the mode will be set to (1, 1, 1, 1), because then ¾ of the TCX80 target spectrum is lost and it is very likely that the global TCX gain is lost.
    If the mode is indicating (x, 2, −1, x, x) or (x, −1, 2, x, x), it will be extrapolated to (x, 2, 2, x, x), indicating a TCX40 frame. If the mode indicates (x, x, x, 2, −1) or (x, x, x, −1, 2), it will be extrapolated to (x, x, x, 2, 2), also indicating a TCX40 frame. It should be noted that (x, [0, 1], 2, 2, [0, 1]) are invalid configurations.
    After that, for each frame that is lost (mode=−1), the mode is set to ACELP (mode=0) if the preceding frame was ACELP and the mode is set to TCX20 (mode=1) for all other cases.
Regarding ACELP, according to AMR-WB+, if the mode of a lost frame results in m_k=0 after the mode extrapolation, the same approach as in [3GP12g] is applied for this frame (see above).
In AMR-WB+, depending on the number of lost frames and the extrapolated mode, the following TCX related concealment approaches are distinguished:
    If a full frame is lost, then an ACELP like concealment is applied: The last excitation is repeated and concealed ISF coefficients (slightly shifted towards their adaptive mean) are used to synthesize the time domain signal. Additionally, a fade-out factor of 0.7 per frame (20 ms) [3GP09b, dec_tcx.c] is multiplied in the linear predictive domain, right before the LPC (Linear Predictive Coding) synthesis.
    If the last mode was TCX80 and the extrapolated mode of the (partially lost) superframe is also TCX80 (nloss=[1, 2], mode=(3, 3, 3, 3, 3)), concealment is performed in the FFT domain, utilizing phase and amplitude extrapolation, taking the last correctly received frame into account. The extrapolation approach of the phase information is not of interest here (no relation to the fading strategy) and is therefore not described. For further details, see [3GP09a, section 6.5.1.2.4]. With respect to the amplitude modification of AMR-WB+, the approach performed for TCX concealment consists of the following steps [3GP09a, section 6.5.1.2.3]:
        The previous frame magnitude spectrum is computed: oldA[k] = |oldX̂[k]|
        The current frame magnitude spectrum is computed: A[k] = |X̂[k]|
        The gain difference of energy of non-lost spectral coefficients between the previous and the current frame is computed:
        $$gain = \sqrt{\frac{\sum A[k]^2}{\sum oldA[k]^2}}$$
        The amplitude of the missing spectral coefficients is extrapolated using: if (lost[k]) A[k] = gain · oldA[k]
    In every other case of a lost frame with m_k=[2, 3], the TCX target (inverse FFT of the decoded spectrum plus noise fill-in (using a noise level decoded from the bitstream)) is synthesized using all available information (including the global TCX gain). No fade-out is applied in this case.
Regarding CNG in AMR-WB+, the same approach as in AMR-WB is used (see above).
In the following, OPUS is considered. OPUS [IET12] incorporates technology from two codecs: the speech-oriented SILK (known as the Skype codec) and the low-latency CELT (CELT=Constrained-Energy Lapped Transform). Opus can be adjusted seamlessly between high and low bitrates, and internally, it switches between a linear prediction codec at lower bitrates (SILK) and a transform codec at higher bitrates (CELT) as well as a hybrid for a short overlap.
Regarding SILK audio data compression and decompression, in OPUS, several parameters are attenuated during concealment in the SILK decoder routine. The LTP gain parameter is attenuated by multiplying all LPC coefficients with either 0.99, 0.95 or 0.90 per frame, depending on the number of consecutive lost frames, where the excitation is built up using the last pitch cycle from the excitation of the previous frame. The pitch lag parameter is very slowly increased during consecutive losses. For single losses it is kept constant compared to the last frame. Moreover, the excitation gain parameter is exponentially attenuated with $0.99^{lostcnt}$ per frame, so that the excitation gain parameter is 0.99 for the first excitation gain parameter, $0.99^2$ for the second excitation gain parameter, and so on. The excitation is generated using a random number generator which generates white noise by variable overflow. Furthermore, the LPC coefficients are extrapolated/averaged based on the last correctly received set of coefficients. After generating the attenuated excitation vector, the concealed LPC coefficients are used in OPUS to synthesize the time domain output signal.
Now, in the context of OPUS, CELT is considered. CELT is a transform based codec. The concealment of CELT features a pitch based PLC approach, which is applied for up to five consecutively lost frames. Starting with frame 6, a noise like concealment approach is applied, which generates background noise whose characteristic is supposed to sound like the preceding background noise.
FIG. 5 illustrates the burst loss behavior of CELT. In particular, FIG. 5 depicts a spectrogram (x-axis: time; y-axis: frequency) of a CELT concealed speech segment. The light grey box indicates the first 5 consecutively lost frames, where the pitch based PLC approach is applied. Beyond that, the noise like concealment is shown. It should be noted that the switching is performed instantly; there is no smooth transition.
Regarding pitch based concealment, in OPUS, the pitch based concealment consists of finding the periodicity in the decoded signal by autocorrelation and repeating the windowed waveform (in the excitation domain using LPC analysis and synthesis) using the pitch offset (pitch lag). The windowed waveform is overlapped in such a way as to preserve the time-domain aliasing cancellation with the previous frame and the next frame [IET12]. Additionally a fade-out factor is derived and applied by the following code:
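opus_val32 E1 = 1, E2 = 1;  /* start at 1 so that E2 is never zero */
int period;
if (pitch_index <= MAX_PERIOD/2) {
    period = pitch_index;
} else {
    period = MAX_PERIOD/2;
}
for (i = 0; i < period; i++) {
    /* energy of the last and of the second last pitch cycle */
    E1 += exc[MAX_PERIOD - period + i] * exc[MAX_PERIOD - period + i];
    E2 += exc[MAX_PERIOD - 2*period + i] * exc[MAX_PERIOD - 2*period + i];
}
if (E1 > E2) {
    E1 = E2;  /* rising energy: keep the attenuation at 1 */
}
decay = sqrt(E1/E2);
attenuation = decay;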
In this code, exc contains the excitation signal up to MAX_PERIOD samples before the loss.
The excitation signal is later multiplied with attenuation, then synthesized and output via LPC synthesis.
The fading algorithm for the time domain approach can be summarized like this:
    Find the pitch synchronous energy of the last pitch cycle before the loss.
    Find the pitch synchronous energy of the second last pitch cycle before the loss.
    If the energy is increasing, limit it to stay constant: attenuation=1
    If the energy is decreasing, continue with the same attenuation during concealment.
Regarding noise like concealment, according to OPUS, for the 6th and following consecutive lost frames a noise substitution approach in the MDCT domain is performed, in order to simulate comfort background noise.
Regarding tracing of the background noise level and shape, in OPUS, the background noise estimate is performed as follows: After the MDCT analysis, the square root of the MDCT band energies is calculated per frequency band, where the grouping of the MDCT bins follows the Bark scale according to [IET12, Table 55]. Then the square root of the energies is transformed into the log2 domain by:
$$\mathrm{bandLogE}[i] = \log_2(e) \cdot \log_e(\mathrm{bandE}[i]) - \mathrm{eMeans}[i] \quad \text{for } i = 0 \ldots 21 \tag{18}$$
where e is Euler's number, bandE is the square root of the MDCT band energy and eMeans is a vector of constants (needed to make the result zero-mean, which results in an enhanced coding gain).
In OPUS, the background noise is logged on the decoder side like this [IET12, amp2Log2 and log2Amp@quant_bands.c]:
$$\mathrm{backgroundLogE}[i] = \min(\mathrm{backgroundLogE}[i] + 8 \cdot 0.001,\ \mathrm{bandLogE}[i]) \quad \text{for } i = 0 \ldots 21 \tag{19}$$
The traced minimum energy is basically determined by the square root of the energy of the band of the current frame, but the increase from one frame to the next is limited by 0.05 dB.
Regarding the application of the background noise level and shape, according to OPUS, if the noise like PLC is applied, backgroundLogE as derived in the last good frame is used and converted back to the linear domain:
$$\mathrm{bandE}[i] = e^{\log_e(2) \cdot (\mathrm{backgroundLogE}[i] + \mathrm{eMeans}[i])} \quad \text{for } i = 0 \ldots 21 \tag{20}$$
where e is Euler's number and eMeans is the same vector of constants as for the “linear to log” transform.
The current concealment procedure is to fill the MDCT frame with white noise produced by a random number generator, and to scale this white noise such that it matches the energy of bandE band-wise. Subsequently, the inverse MDCT is applied, which results in a time domain signal. After the overlap-add and deemphasis (as in regular decoding), the signal is output.
In the following, MPEG-4 HE-AAC is considered (MPEG=Moving Picture Experts Group; HE-AAC=High Efficiency Advanced Audio Coding). High Efficiency Advanced Audio Coding consists of a transform based audio codec (AAC), supplemented by a parametric bandwidth extension (SBR).
Regarding AAC (AAC=Advanced Audio Coding), the DAB consortium specifies for AAC in DAB+, a fade-out to zero in the frequency domain [EBU10, section A1.2] (DAB=Digital Audio Broadcasting). Fade-out behavior, e.g., the attenuation ramp, might be fixed or adjustable by the user. The spectral coefficients from the last AU (AU=Access Unit) are attenuated by a factor corresponding to the fade-out characteristics and then passed to the frequency-to-time mapping. Depending on the attenuation ramp, the concealment switches to muting after a number of consecutive invalid AUs, which means the complete spectrum will be set to 0.
The DRM (DRM=Digital Radio Mondiale) consortium specifies for AAC in DRM a fade-out in the frequency domain [EBU12, section 5.3.3]. Concealment works on the spectral data just before the final frequency-to-time conversion. If multiple frames are corrupted, concealment first implements a fade-out based on slightly modified spectral values from the last valid frame. Moreover, similar to DAB+, fade-out behavior, e.g., the attenuation ramp, might be fixed or adjustable by the user. The spectral coefficients from the last frame are attenuated by a factor corresponding to the fade-out characteristics and then passed to the frequency-to-time mapping. Depending on the attenuation ramp, the concealment switches to muting after a number of consecutive invalid frames, which means the complete spectrum will be set to 0.
3GPP introduces for AAC in Enhanced aacPlus a fade-out in the frequency domain similar to DRM [3GP12e, section 5.1]. Concealment works on the spectral data just before the final frequency-to-time conversion. If multiple frames are corrupted, concealment first implements a fade-out based on slightly modified spectral values from the last good frame. A complete fade-out takes 5 frames. The spectral coefficients from the last good frame are copied and attenuated by a factor of $\mathrm{fadeOutFac} = 2^{-(\mathrm{nFadeOutFrame}/2)}$, with nFadeOutFrame as the frame counter since the last good frame. After five frames of fading out, the concealment switches to muting, which means the complete spectrum will be set to 0.
Lauber and Sperschneider introduce for AAC a frame-wise fade-out of the MDCT spectrum, based on energy extrapolation [LS01, section 4.4]. Energy shapes of a preceding spectrum might be used to extrapolate the shape of an estimated spectrum. Energy extrapolation can be performed independent of the concealment techniques as a kind of post concealment.
Regarding AAC, the energy calculation is performed on a scale factor band basis in order to be close to the critical bands of the human auditory system. The individual energy values are decreased on a frame by frame basis in order to reduce the volume smoothly, e.g., to fade out the signal. This is necessary because the probability that the estimated values represent the current signal decreases rapidly over time.
For the generation of the spectrum to be fed out they suggest frame repetition or noise substitution [LS01, sections 3.2 and 3.3].
Quackenbush and Driesen suggest for AAC an exponential frame-wise fade-out to zero [QD03]. A repetition of an adjacent set of time/frequency coefficients is proposed, wherein each repetition has exponentially increasing attenuation, thus fading gradually to mute in the case of extended outages.
Regarding SBR (SBR=Spectral Band Replication) in MPEG-4 HE-AAC, 3GPP suggests for SBR in Enhanced aacPlus to buffer the decoded envelope data and, in case of a frame loss, to reuse the buffered energies of the transmitted envelope data and to decrease them by a constant ratio of 3 dB for every concealed frame. The result is fed into the normal decoding process where the envelope adjuster uses it to calculate the gains, used for adjusting the patched highbands created by the HF generator. SBR decoding then takes place as usual. Moreover, the delta coded noise floor and sine level values are being deleted. As no difference to the previous information remains available, the decoded noise floor and sine levels remain proportional to the energy of the HF generated signal [3GP12e, section 5.2].
The DRM consortium specified for SBR in conjunction with AAC the same technique as 3GPP [EBU12, section 5.6.3.1]. Moreover, the DAB consortium specifies for SBR in DAB+ the same technique as 3GPP [EBU10, section A2].
In the following, MPEG-4 CELP and MPEG-4 HVXC (HVXC=Harmonic Vector Excitation Coding) are considered. The DRM consortium specifies for SBR in conjunction with CELP and HVXC [EBU12, section 5.6.3.2] that the minimum requirement concealment for SBR for the speech codecs is to apply a predetermined set of data values whenever a corrupted SBR frame has been detected. Those values yield a static highband spectral envelope at a low relative playback level, exhibiting a roll-off towards the higher frequencies. The objective is simply to ensure that no ill-behaved, potentially loud, audio bursts reach the listener's ears, by means of inserting “comfort noise” (as opposed to strict muting). This is in fact no real fade-out but rather a jump to a certain energy level in order to insert some kind of comfort noise.
Subsequently, an alternative is mentioned [EBU12, section 5.6.3.2] which reuses the last correctly decoded data and slowly fades the levels (L) towards 0, analogously to the AAC+SBR case.
Now, MPEG-4 HILN is considered (HILN=Harmonic and Individual Lines plus Noise). Meine et al. introduce a fade-out for the parametric MPEG-4 HILN codec [ISO09] in a parametric domain [MEP01]. For continued harmonic components a good default behavior for replacing corrupted differentially encoded parameters is to keep the frequency constant, to reduce the amplitude by an attenuation factor (e.g., −6 dB), and to let the spectral envelope converge towards that of the averaged low-pass characteristic. An alternative for the spectral envelope would be to keep it unchanged. With respect to amplitudes and spectral envelopes, noise components can be treated the same way as harmonic components.
In the following, tracing of the background noise level in known technology is considered. Rangachari and Loizou [RL06] provide a good overview of several methods and discuss some of their limitations. Methods for tracing the background noise level are, e.g., minimum tracking procedures [RL06] [Coh03] [SFB00] [Dob95], VAD based methods (VAD=Voice Activity Detection), Kalman filtering [Gan05] [BJH06], subspace decompositions [BP06] [HJH08], soft decision [SS98] [MPC89] [HE95], and minimum statistics.
The minimum statistics approach was chosen to be used within the scope for USAC-2 (USAC=Unified Speech and Audio Coding) and is subsequently outlined in more detail.
Noise power spectral density estimation based on optimal smoothing and minimum statistics [Mar01] introduces a noise estimator, which is capable of working independently of the signal being active speech or background noise. In contrast to other methods, the minimum statistics algorithm does not use any explicit threshold to distinguish between speech activity and speech pause and is therefore more closely related to soft-decision methods than to the traditional voice activity detection methods. Similar to soft-decision methods, it can also update the estimated noise PSD (Power Spectral Density) during speech activity.
The minimum statistics method rests on two observations, namely that the speech and the noise are usually statistically independent and that the power of a noisy speech signal frequently decays to the power level of the noise. It is therefore possible to derive an accurate noise PSD estimate by tracking the minimum of the noisy signal PSD. Since the minimum is smaller than (or in other cases equal to) the average value, the minimum tracking method necessitates a bias compensation.
The bias is a function of the variance of the smoothed signal PSD and as such depends on the smoothing parameter of the PSD estimator. In contrast to earlier work on minimum tracking, which utilizes a constant smoothing parameter and a constant minimum bias correction, a time and frequency dependent PSD smoothing is used, which also necessitates a time and frequency dependent bias compensation.
Using minimum tracking provides a rough estimate of the noise power. However, there are some shortcomings. The smoothing with a fixed smoothing parameter widens the peaks of speech activity of the smoothed PSD estimate. This will lead to inaccurate noise estimates as the sliding window for the minimum search might slip into broad peaks. Thus, smoothing parameters close to one cannot be used, and, as a consequence, the noise estimate will have a relatively large variance. Moreover, the noise estimate is biased toward lower values. Furthermore, in case of increasing noise power, the minimum tracking lags behind.
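The basic minimum tracking idea and its lag behavior can be illustrated for a single frequency bin with a simplified C sketch. It uses a fixed smoothing parameter and a short sliding window and omits the bias compensation and the time and frequency dependent smoothing of [Mar01]; the type MinTracker, the function names and the window length WIN are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define WIN 8   /* hypothetical search window length in frames */

typedef struct {
    float smoothed;       /* recursively smoothed power */
    float history[WIN];   /* last WIN smoothed values (circular buffer) */
    size_t pos;           /* next write position */
    size_t filled;        /* number of valid entries in history */
} MinTracker;

void min_tracker_init(MinTracker *t)
{
    t->smoothed = 0.0f;
    t->pos = 0;
    t->filled = 0;
}

/* Feed the power of one frame; returns the minimum of the smoothed power
 * over the sliding window. alpha is the fixed smoothing parameter. */
float min_tracker_update(MinTracker *t, float power, float alpha)
{
    assert(alpha >= 0.0f && alpha < 1.0f);
    t->smoothed = (t->filled == 0)
                      ? power
                      : alpha * t->smoothed + (1.0f - alpha) * power;
    t->history[t->pos] = t->smoothed;
    t->pos = (t->pos + 1) % WIN;
    if (t->filled < WIN) {
        t->filled++;
    }
    float m = t->history[0];
    for (size_t i = 1; i < t->filled; i++) {
        if (t->history[i] < m) {
            m = t->history[i];
        }
    }
    return m;
}
```

A short burst of higher power does not raise the returned minimum, whereas a lasting increase of the noise power is reflected only after the older, lower values have left the window, which is the lag noted above.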
MMSE based noise PSD tracking with low complexity [HHJ10] introduces a background noise PSD approach utilizing an MMSE search used on a DFT (Discrete Fourier Transform) spectrum. The algorithm consists of these processing steps:
    The maximum likelihood estimator is computed based on the noise PSD of the previous frame.
    The minimum mean square estimator is computed.
    The maximum likelihood estimator is estimated using the decision-directed approach [EM84].
    The inverse bias factor is computed assuming that speech and noise DFT coefficients are Gaussian distributed.
    The estimated noise power spectral density is smoothed.
There is also a safety-net approach applied in order to avoid a complete deadlock of the algorithm.
Tracking of non-stationary noise based on data-driven recursive noise power estimation [EH08] introduces a method for the estimation of the noise spectral variance from speech signals contaminated by highly non-stationary noise sources. This method also uses smoothing in time/frequency direction.
A low-complexity noise estimation algorithm based on smoothing of noise power estimation and estimation bias correction [Yu09] enhances the approach introduced in [EH08]. The main difference is that the spectral gain function for noise power estimation is found by an iterative data-driven method.
Statistical methods for the enhancement of noisy speech [Mar03] combine the minimum statistics approach given in [Mar01] by soft-decision gain modification [MCA99], by an estimation of the a-priori SNR [MCA99], by an adaptive gain limiting [MC99] and by an MMSE log spectral amplitude estimator [EM85].
Fade out is of particular interest for a plurality of speech and audio codecs, in particular, AMR (see [3GP12b]) (including ACELP and CNG), AMR-WB (see [3GP09c]) (including ACELP and CNG), AMR-WB+ (see [3GP09a]) (including ACELP, TCX and CNG), G.718 (see [ITU08a]), G.719 (see [ITU08b]), G.722 (see [ITU07]), G.722.1 (see [ITU05]), G.729 (see [ITU12, CPK08, PKJ+11]), MPEG-4 HE-AAC/Enhanced aacPlus (see [EBU10, EBU12, 3GP12e, LS01, QD03]) (including AAC and SBR), MPEG-4 HILN (see [ISO09, MEP01]) and OPUS (see [IET12]) (including SILK and CELT).
Depending on the codec, fade-out is performed in different domains:
For codecs that utilize LPC, the fade-out is performed in the linear predictive domain (also known as the excitation domain). This holds true for codecs which are based on ACELP, e.g., AMR, AMR-WB, the ACELP core of AMR-WB+, G.718, G.729, G.729.1, the SILK core in OPUS; for codecs which further process the excitation signal using a time-frequency transformation, e.g., the TCX core of AMR-WB+, the CELT core in OPUS; and for comfort noise generation (CNG) schemes that operate in the linear predictive domain, e.g., CNG in AMR, CNG in AMR-WB, CNG in AMR-WB+.
For codecs that directly transform the time signal into the frequency domain, the fade-out is performed in the spectral/subband domain. This holds true for codecs which are based on MDCT or a similar transformation, such as AAC in MPEG-4 HE-AAC, G.719, G.722 (subband domain) and G.722.1.
For parametric codecs, fade-out is applied in the parametric domain. This holds true for MPEG-4 HILN.
Regarding fade-out speed and fade-out curve, a fade-out is commonly realized by the application of an attenuation factor, which is applied to the signal representation in the appropriate domain. The size of the attenuation factor controls the fade-out speed and the fade-out curve. In most cases the attenuation factor is applied frame-wise, but a sample-wise application is also used; see, e.g., G.718 and G.722.
The attenuation factor for a certain signal segment might be provided in two manners, absolute and relative.
In the case where an attenuation factor is provided absolutely, the reference level is the one of the last received frame. Absolute attenuation factors usually start with a value close to 1 for the signal segment immediately after the last good frame and then decay faster or slower towards 0. The fade-out curve directly depends on these factors. This is, e.g., the case for the concealment described in Appendix IV of G.722 (see, in particular, [ITU07, figure IV.7]), where the possible fade-out curves are linear or gradually linear. Considering a gain factor $g(n)$, where $g(0)$ represents the gain factor of the last good frame, and an absolute attenuation factor $\alpha_{abs}(n)$, the gain factor of any subsequent lost frame can be derived as
$$g(n) = \alpha_{abs}(n) \cdot g(0) \qquad (21)$$
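As a minimal sketch of formula (21), the following Python snippet scales the gain of the last good frame by a sequence of absolute attenuation factors. The function names and the linear curve are illustrative assumptions; they are not taken from G.722 Appendix IV, whose actual curves are given in [ITU07, figure IV.7].

```python
def absolute_fade_gains(g0, alpha_abs):
    """Formula (21): g(n) = alpha_abs(n) * g(0) for each lost frame n = 1..N."""
    return [a * g0 for a in alpha_abs]


def linear_alpha_abs(num_frames):
    """Hypothetical linear curve: equal steps, reaching 0 at the last frame."""
    return [1.0 - n / num_frames for n in range(1, num_frames + 1)]


# Every gain refers back to g(0), so the curve shape is exactly alpha_abs(n).
gains = absolute_fade_gains(g0=0.8, alpha_abs=linear_alpha_abs(5))
```

Because each gain is referenced to the last good frame rather than to the previous concealed frame, changing one factor does not affect the gains of later frames.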
In the case where an attenuation factor is provided relatively, the reference level is the one from the previous frame. This has advantages in the case of a recursive concealment procedure, e.g., if the already attenuated signal is further processed and attenuated again.
If an attenuation factor is recursively applied, then this might be a fixed value independent of the number of consecutively lost frames, e.g., 0.5 for G.719 (see above); a fixed value relative to the number of consecutively lost frames, e.g., as proposed for G.729 in [CPK08]: 1.0 for the first two frames, 0.9 for the next two frames, 0.8 for the frames 5 and 6, and 0 for all subsequent frames (see above); or a value which is relative to the number of consecutively lost frames and which depends on signal characteristics, e.g., a faster fade-out for an unstable signal and a slower fade-out for a stable signal, e.g., G.718 (see section above and [ITU08a, table 44]).
Assuming a relative fade-out factor $0 \le \alpha_{rel}(n) \le 1$, where $n$ is the number of the lost frame ($n \ge 1$), the gain factor of any subsequent frame can be derived as
$$g(n) = \alpha_{rel}(n) \cdot g(n-1) \qquad (22)$$
$$g(n) = \left( \prod_{m=1}^{n} \alpha_{rel}(m) \right) \cdot g(0) \qquad (23)$$
$$g(n) = \alpha_{rel}^{n} \cdot g(0) \qquad (24)$$
where formula (24) holds for a constant factor $\alpha_{rel}$, resulting in an exponential fading.
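The relation between formulas (22), (23) and (24) can be illustrated with a short Python sketch. The function names are hypothetical. The stepwise schedule is the one quoted above for G.729 in [CPK08], with a single trailing 0 standing in for "all subsequent frames", and the constant factor 0.5 is the value quoted above for G.719.

```python
from itertools import accumulate
from operator import mul


def gains_recursive(g0, alpha_rel):
    """Formula (22): g(n) = alpha_rel(n) * g(n-1), starting from g(0) = g0."""
    gains, g = [], g0
    for a in alpha_rel:
        g = a * g
        gains.append(g)
    return gains


def gains_product(g0, alpha_rel):
    """Formula (23): g(n) = (product of alpha_rel(1..n)) * g(0)."""
    return [p * g0 for p in accumulate(alpha_rel, mul)]


# Frame-dependent schedule as quoted in the text for G.729 [CPK08].
cpk08_schedule = [1.0, 1.0, 0.9, 0.9, 0.8, 0.8, 0.0]
stepwise = gains_recursive(1.0, cpk08_schedule)

# Constant factor: the recursion collapses to formula (24), alpha**n * g(0).
constant = gains_recursive(1.0, [0.5] * 4)
closed_form = [0.5 ** n for n in range(1, 5)]
```

The recursive and the product form yield the same gains up to floating-point rounding; only with a constant factor does the curve become a pure exponential.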
Regarding the fade-out procedure, usually, the attenuation factor is specified, but in some application standards (DRM, DAB+) the latter is left to the manufacturer.
If different signal parts are faded separately, different attenuation factors might be applied, e.g., to fade tonal components with a certain speed and noise-like components with another speed (e.g., AMR, SILK).
Usually, a certain gain is applied to the whole frame. When the fading is performed in the spectral domain, this is the only way possible. However, if the fading is done in the time domain or the linear predictive domain, a more granular fading is possible. Such more granular fading is applied in G.718, where individual gain factors are derived for each sample by linear interpolation between the gain factor of the last frame and the gain factor of the current frame.
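A sample-wise fading of this kind might be sketched as follows in Python. The function names are hypothetical, and the choice that the ramp reaches the current frame's gain exactly at the last sample is an assumption; the text does not specify the endpoint handling used in G.718.

```python
def per_sample_gains(g_prev, g_curr, frame_len):
    """Linearly interpolate from the previous frame's gain to the current one.

    Assumption: the ramp reaches g_curr at the last sample of the frame.
    """
    step = (g_curr - g_prev) / frame_len
    return [g_prev + step * (i + 1) for i in range(frame_len)]


def apply_gains(samples, gains):
    """Scale each sample by its own gain factor."""
    return [s * g for s, g in zip(samples, gains)]


# Short frame of 4 samples for illustration; real frames are much longer.
ramp = per_sample_gains(g_prev=1.0, g_curr=0.5, frame_len=4)
faded = apply_gains([1.0, 1.0, 1.0, 1.0], ramp)
```

Compared with one gain per frame, the interpolated gains avoid a level step at the frame boundary.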
For codecs with a variable frame duration, a constant, relative attenuation factor leads to a different fade-out speed depending on the frame duration. This is, e.g., the case for AAC, where the frame duration depends on the sampling rate.
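The effect can be made concrete with a small Python calculation. The frame length of 1024 samples, the two sampling rates and the factor 0.5 are illustrative assumptions, and the function name is hypothetical.

```python
import math


def fade_speed_db_per_second(alpha_rel, frame_len, sample_rate_hz):
    """Fade-out speed in dB/s for a constant relative factor per frame."""
    db_per_frame = 20.0 * math.log10(alpha_rel)   # level change per frame
    frame_duration_s = frame_len / sample_rate_hz
    return db_per_frame / frame_duration_s


# Same factor and frame length, different sampling rates (assumed values).
speed_48k = fade_speed_db_per_second(0.5, 1024, 48000)  # about -282 dB/s
speed_24k = fade_speed_db_per_second(0.5, 1024, 24000)  # about -141 dB/s
```

Halving the sampling rate doubles the frame duration, so the same per-frame factor fades only half as fast in time.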
To adapt the applied fading curve to the temporal shape of the last received signal, the (static) fade-out factors might be further adjusted. Such further dynamic adjustment is, e.g., applied for AMR, where the median of the previous five gain factors is taken into account (see [3GP12b] and section 1.8.1). Before any attenuation is performed, the current gain is set to the median if the median is smaller than the last gain; otherwise the last gain is used. Moreover, such further dynamic adjustment is, e.g., applied for G.729, where the amplitude is predicted using linear regression of the previous gain factors (see [CPK08, PKJ+11] and section 1.6). In this case, the resulting gain factor for the first concealed frames might exceed the gain factor of the last received frame.
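The median rule described for AMR can be sketched in Python as follows. This only illustrates the selection of the starting gain as stated in the text; the function name is hypothetical, and the subsequent attenuation specified in [3GP12b] is not reproduced here.

```python
from statistics import median


def amr_style_start_gain(previous_gains):
    """Pick the gain to start the fade-out from.

    Uses the median of the last five gain factors if it is smaller than
    the last gain; otherwise the last gain is kept.
    """
    last_five = previous_gains[-5:]
    last = last_five[-1]
    med = median(last_five)
    return med if med < last else last


# A single outlier in the last frame is pulled down to the median.
after_spike = amr_style_start_gain([0.2, 0.3, 0.25, 0.3, 0.9])   # 0.3
# A decaying signal keeps its (already small) last gain.
after_decay = amr_style_start_gain([0.9, 0.8, 0.7, 0.6, 0.5])    # 0.5
```

The rule never raises the starting gain above the last received one, which distinguishes it from the regression-based prediction mentioned for G.729.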
Regarding the target level of the fade-out, with the exception of G.718 and CELT, the target level is 0 for all analyzed codecs, including those codecs' comfort noise generation (CNG).
In G.718, fading of the pitch excitation (representing tonal components) and fading of the random excitation (representing noise-like components) are performed separately. While the pitch gain factor is faded to zero, the innovation gain factor is faded to the CNG excitation energy.
Assuming that relative attenuation factors are given, this leads, based on formula (23), to the following absolute attenuation factor:
$$g(n) = \alpha_{rel}(n) \cdot g(n-1) + (1 - \alpha_{rel}(n)) \cdot g_n \qquad (25)$$
with $g_n$ being the gain of the excitation used during the comfort noise generation. This formula corresponds to formula (23) when $g_n = 0$.
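A minimal Python sketch of formula (25) shows how the gain converges to the comfort noise level instead of zero. The function name and the numeric values are illustrative assumptions.

```python
def fade_to_target(g0, alpha_rel, g_target):
    """Formula (25): g(n) = alpha_rel(n) * g(n-1) + (1 - alpha_rel(n)) * g_target."""
    gains, g = [], g0
    for a in alpha_rel:
        g = a * g + (1.0 - a) * g_target
        gains.append(g)
    return gains


# Fading towards a hypothetical comfort noise gain of 0.2:
# the gains approach 0.2 (roughly 0.6, 0.4, 0.3, 0.25).
to_noise = fade_to_target(g0=1.0, alpha_rel=[0.5] * 4, g_target=0.2)

# With a target of 0 the recursion is the plain relative fade-out.
to_zero = fade_to_target(g0=1.0, alpha_rel=[0.5] * 4, g_target=0.0)
```

Rearranging formula (25) gives g(n) − g_n = α_rel(n)·(g(n−1) − g_n), i.e., it is the distance to the target level that is attenuated by the relative factor in each frame.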
G.718 performs no fade-out in the case of DTX/CNG.
In CELT there is no fading towards the target level, but after 5 frames of tonal concealment (including a fade-out) the level is instantly switched to the target level at the 6th consecutively lost frame. The level is derived band-wise using formula (19).
Regarding the target spectral shape of the fade-out, all analyzed pure transform based codecs (AAC, G.719, G.722, G.722.1) as well as SBR simply prolong the spectral shape of the last good frame during the fade-out.
Various speech codecs fade the spectral shape to a mean using the LPC synthesis. The mean might be static (AMR) or adaptive (AMR-WB, AMR-WB+, G.718), where the latter is derived from a static mean and a short term mean (derived by averaging the last n LP coefficient sets) (LP=Linear Prediction).
All CNG modules in the discussed codecs AMR, AMR-WB, AMR-WB+, G.718 prolong the spectral shape of the last good frame during the fade-out.
Regarding background noise level tracing, there are five different approaches known from the literature:
Voice Activity Detector based: based on SNR/VAD, but very difficult to tune and hard to use for low SNR speech.
Soft-decision scheme: The soft-decision approach takes the probability of speech presence into account [SS98] [MPC89] [HE95].
Minimum statistics: The minimum of the PSD is tracked by holding a certain amount of values over time in a buffer, thus enabling the minimal noise to be found from the past samples [Mar01] [HHJ10] [EH08] [Yu09].
Kalman Filtering: The algorithm uses a series of measurements observed over time, containing noise (random variations), and produces estimates of the noise PSD that tend to be more precise than those based on a single measurement alone. The Kalman filter operates recursively on streams of noisy input data to produce a statistically optimal estimate of the system state [Gan05] [BJH06].
Subspace Decomposition: This approach tries to decompose a noisy signal into a clean speech signal and a noise part, utilizing for example the KLT (Karhunen-Loève transform, also known as principal component analysis) and/or the DFT (Discrete Fourier Transform). Then the eigenvectors/eigenvalues can be traced using an arbitrary smoothing algorithm [BP06] [HJH08].
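The buffer-based idea behind the minimum statistics approach can be sketched for a single frequency bin as follows. This is a strongly simplified illustration under stated assumptions: the class name, the window length and the smoothing constant are hypothetical, and the bias compensation and adaptive smoothing of [Mar01] are omitted.

```python
from collections import deque


class MinimumTracker:
    """Simplified minimum tracking of a smoothed power value (one bin)."""

    def __init__(self, window_len=8, smoothing=0.8):
        self.smoothing = smoothing                 # recursive smoothing constant
        self.smoothed = None                       # smoothed power estimate
        self.history = deque(maxlen=window_len)    # buffer of past smoothed values

    def update(self, power):
        """Feed one power value and return the current noise floor estimate."""
        if self.smoothed is None:
            self.smoothed = power
        else:
            self.smoothed = (self.smoothing * self.smoothed
                             + (1.0 - self.smoothing) * power)
        self.history.append(self.smoothed)
        return min(self.history)


# Noise floor of 1.0 with a short burst of speech-like energy (10.0).
tracker = MinimumTracker()
estimates = [tracker.update(p) for p in [1.0, 1.0, 10.0, 10.0, 1.0, 1.0]]
```

As long as the burst is shorter than the buffer, the minimum still stems from a noise-only frame, so the estimate stays at the noise floor without any explicit voice activity decision.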