The present invention relates to the field of the transmission of coded audio signals, more specifically to a method and an apparatus for obtaining, or acquiring, spectrum coefficients for a replacement frame of an audio signal, to an audio decoder, to an audio receiver and to a system for transmitting audio signals. Embodiments relate to an approach for constructing a spectrum for a replacement frame based on previously received frames.
In conventional technology, several approaches are described dealing with a frame-loss at an audio receiver. For example, when a frame is lost on the receiver side of an audio or speech codec, simple methods for the frame-loss-concealment as described in P. Lauber and R. Sperschneider, “Error Concealment for Compressed Digital Audio,” in AES 111th Convention, New York, USA, 2001 (hereinafter “the Lauber reference”) may be used, such as:
repeating the last received frame,
muting the lost frame, or
sign scrambling.
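By way of illustration only, the three simple methods listed above may be sketched as follows. The function names and the representation of a frame as a list of spectral coefficients are assumptions made for this sketch and are not taken from the Lauber reference.

```python
import random


def conceal_repeat(last_frame):
    """Repeat the spectral coefficients of the last received frame."""
    return list(last_frame)


def conceal_mute(last_frame):
    """Replace the lost frame by silence (all-zero coefficients)."""
    return [0.0] * len(last_frame)


def conceal_sign_scramble(last_frame, rng=None):
    """Keep the magnitudes of the last frame but randomize every sign."""
    rng = rng or random.Random()
    return [c if rng.random() < 0.5 else -c for c in last_frame]
```

Sign scrambling keeps the magnitude spectrum of the last frame while decorrelating its phase, which is why only the absolute values are preserved in the sketch.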
Additionally, in the Lauber reference, an advanced technique using predictors in sub-bands is presented. The predictor technique is then combined with sign scrambling, and the prediction gain is used as a sub-band wise decision criterion to determine which method will be used for the spectral coefficients of this sub-band.
In U.S. Pat. No. 6,351,730 B2 (C. J. Hwey, “Low-complexity, low-delay, scalable and embedded speech and audio coding with adaptive frame loss concealment,” hereinafter “the '730 Patent”), a waveform signal extrapolation in the time domain is used for an MDCT (Modified Discrete Cosine Transform) domain codec. This kind of approach may be good for monophonic signals including speech.
If a delay of one frame is allowed, an interpolation of the surrounding frames can be used for the construction of the lost frame. Such an approach is described in US Patent Application Publication No. 2007/094009 A1 (S. K. Gupta, E. Choy and S.-U. Ryu, “Encoder-assisted frame loss concealment techniques for audio coding,” hereinafter “the '009 Publication”), where the magnitudes of the tonal components in the lost frame with an index m are interpolated using the neighboring frames indexed m−1 and m+1. The side information that defines the MDCT coefficient signs for tonal components is transmitted in the bit-stream. Sign scrambling is used for the other, non-tonal MDCT coefficients. The tonal components are determined as a predetermined fixed number n of spectral coefficients with the highest magnitudes.
$$C_m^*(k) = \frac{1}{2}\left(C_{m-1}(k) + C_{m+1}(k)\right)$$
FIG. 7 shows a block diagram representing an interpolation approach without transmitted side information, as described, for example, in S.-U. Ryu and K. Rose, “A Frame Loss Concealment Technique for MPEG-AAC,” in 120th AES Convention, Paris, France, 2006 (hereinafter “Ryu 2006/Paris”). The interpolation approach operates on audio frames coded in the frequency domain using the MDCT (modified discrete cosine transform). A frame interpolation block 700 receives the MDCT coefficients of the frame preceding the lost frame and of the frame following the lost frame; more specifically, in the approach described with regard to FIG. 7, it receives the MDCT coefficients Cm−1(k) of the preceding frame and the MDCT coefficients Cm+1(k) of the following frame. The frame interpolation block 700 generates an interpolated MDCT coefficient Cm(k) for the current frame, which has either been lost at the receiver or cannot be processed at the receiver for other reasons, for example due to errors in the received data. The interpolated MDCT coefficient Cm(k) output by the frame interpolation block 700 is applied to block 702, which performs a magnitude scaling within a scale factor band, and to block 704, which performs a magnitude scaling within an index set; the blocks 702 and 704 output the MDCT coefficient Cm(k) scaled by the factors α̂(k) and α̃(k), respectively. The output signal of block 702 is input into the pseudo spectrum block 706, which generates from it the pseudo spectrum P̂m(k); the pseudo spectrum is input into the peak detection block 708, which generates a signal indicating detected peaks. The signal provided by block 702 is also applied to the random sign change block 712 which, responsive to the peak detection signal generated by block 708, causes a sign change of the received signal and outputs a modified MDCT coefficient Ĉm(k) to the spectrum composition block 710.
The scaled signal provided by block 704 is applied to a sign correction block 714 which, in response to the peak detection signal provided by block 708, causes a sign correction of the scaled signal and outputs a modified MDCT coefficient C̃m(k) to the spectrum composition block 710. On the basis of the received signals, the spectrum composition block 710 generates and outputs the interpolated MDCT coefficient C*m(k). As is shown in FIG. 7, the peak detection signal provided by block 708 is also provided to block 704 generating the scaled MDCT coefficient.
The arrangement of FIG. 7 generates at the output of the block 714 the spectral coefficients C̃m(k) for the lost frame associated with tonal components, and at the output of the block 712 the spectral coefficients Ĉm(k) for non-tonal components, so that the spectrum composition block 710 provides, on the basis of the spectral coefficients received for the tonal and non-tonal components, the spectral coefficients for the spectrum associated with the lost frame.
The operation of the FLC (Frame Loss Concealment) technique described in the block diagram of FIG. 7 will now be described in further detail.
In FIG. 7, basically, four modules can be distinguished:
a shaped-noise insertion module (including the frame interpolation 700, the magnitude scaling within the scale factor band 702 and the random sign change 712),
an MDCT bin classification module (including the pseudo spectrum 706 and the peak detection 708),
a tonal concealment operations module (including the magnitude scaling within the index set 704 and the sign correction 714), and
the spectrum composition 710.
The approach is based on the following general formula:
$$C_m(k) = C_m^*(k)\,\alpha^*(k)\,s^*(k), \quad 0 \le k < M$$
C*m(k) is derived by a bin-wise interpolation (see block 700 “Frame Interpolation”):
$$C_m^*(k) = \frac{1}{2}\left(C_{m-1}(k) + C_{m+1}(k)\right)$$
α*(k) is derived by an energy interpolation using the geometric mean:
scale factor band wise for all components (see block 702 “Magnitude Scaling in Scalefactor Band”), and
index sub-set wise for tonal components (see block 704 “Magnitude Scaling within Index Set”):
$$\left(\alpha^*\right)^2(k) = \frac{\sqrt{E_{m+1}\,E_{m-1}}}{E_m}$$
For tonal components it can be shown that α=cos(πfl), with fl being the frequency of the tonal component.
The energies E are derived based on a pseudo power spectrum, derived by a simple smoothing operation:
$$P(k) \cong C^2(k) + \left\{C(k+1) - C(k-1)\right\}^2$$
s*(k) is set randomly to ±1 for non-tonal components (see block 712 “Random Sign Change”), and to either +1 or −1 for tonal components (see block 714 “Sign Correction”).
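By way of illustration only, the shaped-noise insertion described above (bin-wise interpolation, energy scaling per scale factor band and random sign) may be sketched as follows. The sketch assumes that the scaling factor is the square root of the geometric mean of the neighboring band energies divided by the energy of the interpolated band, that coefficients outside the spectrum are zero when the pseudo power spectrum is formed, and that the band limits are supplied by the caller; the function names and the `bands` parameter are hypothetical.

```python
import math
import random


def pseudo_power(c):
    """P(k) ~= C(k)^2 + (C(k+1) - C(k-1))^2.

    Assumption: coefficients outside the spectrum are treated as zero.
    """
    n = len(c)
    p = []
    for k in range(n):
        left = c[k - 1] if k > 0 else 0.0
        right = c[k + 1] if k < n - 1 else 0.0
        p.append(c[k] ** 2 + (right - left) ** 2)
    return p


def shaped_noise_frame(c_prev, c_next, bands, rng=None):
    """Conceal frame m as C*_m(k) * alpha*(k) * s*(k) with random signs.

    bands: hypothetical list of (start, stop) bin ranges, one per band.
    """
    rng = rng or random.Random()
    # Bin-wise interpolation of the two neighboring frames.
    c_star = [0.5 * (a + b) for a, b in zip(c_prev, c_next)]
    p_prev, p_next, p_star = (pseudo_power(c) for c in (c_prev, c_next, c_star))
    out = list(c_star)  # bins outside every band keep the plain interpolation
    for lo, hi in bands:
        e_prev, e_next, e_star = (sum(p[lo:hi]) for p in (p_prev, p_next, p_star))
        # alpha^2 = sqrt(E_{m+1} * E_{m-1}) / E_m; use 0 for a band without energy.
        alpha = math.sqrt(math.sqrt(e_next * e_prev) / e_star) if e_star > 0 else 0.0
        for k in range(lo, hi):
            sign = 1.0 if rng.random() < 0.5 else -1.0
            out[k] = c_star[k] * alpha * sign
    return out
```

When both neighboring frames are identical, the scaling factor evaluates to one and only the signs of the interpolated coefficients change.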
The peak detection is performed as searching for local maxima in the pseudo power spectrum to detect the exact positions of the spectral peaks corresponding to the underlying sinusoids. It is based on the tone identification process adopted in the MPEG-1 psychoacoustic model described in ISO/IEC JTC1/SC29/WG11, Information technology—Coding of moving pictures and associated, International Organization for Standardization, 1993. Out of this, an index sub-set is defined having the bandwidth of an analysis window's main-lobe in terms of MDCT bins and the detected peak in its center. Those bins are treated as tone dominant MDCT bins of a sinusoid, and the index sub-set is treated as an individual tonal component.
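By way of illustration only, the search for local maxima and the construction of an index sub-set around each detected peak may be sketched as follows. The main-lobe width in bins is passed in as a parameter (`width`, assumed to be odd), since its value depends on the analysis window; both function names are hypothetical.

```python
def detect_peaks(p):
    """Return the bins whose pseudo power exceeds that of both neighbors."""
    return [k for k in range(1, len(p) - 1) if p[k] > p[k - 1] and p[k] > p[k + 1]]


def index_subset(peak, width, num_bins):
    """Bins of one tonal component: `width` bins centered on the peak.

    The range is clipped at the edges of the spectrum.
    """
    half = width // 2
    return list(range(max(0, peak - half), min(num_bins, peak + half + 1)))
```

For example, `detect_peaks([0, 1, 5, 1, 0, 3, 9, 2])` yields the bins 2 and 6, and `index_subset(2, 3, 8)` yields the bins 1 to 3.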
The sign correction s*(k) flips either the signs of all bins of a certain tonal component, or none. The determination is performed using an analysis by synthesis, i.e., the SFM (spectral flatness measure) is derived for both versions and the version with the lower SFM is chosen. For the SFM derivation, the power spectrum is needed, which in turn may use the MDST (Modified Discrete Sine Transform) coefficients. To keep the complexity manageable, only the MDST coefficients for the tonal component are derived, using only the MDCT coefficients of this tonal component.
FIG. 8 shows a block diagram of an overall FLC technique which is refined when compared to the approach of FIG. 7 and which is described in S.-U. Ryu and R. Kenneth, An MDCT domain frame-loss concealment technique for MPEG Advanced Audio Coding, Department of Electrical and Computer Engineering, University of California, 2007 (hereinafter “Ryu 2007”). In FIG. 8, the MDCT coefficients Cm−1 and Cm+1 of the last frame preceding the lost frame and the first frame following the lost frame are received at an MDCT bin classification block 800. These coefficients are also provided to the shaped-noise insertion block 802 and to the MDCT estimation for tonal components block 804. Block 804 also receives the output signal provided by the classification block 800 as well as the MDCT coefficients Cm−2 and Cm+2 of the second to last frame preceding the lost frame and the second frame following the lost frame, respectively. The block 804 generates the MDCT coefficients C̃m of the lost frame for the tonal components, and the shaped-noise insertion block 802 generates the MDCT spectral coefficients Ĉm of the lost frame for the non-tonal components. These coefficients are supplied to the spectrum composition block 806, which generates at its output the spectral coefficients C*m for the lost frame. The shaped-noise insertion block 802 operates in response to the system IT generated by the estimation block 804.
The following modifications are of interest with respect to the Ryu 2006/Paris reference:
The pseudo power spectrum used for the peak detection is derived as
$$P_m(k) = C_{m-1}^2(k) + C_{m+1}^2(k)$$
To eliminate perceptually irrelevant or spurious peaks, the peak detection is only applied to a limited spectral range, and only local maxima that exceed a relative threshold to the absolute maximum of the pseudo power spectrum are considered. The remaining peaks are sorted in descending order of their magnitude, and a pre-specified number of top-ranking maxima are classified as tonal peaks.
The approach is based on the following general formula (with α being signed this time):
$$C_m(k) = C_m^*(k)\,\alpha(k), \quad 0 \le k < M$$
C*m(k) is derived as above, but the derivation of α becomes more advanced, following the approach
$$E_m(\alpha) = \frac{1}{2}\left\{E_{m-1}(\alpha) + E_{m+1}(\alpha)\right\}$$
Substituting Em, Em−1, and Em+1 with
$$E_{m-1}(\alpha) \cong |c_{m-1}|^2 + |s_{m-1}|^2 = |c_{m-1}|^2 + |\xi_1 + \alpha\zeta_1|^2$$
$$E_m(\alpha) \cong \alpha^2|c_m|^2 + |s_m|^2 = \alpha^2|c_m|^2 + |\xi_2 + \alpha\zeta_2|^2$$
$$E_{m+1}(\alpha) \cong |c_{m+1}|^2 + |s_{m+1}|^2 = |c_{m+1}|^2 + |\xi_3 + \alpha\zeta_3|^2$$
whereas
$$s_{m-1} \cong A_1 c_{m-2} + A_2 c_{m-1} + \alpha A_3 c_m = \xi_1 + \alpha\zeta_1$$
$$s_m \cong A_1 c_{m-1} + \alpha A_2 c_m + A_3 c_{m+1} = \xi_2 + \alpha\zeta_2$$
$$s_{m+1} \cong \alpha A_1 c_m + A_2 c_{m+1} + A_3 c_{m+2} = \xi_3 + \alpha\zeta_3$$
yields an expression that is quadratic in α. Hence, for the given MDCT estimate there exist two candidates (with opposite signs) for the multiplicative correction factor (A1, A2, A3 are the transformation matrices). The selection of the better estimate is performed similar to what is described in the Ryu 2006/Paris reference.
This advanced approach may use two frames before and after the frame loss in order to derive the MDST coefficients of the previous and the subsequent frame.
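By way of illustration only, the quadratic in α can be written out by expanding the three energies in the requirement that the energy of the lost frame equals the mean of the energies of its two neighbors, and collecting powers of α. The sketch below assumes that the vectors ξ1..ξ3 and ζ1..ζ3 have already been computed with the transformation matrices A1, A2, A3 (which are not modeled here) and that all quantities are real-valued lists; the function and parameter names are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equally long real vectors."""
    return sum(x * y for x, y in zip(u, v))


def alpha_candidates(c_prev, c_est, c_next, xi, zeta):
    """Solve E_m(alpha) = (E_{m-1}(alpha) + E_{m+1}(alpha)) / 2 for alpha.

    c_est is the MDCT estimate of the lost frame (restricted to one tonal
    component); xi and zeta are 3-tuples (xi1, xi2, xi3), (zeta1, zeta2, zeta3).
    Returns the real roots in ascending order, or an empty list if none exist.
    """
    # Coefficients of a*alpha^2 + b*alpha + c = 0 after expanding the energies.
    a = dot(c_est, c_est) + dot(zeta[1], zeta[1]) \
        - 0.5 * (dot(zeta[0], zeta[0]) + dot(zeta[2], zeta[2]))
    b = 2.0 * dot(xi[1], zeta[1]) - (dot(xi[0], zeta[0]) + dot(xi[2], zeta[2]))
    c = dot(xi[1], xi[1]) - 0.5 * (dot(c_prev, c_prev) + dot(xi[0], xi[0])
                                   + dot(c_next, c_next) + dot(xi[2], xi[2]))
    if a == 0.0:
        raise ValueError("degenerate case: the equation is not quadratic in alpha")
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return []
    root = math.sqrt(disc)
    return sorted([(-b - root) / (2.0 * a), (-b + root) / (2.0 * a)])
```

Choosing between the two returned candidates is a separate step and is not part of this sketch.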
A delay-less version of this approach is suggested in S.-U. Ryu, Source Modeling Approaches to Enhanced Decoding in Lossy Audio Compression and Communication, UNIVERSITY of CALIFORNIA Santa Barbara, 2006 (hereinafter “Ryu 2006/California”):
As a starting point, the interpolation formula C*m(k)=½(Cm−1(k)+Cm+1(k)) is reused, but is applied for the frame m−1, resulting in:
$$C_m(k) = 2C_{m-1}^*(k) - C_{m-2}(k)$$
Then, the interpolation result C*m−1 is replaced by the true estimation (here, the factor 2 becomes part of the correction factor: α=2 cos(πfl)), which leads to
$$C_m(k) = \alpha\,C_{m-1}(k) - C_{m-2}(k)$$
The correction factor is determined by observing the energies of two previous frames. For the energy computation, the MDST coefficients of the previous frame are approximated as
$$s_{m-1} \cong (A_1 - A_3)\,c_{m-2} + A_2 c_{m-1} + \alpha A_3 c_{m-1} = \xi_0 + \alpha\zeta_0$$
Then, the sinusoidal energy is computed as
$$E_{m-1}(\alpha) \cong |c_{m-1}|^2 + |s_{m-1}|^2 = |c_{m-1}|^2 + |\xi_0 + \alpha\zeta_0|^2$$
Similarly, the sinusoidal energy for frame m−2 is computed and denoted by Em−2, which is independent of α.
Employing the energy requirement
$$E_{m-1}(\alpha) = E_{m-2}$$
yields again an expression that is quadratic in α.
The selection process for the candidates computed is performed as before, but the decision rule takes into account only the power spectrum of the previous frame.
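By way of illustration only, the two-frame extrapolation with a correction factor may be sketched as follows. The recurrence has the same form as the trigonometric identity cos(θ+δ) = 2 cos(δ) cos(θ) − cos(θ−δ), which is why a factor of the form 2 cos(·) reproduces a sinusoid advancing by a constant phase step. How the per-bin factor is actually obtained (the energy requirement above) is not modeled; the function name and the per-bin list `alpha` are assumptions of this sketch.

```python
import math


def extrapolate_frame(c_prev1, c_prev2, alpha):
    """Bin-wise C_m(k) = alpha(k) * C_{m-1}(k) - C_{m-2}(k).

    c_prev1: coefficients of frame m-1, c_prev2: coefficients of frame m-2,
    alpha: one correction factor per bin (hypothetical input).
    """
    return [a * x1 - x2 for a, x1, x2 in zip(alpha, c_prev1, c_prev2)]


# Usage: a value sequence cos(n*delta + phi) is continued exactly
# when alpha = 2*cos(delta).
delta, phi = 0.3, 0.7
x0, x1, x2 = (math.cos(n * delta + phi) for n in range(3))
predicted = extrapolate_frame([x1], [x0], [2.0 * math.cos(delta)])
```

In the usage lines, `predicted[0]` agrees with `x2` up to rounding error.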
Another delay-less frame-loss-concealment in the frequency domain is described in European Patent No. EP 0574288 B1 (M. Yannick, “Method and apparatus for transmission error concealment of frequency transform coded digital audio signals,” hereinafter “the '288 Patent”). The teachings of the '288 Patent can be simplified, without loss of generality, as:
Prediction using a DFT of a time signal:
    (a) Obtain the DFT spectrum from the decoded time domain signal that corresponds to the received coded frequency domain coefficients Cm.
    (b) Modulate the DFT magnitudes, assuming a linear phase change, to predict the missing frequency domain coefficients in the next frame Cm+1.
Prediction using a magnitude estimation from the received frequency spectra:
    (a) Find C′m and S′m, using Cm as input, such that
    $$C'_m(k) = Q_m(k)\cos(\varphi_m(k) + \chi)$$
    $$S'_m(k) = Q_m(k)\sin(\varphi_m(k) + \chi)$$
    where Qm(k) is the magnitude of the DFT coefficient that corresponds to Cm(k).
    (b) Calculate:
    $$Q_m(k) = \sqrt{\left|C'_m(k)\right|^2 + \left|S'_m(k)\right|^2}$$
    $$\varphi_m(k) = \arccos\frac{C_m(k)}{Q_m(k)}$$
    (c) Perform a linear extrapolation of the magnitude and the phase:
    $$Q_{m+1}(k) = 2Q_m(k) - Q_{m-1}(k)$$
    $$\varphi_{m+1}(k) = 2\varphi_m(k) - \varphi_{m-1}(k)$$
    $$C_{m+1}(k) = Q_{m+1}(k)\cos(\varphi_{m+1}(k))$$
Use filters to calculate C′m and S′m from Cm and then proceed as above to get Cm+1(k).
Use an adaptive filter to calculate Cm+1(k):
$$C_{m+1}(k) = \sum_{i=0}^{I} a_{m,i}(k) + C_{m-i}(k)$$
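By way of illustration only, steps (b) and (c) of the prediction using a magnitude estimation may be sketched for a single bin as follows. The sketch assumes real-valued inputs and adds two safeguards that are not taken from the '288 Patent: a zero magnitude yields a zero phase, and the arccos argument is clamped to the interval from −1 to 1. The function names are hypothetical.

```python
import math


def magnitude_and_phase(c_prime, s_prime, c):
    """Step (b): Q = sqrt(C'^2 + S'^2) and phi = arccos(C / Q) for one bin."""
    q = math.hypot(c_prime, s_prime)
    if q == 0.0:
        return 0.0, 0.0  # safeguard (assumption): no energy, phase set to zero
    ratio = max(-1.0, min(1.0, c / q))  # safeguard against rounding overshoot
    return q, math.acos(ratio)


def extrapolate_bin(q_prev, q_cur, phi_prev, phi_cur):
    """Step (c): linear extrapolation of magnitude and phase for one bin."""
    q_next = 2.0 * q_cur - q_prev
    phi_next = 2.0 * phi_cur - phi_prev
    return q_next * math.cos(phi_next)
```

For example, a magnitude rising from 1.0 to 1.5 and a phase rising from 0 to π/2 extrapolate to a magnitude of 2.0 and a phase of π, that is, a predicted coefficient of approximately −2.0.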
The selection of spectrum coefficients to be predicted is mentioned in the '288 Patent but is not described in detail.
In Y. Mahieux, J.-P. Petit and A. Charbonnier, “Transform coding of audio signals using correlation between successive transform blocks,” in Acoustics, Speech, and Signal Processing, 1989. ICASSP-89, 1989, it has been recognized that, for quasi-stationary signals, the phase difference between successive frames is almost constant and depends only on the fractional frequency. However, only a linear extrapolation from the last two complex spectra is used.
In AMR-WB+ (see 3GPP; Technical Specification Group Services and System Aspects, Extended Adaptive Multi-Rate-Wideband (AMR-WB+) codec, 2009) a method described in U.S. Pat. No. 7,356,748 B2 (A. Taleb, “Partial Spectral Loss Concealment in Transform Codecs,” hereinafter “the '748 Patent”) is used. The method in the '748 Patent is an extension of the method described in the '288 Patent in the sense that it also uses the available spectral coefficients of the current frame, assuming that only a part of the current frame is lost. However, the situation of a complete loss of a frame is not considered in the '748 Patent.
Another delay-less frame-loss-concealment in the MDCT domain is described in US Patent Application Publication No. 2012/109659 A1 (C. Guoming, D. Zheng, H. Yuan, J. Li, J. Lu, K. Liu, K. Peng, L. Zhibin, M. Wu and Q. Xiaojun, “Compensator and Compensation Method for Audio Frame Loss in Modified Discrete Cosine Transform Domain,” hereinafter “the '659 Publication”). In the '659 Publication, it is first determined whether the lost Pth frame is a multiple-harmonic frame. The lost Pth frame is a multiple-harmonic frame if more than K0 frames among the K frames before the Pth frame have a spectrum flatness smaller than a threshold value. If the lost Pth frame is a multiple-harmonic frame, then the (P−K)th to (P−2)nd frames in the MDCT-MDST domain are used to predict the lost Pth frame. A spectral coefficient is a peak if its power spectrum is bigger than the two adjacent power spectrum coefficients. A pseudo spectrum as described in L. S. M. Dauder, “MDCT Analysis of Sinusoids: Exact Results and Applications to Coding Artifacts Reduction,” IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, pp. 302-312, 2004 (hereinafter “Dauder”), is used for the (P−1)st frame.
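By way of illustration only, the multiple-harmonic decision may be sketched as follows. The sketch assumes the common definition of spectral flatness as the ratio of the geometric mean to the arithmetic mean of the power spectrum; whether the '659 Publication uses exactly this definition is not stated above. The handling of all-zero frames and of zero-valued bins is likewise an assumption, and the function names are hypothetical.

```python
import math


def spectral_flatness(power):
    """Geometric mean divided by arithmetic mean of a power spectrum."""
    n = len(power)
    arith = sum(power) / n
    if arith == 0.0:
        return 1.0  # assumption: a silent frame is treated as flat
    if min(power) <= 0.0:
        return 0.0  # a zero bin makes the geometric mean vanish
    geo = math.exp(sum(math.log(p) for p in power) / n)
    return geo / arith


def is_multiple_harmonic(prev_power_frames, k0, threshold):
    """True if more than k0 of the K preceding frames are below the threshold."""
    low = sum(1 for frame in prev_power_frames
              if spectral_flatness(frame) < threshold)
    return low > k0
```

A flat spectrum gives a flatness of one, whereas a spectrum dominated by a few strong bins gives a value close to zero, so a low flatness over several preceding frames indicates strongly harmonic content.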
A set of spectral coefficients Sc is constructed from L1 power spectrum frames as follows.
First, L1 sets S1, . . . , SL1 composed of the peaks in each of the L1 frames are obtained, the number of peaks in each set being N1, . . . , NL1, respectively. A set Si is selected from the L1 sets S1, . . . , SL1. For each peak coefficient mj, j=1 . . . Ni, in the set Si, it is judged whether any frequency coefficient among mj, mj±1, . . . , mj±k belongs to all other peak sets. If so, all the frequencies mj, mj±1, . . . , mj±k are put into the frequency set SC. If no frequency coefficient belongs to all other peak sets, all the frequency coefficients in a frame are put directly into the frequency set SC. Here, k is a nonnegative integer. For all spectral coefficients in the set SC, the phase is predicted using L2 frames among the (P−K)th to (P−2)nd MDCT-MDST frames. The prediction uses a linear extrapolation (when L2=2) or a linear fit (when L2>2). For the linear extrapolation:
$$\hat{\varphi}^{p}(m) = \varphi^{t1}(m) + \frac{p - t1}{t1 - t2}\left[\varphi^{t1}(m) - \varphi^{t2}(m)\right]$$
where p, t1 and t2 are frame indices.
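By way of illustration only, the linear phase extrapolation for one spectral coefficient may be sketched as follows. Phase unwrapping is not addressed, and the function and parameter names are hypothetical.

```python
def extrapolate_phase(phi_t1, phi_t2, p, t1, t2):
    """Predict the phase in frame p from the phases in frames t1 and t2.

    Implements phi_t1 + (p - t1) / (t1 - t2) * (phi_t1 - phi_t2).
    """
    if t1 == t2:
        raise ValueError("t1 and t2 must be different frame indices")
    return phi_t1 + (p - t1) / (t1 - t2) * (phi_t1 - phi_t2)
```

For example, a phase of 0.2 in frame 3 and of 0.5 in frame 4 corresponds to an advance of 0.3 per frame, so the phase predicted for frame 6 is approximately 1.1.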
The spectral coefficients not in the set SC are obtained using a plurality of frames before the (P−1)st frame, without specifically explaining how.