The present invention relates to the field of coding/decoding digital signals.
The invention advantageously applies to coding/decoding sounds which may contain speech and music, either mixed together or alternating.
In order to efficiently code speech sounds at low bit rates, CELP type techniques (“Code Excited Linear Prediction”) are recommended. In order to efficiently code music sounds, transform coding techniques are recommended instead.
CELP type coders are predictive coders. Their objective is to model speech production from various elements: a short-term linear prediction to model the vocal tract, a long-term prediction to model the vocal cord vibration during a voiced period, and an excitation derived from a fixed dictionary (white noise, algebraic excitation) to represent the “innovation” which could not be modeled.
Transform coders such as MPEG AAC, AAC-LD, AAC-ELD, or ITU-T G.722.1 Annex C, for example, use critically sampled transforms in order to compact the signal in the transform domain. “Critically sampled transform” refers to a transform for which the number of coefficients in the transformed domain is equal to the number of time samples in each analyzed frame.
A solution for efficiently coding a signal with mixed speech/music content consists in selecting over time the best technique between at least two coding modes, one of CELP type, the other of transform type.
This is for example the case for the 3GPP AMR-WB+ and MPEG USAC (“Unified Speech Audio Coding”) codecs. The applications targeted by AMR-WB+ and USAC are not conversational, but correspond to storage and broadcasting services, without strong constraints on the algorithmic delay.
The initial version of the USAC codec, called RM0 (Reference Model 0), is described in the paper by M. Neuendorf et al., “A Novel Scheme for Low Bitrate Unified Speech and Audio Coding - MPEG RM0”, 7-10 May 2009, 126th AES Convention. This RM0 codec alternates between several coding modes:
    For speech type signals: LPD modes (for “Linear Predictive Domain”) comprising two different modes derived from the AMR-WB+ coding:
        An ACELP mode
        A TCX (Transform Coded eXcitation) mode called wLPT (for “weighted Linear Predictive Transform”) using an MDCT type transform (unlike the AMR-WB+ codec, which uses an FFT (“Fast Fourier Transform”)).
    For music type signals: FD mode (for “Frequency Domain”) using an MDCT transform coding (for “Modified Discrete Cosine Transform”) of MPEG AAC type (for “Advanced Audio Coding”) over 1024 samples.
In the USAC codec, the transitions between LPD and FD modes are crucial for ensuring sufficient quality without switching artefacts, given that each mode (ACELP, TCX, FD) has a specific “signature” (in terms of artefacts) and that the FD and LPD modes are different in nature: the FD mode is based on transform coding in the signal domain, whereas the LPD modes use linear predictive coding in the perceptually weighted domain, with filter memories that must be managed correctly. Inter-mode switching management in the USAC RM0 codec is detailed in the paper by J. Lecomte et al., “Efficient cross-fade windows for transitions between LPC-based and non-LPC based audio coding”, 7-10 May 2009, 126th AES Convention. As explained in this paper, the main difficulty lies in the transitions from LPD to FD modes and vice versa. Only the case of CELP to FD transitions is considered here.
In order to fully understand how it works, the principle of MDCT transform coding is recalled below through a typical implementation example.
At the coder, the MDCT transformation is typically divided into three steps, the signal being split into frames of M samples prior to MDCT coding:
    Weighting the signal by a window, here called “MDCT window”, with length 2M.
    Time-domain aliasing for forming a block with length M.
    DCT transforming (for “Discrete Cosine Transform”) with length M.
The MDCT window is divided into four adjacent portions with equal lengths M/2, here called “quarters”.
The signal is multiplied by the analysis window, and then aliasing is performed: the first quarter (windowed) is aliased (i.e. time reversed and overlapped) on the second quarter and the fourth quarter is aliased on the third.
More precisely, time-domain aliasing a quarter on another is performed in the following way: the first sample of the first quarter is added to (or subtracted from) the last sample of the second quarter, the second sample of the first quarter is added to (or subtracted from) the penultimate sample of the second quarter, and so on, until the last sample of the first quarter which is added to (or subtracted from) the first sample of the second quarter.
Therefore, from four quarters, 2 aliased quarters are obtained where each sample is the result of a linear combination of 2 signal samples to code. This linear combination induces time-domain aliasing.
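The folding described above can be sketched as follows. This is only an illustration: the function name `fold` is hypothetical, and the signs (subtraction on the left, addition on the right) are one assumed choice among the "added to (or subtracted from)" options mentioned in the text.

```python
def fold(windowed):
    """Fold a windowed block of 2M samples into M aliased samples.

    Assumed sign convention: the time-reversed first quarter is subtracted
    from the second quarter, and the time-reversed fourth quarter is added
    to the third quarter.
    """
    n = len(windowed)
    if n % 4 != 0:
        raise ValueError("block length must be a multiple of 4")
    q = n // 4  # quarter length, M/2
    a, b, c, d = (windowed[i * q:(i + 1) * q] for i in range(4))
    # First sample of the first quarter meets the last sample of the second.
    left = [b[i] - a[q - 1 - i] for i in range(q)]
    # Last sample of the fourth quarter meets the first sample of the third.
    right = [c[i] + d[q - 1 - i] for i in range(q)]
    return left + right


block = [1, 2, 3, 4, 5, 6, 7, 8]  # 2M = 8, quarters of length 2
print(fold(block))  # -> [1, 3, 13, 13]
```

Each output sample is a combination of exactly two input samples, which is why the block shrinks from 2M to M samples.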
The two aliased quarters are then jointly coded after the DCT transformation (of type IV). For the following frame there is a shift by half a window (i.e. 50% overlap): the third and fourth quarters of the preceding frame become the first and second quarters of the current frame. After aliasing, a second linear combination of the same sample pairs is sent, as in the preceding frame but with different weights.
At the decoder, after the inverse DCT transformation, the decoded version of these aliased signals is therefore obtained. Two consecutive frames contain the result of two different aliasing operations on the same quarters, i.e. for each sample pair there is the result of two linear combinations with different but known weights. An equation system may therefore be solved to obtain the decoded version of the input signal; time-domain aliasing may thus be eliminated using two consecutive decoded frames.
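As an illustration of this equation system, the sketch below builds the two linear combinations of one sample pair and solves the resulting 2x2 system explicitly. The sine window, the sign convention, and the names `alias_pair` and `solve_pair` are assumptions made for the example; the text does not specify them.

```python
import math


def sine_window(M):
    """Sine window of length 2M (assumed here; the text does not impose it)."""
    return [math.sin(math.pi * (n + 0.5) / (2 * M)) for n in range(2 * M)]


def alias_pair(c, d, h, M, i):
    """Two linear combinations of the pair (c, d) seen in consecutive frames.

    c is sample i of one quarter, d is the mirrored sample j of the next one.
    """
    q = M // 2
    j = q - 1 - i
    s1 = h[M + i] * c + h[M + q + j] * d  # frame t: 3rd and 4th quarters
    s2 = h[q + j] * d - h[i] * c          # frame t+1: 1st and 2nd quarters
    return s1, s2


def solve_pair(s1, s2, h, M, i):
    """Recover (c, d) from the two combinations with Cramer's rule."""
    q = M // 2
    j = q - 1 - i
    a11, a12 = h[M + i], h[M + q + j]
    a21, a22 = -h[i], h[q + j]
    det = a11 * a22 - a12 * a21  # equals 1 for the sine window
    c = (s1 * a22 - a12 * s2) / det
    d = (a11 * s2 - a21 * s1) / det
    return c, d


M, i = 8, 1
h = sine_window(M)
s1, s2 = alias_pair(0.7, -0.3, h, M, i)
print(solve_pair(s1, s2, h, M, i))  # close to (0.7, -0.3)
```

With this window the determinant is 1 for every pair, so the system is always well conditioned.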
The equation systems mentioned above are generally solved implicitly by unfolding, multiplying by a judiciously chosen synthesis window, and then adding with overlap the parts common to two consecutive decoded frames. This also ensures a smooth transition (without discontinuity due to quantization errors) between the two frames, since the operation behaves like an overlap-add. When the window is zero for every sample of the first quarter or of the fourth quarter, the MDCT transformation is said to be without time-domain aliasing in that part of the window. In this case the smooth transition is not ensured by the MDCT transformation; it must be obtained by other means, for example an external overlap-add.
It should be noted that there are variants of implementation for the MDCT transformation, in particular in the definition of the DCT transform and in the way the block to transform is time-domain aliased (for example, the signs applied to the quarters aliased left and right may be reversed, or the second and third quarters may be aliased on the first and fourth quarters respectively). These variants do not change the MDCT analysis-synthesis principle, with sample block reduction by windowing, time-domain aliasing, and transforming at analysis, and finally unfolding, windowing, and overlap-adding at synthesis.
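The complete analysis-synthesis chain recalled above can be sketched as follows, without quantization, to check that overlap-add removes the time-domain aliasing. This is a minimal sketch under stated assumptions: a sine window, a direct unnormalized DCT-IV, and one particular sign and ordering convention for folding. All function names are hypothetical, and the coefficients may differ by sign or ordering from those of a given standardized MDCT.

```python
import math


def sine_window(M):
    """Sine window of length 2M; satisfies h[n]**2 + h[n + M]**2 == 1."""
    return [math.sin(math.pi * (n + 0.5) / (2 * M)) for n in range(2 * M)]


def dct4(v):
    """Unnormalized DCT-IV, computed directly in O(M^2) for clarity."""
    M = len(v)
    return [sum(v[n] * math.cos(math.pi / M * (n + 0.5) * (k + 0.5))
                for n in range(M)) for k in range(M)]


def fold(x):
    """2M windowed samples -> M aliased samples (quarters a, b, c, d)."""
    q = len(x) // 4
    a, b, c, d = (x[i * q:(i + 1) * q] for i in range(4))
    return ([b[i] - a[q - 1 - i] for i in range(q)] +
            [c[i] + d[q - 1 - i] for i in range(q)])


def unfold(v):
    """M aliased samples -> 2M samples that still contain the aliasing."""
    q = len(v) // 2
    left, right = v[:q], v[q:]
    return [-s for s in reversed(left)] + left + right + right[::-1]


def analysis(block, h):
    """Coder side: windowing, time-domain aliasing, DCT-IV."""
    return dct4(fold([s * w for s, w in zip(block, h)]))


def synthesis(coeffs, h):
    """Decoder side: inverse DCT-IV, unfolding, windowing."""
    M = len(coeffs)
    # The DCT-IV is its own inverse up to a factor 2/M.
    aliased = [2.0 / M * s for s in dct4(coeffs)]
    return [s * w for s, w in zip(unfold(aliased), h)]


def roundtrip(signal, M):
    """Code and decode `signal` (length a multiple of M) with 50% overlap."""
    h = sine_window(M)
    padded = [0.0] * M + list(signal) + [0.0] * M
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * M + 1, M):
        decoded = synthesis(analysis(padded[start:start + 2 * M], h), h)
        for n, s in enumerate(decoded):
            out[start + n] += s  # overlap-add cancels the aliasing
    return out[M:M + len(signal)]


M = 8
signal = [math.sin(0.3 * n) + 0.1 * n for n in range(4 * M)]
error = max(abs(x - y) for x, y in zip(signal, roundtrip(signal, M)))
print(f"max reconstruction error: {error:.2e}")
```

Each frame taken alone decodes to an aliased signal; only the sum of two consecutive frames restores the input, which is why a frame without a transform-coded neighbour needs special handling.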
In the case of the USAC RM0 coder described in the paper by Lecomte et al., the transition between a frame coded by ACELP coding and a frame coded by FD coding is performed in the following way:
A transition window for the FD mode is used with an overlap to the left of 128 samples.
Time-domain aliasing on this overlap area is cancelled by introducing an artificial time-domain aliasing to the right of the reconstructed ACELP frame. The MDCT window used for the transition has a size of 2304 samples and the DCT transformation operates on 1152 samples, whereas the FD mode frames are normally coded with a window of 2048 samples and a DCT transformation of 1024 samples. Thus the MDCT transformation of the normal FD mode is not directly usable for the transition window; the coder must also integrate a modified version of this transformation, which complicates the implementation of the transition for the FD mode.
This state-of-the-art coding technique has an algorithmic delay on the order of 100 to 200 ms. This delay is incompatible with conversational applications, for which the coding delay is generally on the order of 20 to 25 ms for speech coders for mobile applications (e.g. GSM EFR, 3GPP AMR and AMR-WB) and on the order of 40 ms for conversational transform coders for videoconferencing (for example ITU-T G.722.1 Annex C and G.719). Moreover, occasionally increasing the DCT transformation size (2304 vs 2048) causes a peak in complexity at the moment of transition.
To overcome these disadvantages, international patent application WO2012/085451, the content of which is incorporated by reference into the present application, proposes a new method of coding a transition frame. The transition frame is defined as the transform coded current frame following a preceding frame coded by predictive coding. According to the above-mentioned new method, part of the transition frame, for example a sub-frame of 5 ms in the case of CELP coding at 12.8 kHz, and two extra CELP frames of 4 ms each in the case of CELP coding at 16 kHz, is coded by a predictive coding that is restricted with respect to the predictive coding of the preceding frame.
The restricted predictive coding consists in using the stable parameters of the preceding frame coded by predictive coding, such as for example the linear prediction filter coefficients, and in coding only some minimum parameters for the extra sub-frame in the transition frame.
As the preceding frame has not been coded with transform coding, cancelling time-domain aliasing in the first part of the frame is impossible. The above-mentioned patent application WO2012/085451 further proposes to modify the first half of the MDCT window so that there is no time-domain aliasing in the first quarter, which is usually aliased. It is also proposed to integrate part of the overlap-add between the decoded CELP frame and the decoded MDCT frame by modifying the coefficients of the analysis/synthesis window. In reference to FIG. 4e of the above-mentioned application, the chain-dotted lines (lines alternating between dots and dashes) correspond to the MDCT coding aliasing lines (top figure) and to the MDCT decoding aliasing lines (bottom figure). On the top figure, the bold lines separate the frames of new samples at the coder input. Coding of a new MDCT frame may be initiated when a frame of new input samples, as thus defined, is completely available. It is important to note that these bold lines at the coder do not delimit the current frame but the successive blocks of new samples arriving for each frame: the current frame is in fact delayed by 8.75 ms, corresponding to an anticipation called “lookahead”. On the bottom figure, the bold lines separate the decoded frames at the decoder output.
At the coder, the transition window is null until the aliasing point. Thus the coefficients of the left part of the aliased window are identical to those of the non-aliased window. The part between the aliasing point and the end of this transition (TR) CELP sub-frame corresponds to a sinusoidal half-window. At the decoder, after unfolding, the same window is applied to the signal. On the segment between the aliasing point and the start of the MDCT frame, the window coefficients therefore correspond to a sin² window. To ensure the overlap-add between the decoded CELP sub-frame and the signal originating from the MDCT, it suffices to apply a cos² type window to the overlapping part of the CELP sub-frame and to sum the latter with the MDCT frame. The method provides perfect reconstruction.
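The complementarity of the two weightings can be sketched as follows. The example assumes that the MDCT output in the overlap segment already carries the sin² weighting (sine half-window applied at analysis and again at synthesis), so only the CELP part needs the cos² weighting. The function name `cross_fade`, the half-sample phase of the window, and the overlap length are assumptions made for illustration, not values taken from WO2012/085451.

```python
import math


def cross_fade(celp_overlap, mdct_overlap):
    """Sum the cos^2-weighted CELP segment with the MDCT segment.

    `mdct_overlap` is assumed to be already weighted by sin^2.
    """
    L = len(celp_overlap)
    if len(mdct_overlap) != L:
        raise ValueError("both segments must cover the same overlap")
    out = []
    for n in range(L):
        cos2 = math.cos(math.pi * (n + 0.5) / (2 * L)) ** 2
        out.append(cos2 * celp_overlap[n] + mdct_overlap[n])
    return out


# Without quantization both decoders deliver the same samples x.
L = 16
x = [math.sin(0.4 * n) for n in range(L)]
sin2 = [math.sin(math.pi * (n + 0.5) / (2 * L)) ** 2 for n in range(L)]
mdct_part = [w * s for w, s in zip(sin2, x)]
result = cross_fade(x, mdct_part)
print(max(abs(r - s) for r, s in zip(result, x)))  # close to 0
```

Because sin² + cos² = 1 at every sample, the sum returns the original samples when neither branch introduces coding error, which is the perfect reconstruction property stated above.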
However, application WO2012/085451 provides for allocating, for coding the CELP sub-frame, a bit budget Btrans corresponding to the budget required for CELP coding of a classic frame, scaled down to a single sub-frame. The budget remaining for transform coding of the transition frame is then insufficient and may lead to a quality decrease at low bit rates.