To code speech sounds effectively, techniques of the CELP (“Code Excited Linear Prediction”) type, or its ACELP (“Algebraic Code Excited Linear Prediction”) variant, are advocated. Alternatives to CELP coding, such as the BV16, BV32, iLBC or SILK coders, have also been proposed more recently. Transform coding techniques, on the other hand, are advocated to code musical sounds effectively.
Linear prediction coders, and more particularly those of CELP type, are predictive coders. Their aim is to model the production of speech on the basis of at least some part of the following elements: a short-term linear prediction to model the vocal tract, a long-term prediction to model the vibration of the vocal cords in a voiced period, and an excitation derived from a vector quantization dictionary in general termed a fixed dictionary (white noise, algebraic excitation) to represent the “innovation” which it was not possible to model by prediction.
The most widely used transform coders (the MPEG AAC or ITU-T G.722.1 Annex C coders, for example) use critical-sampling transforms of the MDCT (“Modified Discrete Cosine Transform”) type so as to compact the signal in the transformed domain. “Critical-sampling transform” refers to a transform for which the number of coefficients in the transformed domain is equal to the number of temporal samples analyzed.
A solution for effectively coding a signal containing these two types of content consists in selecting the best technique over time (frame by frame). This solution has in particular been advocated by the 3GPP (“3rd Generation Partnership Project”) standardization body through a technique named AMR-WB+ (or Extended AMR-WB) and more recently by the MPEG-D USAC (“Unified Speech Audio Coding”) codec. The applications envisaged by AMR-WB+ and USAC are not conversational, but correspond to broadcasting and storage services, without heavy constraints on the algorithmic delay.
The USAC standard is published in the ISO/IEC document 23003-3:2012, Information technology—MPEG audio technologies—Part 3: Unified speech and audio coding.
By way of illustration, the initial version of the USAC codec, called RM0 (Reference Model 0), is described in the article by M. Neuendorf et al., A Novel Scheme for Low Bitrate Unified Speech and Audio Coding - MPEG RM0, 7-10 May 2009, 126th AES Convention. This codec alternates between at least two modes of coding:
        For signals of speech type: LPD (“Linear Predictive Domain”) modes using an ACELP technique
        For signals of music type: FD (“Frequency Domain”) mode using an MDCT (“Modified Discrete Cosine Transform”) technique.
The principles of the ACELP and MDCT codings are recalled hereinbelow.
On the one hand, CELP coding, including its ACELP variant, is a predictive coding based on the source-filter model. In general the filter corresponds to an all-pole filter with transfer function 1/A(z) obtained by linear prediction (LPC, for Linear Predictive Coding). In practice the synthesis uses the quantized version, 1/Â(z), of the filter 1/A(z). The source, that is to say the excitation of the predictive linear filter 1/Â(z), is in general the combination of an excitation obtained by long-term prediction, which models the vibration of the vocal cords, and of a stochastic excitation (or innovation) described in the form of algebraic codes (ACELP), of noise dictionaries, etc. The search for the “optimal” excitation is carried out by minimization of a quadratic error criterion in the domain of the signal weighted by a filter with transfer function W(z), in general derived from the linear prediction filter A(z) and of the form W(z)=A(z/γ1)/A(z/γ2). It will be noted that numerous variants of the CELP model have been proposed; the example of the CELP coding of the ITU-T G.718 standard will be retained here, in which two LPC filters are quantized per frame and the LPC excitation is coded as a function of a classification, with modes adapted for voiced, unvoiced, transient sounds, etc. Moreover, alternatives to CELP coding have also been proposed, including the BV16, BV32, iLBC or SILK coders, which are still based on linear prediction. In general, predictive coding, including CELP coding, operates at limited sampling frequencies (≤16 kHz) for historical and other reasons (wide band linear prediction limits, algorithmic complexity for high frequencies, etc.). Thus, to operate with frequencies of typically 16 to 48 kHz, resampling operations (by FIR filter, filter banks or IIR filter) are also used, optionally with a separate coding for the high band which may be a parametric band extension. These resampling and high band coding operations are not reviewed here.
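By way of illustration of the weighting filter W(z)=A(z/γ1)/A(z/γ2) and of the synthesis filter 1/Â(z) recalled above, the following Python sketch may be considered. It is a minimal sketch and not the implementation of any particular standard: the function names, the convention A(z)=1+Σ a[k]·z^-k with a[0]=1, and the default values γ1=0.9 and γ2=0.6 are assumptions chosen for illustration only.

```python
def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma): the k-th coefficient a[k] is scaled by gamma**k."""
    return [ak * gamma ** k for k, ak in enumerate(a)]


def pole_zero_filter(b, a, x):
    """Filter x through B(z)/A(z) with zero initial state (direct form)."""
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc / a[0])
    return y


def perceptual_weighting(a, x, gamma1=0.9, gamma2=0.6):
    """Apply W(z) = A(z/gamma1) / A(z/gamma2); the gamma defaults are illustrative."""
    numerator = bandwidth_expand(a, gamma1)
    denominator = bandwidth_expand(a, gamma2)
    return pole_zero_filter(numerator, denominator, x)


def lpc_synthesis(a_hat, excitation):
    """Apply the all-pole synthesis filter 1/A^(z) to an excitation signal."""
    return pole_zero_filter([1.0], a_hat, excitation)
```

For the first-order example A(z)=1-0.9·z^-1, the call lpc_synthesis([1.0, -0.9], [1.0, 0.0, 0.0]) returns the first samples of the impulse response, approximately 1, 0.9 and 0.81. When γ1=γ2 the numerator and denominator coincide and W(z) reduces to the identity.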
On the other hand, MDCT transformation coding is divided into three steps at the coder:
        1. Weighting of the signal by a window, called here the “MDCT window”, over a length corresponding to 2 blocks
        2. Temporal aliasing (or “time-domain aliasing”) to form a reduced block (of length divided by 2)
        3. DCT-IV (“Discrete Cosine Transform”) transformation of the reduced block.
It will be noted that there exist calculation variants of the TDAC transformation type which can use, for example, a Fourier transform (FFT) instead of a DCT transform.
The MDCT window is in general divided into 4 adjacent portions of equal lengths called “quarters”.
The signal is multiplied by the analysis window and then the aliasings are performed: the first quarter (windowed) is aliased (that is to say reversed in time and overlapped) on the second and the fourth quarter is aliased on the third.
More precisely, the aliasing of one quarter on another is performed in the following manner: The first sample of the first quarter is added to (or subtracted from) the last sample of the second quarter, the second sample of the first quarter is added to (or subtracted from) the last-but-one sample of the second quarter, and so on and so forth until the last sample of the first quarter which is added to (or subtracted from) the first sample of the second quarter.
Therefore, 2 aliased quarters are obtained from the 4 quarters, where each sample is the result of a linear combination of 2 samples of the signal to be coded. This linear combination is called temporal aliasing. It will be noted that temporal aliasing corresponds to mixing two temporal segments, and that the relative level of the two temporal segments in each “aliased quarter” is dependent on the analysis/synthesis windows.
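The windowing, aliasing and DCT-IV steps described above can be sketched as follows in Python. This is a minimal sketch under stated assumptions, not the implementation of a particular codec: the sine window, the choice of signs (the reversed first quarter is subtracted from the second, the reversed fourth quarter is added to the third), the reordering applied before the DCT-IV and all function names are illustrative choices.

```python
import math


def sine_window(length):
    """Sine analysis window (an assumed, commonly used MDCT window)."""
    return [math.sin(math.pi * (n + 0.5) / length) for n in range(length)]


def alias_quarters(windowed):
    """Fold 4 quarters (a, b, c, d) into 2 aliased quarters.

    Assumed sign convention: second = b - reversed(a), third = c + reversed(d).
    """
    q = len(windowed) // 4
    a, b, c, d = (windowed[i * q:(i + 1) * q] for i in range(4))
    second = [b[i] - a[q - 1 - i] for i in range(q)]
    third = [c[i] + d[q - 1 - i] for i in range(q)]
    return second, third


def dct_iv(u):
    """Unnormalized DCT-IV of the reduced block."""
    size = len(u)
    return [
        sum(u[n] * math.cos(math.pi / size * (n + 0.5) * (k + 0.5)) for n in range(size))
        for k in range(size)
    ]


def mdct_frame(frame):
    """Window a frame of 2N samples, alias it to N samples, then apply the DCT-IV."""
    window = sine_window(len(frame))
    second, third = alias_quarters([x * w for x, w in zip(frame, window)])
    # Reorder and negate so that the DCT-IV output matches the usual MDCT definition.
    reduced = [-v for v in reversed(third)] + [-v for v in reversed(second)]
    return dct_iv(reduced)
```

With the unwindowed toy input [1, 2, 3, 4, 5, 6, 7, 8], alias_quarters returns ([1, 3], [13, 13]): each output sample combines exactly 2 input samples, and 8 input samples yield 4 aliased samples.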
These 2 aliased quarters are thereafter coded jointly after DCT transformation. For the following frame there is a shift of half a window (i.e. 50% overlap): the third and fourth quarters of the previous frame become the first and second quarters of the current frame. After aliasing, a second linear combination of the same pairs of samples as in the previous frame is transmitted, but with different weights.
At the decoder, after inverse DCT transformation, the decoded version of these aliased signals is therefore obtained. Two consecutive frames contain the result of 2 different aliasings of the same 2 quarters; that is to say, for each pair of samples we have the result of 2 linear combinations with different but known weights. A system of equations is therefore solved to obtain the decoded version of the input signal, and the temporal aliasing can thus be cancelled by using 2 consecutive decoded frames.
The systems of equations mentioned are in general solved by de-aliasing, multiplication by a judiciously chosen synthesis window and then overlap-add of the common parts. This overlap-add also ensures the gentle transition (without discontinuity due to quantization errors) between 2 consecutive decoded frames; indeed, this operation behaves like a crossfade. When the window for the first quarter or fourth quarter is at zero for each sample, one speaks of an MDCT transformation without temporal aliasing in this part of the window. In this case the gentle transition is not ensured by the MDCT transformation and must be obtained by other means, such as an exterior crossfade.
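The cancellation of the temporal aliasing by overlap-add of 2 consecutive decoded frames can be checked numerically with the following Python sketch. It is a minimal sketch under stated assumptions: it uses the closed-form MDCT and inverse MDCT expressions rather than an explicit aliasing step, the same sine window at analysis and synthesis (whose squared halves sum to 1), no quantization, and hypothetical function names.

```python
import math


def mdct(frame, window):
    """MDCT of a windowed frame: 2N samples in, N coefficients out (critical sampling)."""
    half = len(frame) // 2
    return [
        sum(window[n] * frame[n]
            * math.cos(math.pi / half * (n + 0.5 + half / 2) * (k + 0.5))
            for n in range(2 * half))
        for k in range(half)
    ]


def imdct(coeffs, window):
    """Inverse MDCT followed by synthesis windowing; the output is still aliased."""
    half = len(coeffs)
    return [
        window[n] * (2.0 / half)
        * sum(coeffs[k] * math.cos(math.pi / half * (n + 0.5 + half / 2) * (k + 0.5))
              for k in range(half))
        for n in range(2 * half)
    ]


def analysis_synthesis(signal, half):
    """Code frames of 2*half samples with a hop of `half`, then overlap-add them."""
    window = [math.sin(math.pi * (n + 0.5) / (2 * half)) for n in range(2 * half)]
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * half + 1, half):
        decoded = imdct(mdct(signal[start:start + 2 * half], window), window)
        for n, value in enumerate(decoded):
            out[start + n] += value  # overlap-add of the common parts
    return out
```

In this sketch, samples covered by 2 consecutive frames are reconstructed to within floating-point precision, whereas the first and last half-frames, covered by a single frame, remain aliased. Each hop of N new samples produces N coefficients, which is the critical-sampling property.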
Transform coding (including coding of MDCT type) can in theory easily be adapted to various input and output sampling frequencies, as illustrated by the combined implementation in Annex C of G.722.1 including the G.722.1 coding. However, it is also possible to use transform coding with pre/post-processing operations with resampling (by FIR filter, filter banks or IIR filter), optionally with a separate coding of the high band which may be a parametric band extension. These resampling and high band coding operations are not reviewed here, but the 3GPP e-AAC+ coder gives an exemplary embodiment of such a combination (resampling, low band transform coding and band extension).
It should be noted that the acoustic band coded by the various modes (temporal LPD mode based on linear prediction, frequency-domain FD mode based on a transform) can vary according to the mode selected and the bitrate. Moreover, the mode decision may be carried out in open-loop for each frame, that is to say that the decision is taken a priori as a function of the data and of the observations available, or in closed-loop as in AMR-WB+ coding.
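The difference between the open-loop and closed-loop mode decisions can be sketched as follows in Python. This is a hypothetical illustration and not the decision logic of AMR-WB+ or USAC: the function names are invented, the candidate coders and the classifier are supplied by the caller, and a plain signal-to-noise ratio stands in for whatever criterion a real codec would use.

```python
import math


def snr_db(reference, decoded):
    """Signal-to-noise ratio in dB between a frame and its decoded version."""
    noise = sum((r - d) ** 2 for r, d in zip(reference, decoded))
    power = sum(r * r for r in reference)
    if noise == 0.0:
        return float("inf")
    if power == 0.0:
        return float("-inf")
    return 10.0 * math.log10(power / noise)


def closed_loop_decision(frame, codecs):
    """Code and locally decode the frame with every mode, then keep the best SNR.

    `codecs` maps a mode name to a function returning the decoded frame.
    """
    scores = {mode: snr_db(frame, codec(frame)) for mode, codec in codecs.items()}
    return max(scores, key=scores.get)


def open_loop_decision(frame, classifier):
    """Choose the mode a priori from the input frame, without running any coder."""
    return classifier(frame)
```

The closed-loop variant runs every candidate coder on each frame, which costs more computation than the open-loop variant but bases the choice on the actual coding result.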
In codecs using at least two modes of coding, the transitions between LPD and FD modes are important in ensuring sufficient quality with no switching defect, knowing that the FD and LPD modes are of different kinds: one relies on a transform coding in the frequency domain of the signal, while the other uses a (temporal) predictive linear coding with filter memories which are updated at each frame. An example of managing the inter-mode switchings corresponding to the USAC RM0 codec is detailed in the article by J. Lecomte et al., “Efficient cross-fade windows for transitions between LPC-based and non-LPC based audio coding”, 7-10 May 2009, 126th AES Convention. As explained in this article, the main difficulty resides in the transitions from LPD to FD modes and vice versa.
To deal with the problem of transition from a core of FD type to a core of LPD type, the patent application published under the number WO2013/016262 (illustrated in FIG. 1) proposes to update the memories of the filters of the codec of LPD type (130) coding the frame m+1 by using the synthesis of the coder and of the decoder of FD type (140) coding the frame m, the updating of the memories being necessary solely during the coding of the frames of FD type. This technique thus makes it possible, during selection at 110 of the mode of coding and toggling (at 150) of the coding from FD to LPD type, to do so without transition defect (artifacts), since when coding the frame with the LPD technique, the memories (or states) of the CELP (LPD) coder have already been updated by the generator 160 on the basis of the reconstructed signal Ŝa(n) of the frame m. In the case where the two cores (FD and LPD) do not operate at the same sampling frequency, the technique described in patent application WO2013/016262 proposes a step of resampling the memories of the coder of FD type.
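The notion of a filter memory that must be up to date before the first LPD frame can be illustrated by the following Python sketch. It is a simplified, hypothetical illustration and not the method of WO2013/016262: only the memory of the synthesis filter 1/Â(z) is considered (a CELP coder holds other states as well), both cores are assumed to run at the same sampling frequency so that no resampling is needed, and the function names are invented.

```python
def synthesize_frame(a_hat, excitation, memory):
    """Run the all-pole filter 1/A^(z) over one frame.

    `memory` holds the last p output samples (oldest first), where
    p = len(a_hat) - 1 is assumed to be at least 1 and a_hat[0] == 1.
    Returns the frame output and the updated memory.
    """
    p = len(a_hat) - 1
    history = list(memory)
    out = []
    for e in excitation:
        y = e - sum(a_hat[k] * history[-k] for k in range(1, p + 1))
        history.append(y)
        out.append(y)
    return out, history[-p:]


def memory_from_fd_frame(fd_decoded, order):
    """Hypothetical update: keep the last `order` samples decoded in FD frame m."""
    return list(fd_decoded[-order:])
```

Filtering two consecutive frames while carrying the memory from one to the next gives the same output as filtering the concatenated excitation in one pass. Starting frame m+1 from a memory that does not correspond to the signal of frame m would, by contrast, introduce a discontinuity at the frame boundary.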
The drawback of this technique is, on the one hand, that it makes it necessary to have access to the decoded signal at the coder and therefore to force a local synthesis in the coder. On the other hand, it makes it necessary to carry out operations of updating the memories of the filters (possibly comprising a resampling step) during the coding and decoding of FD type, as well as a set of operations amounting to carrying out an analysis/coding of CELP type in the previous frame of FD type. These operations may be complex and are added to the conventional operations of coding/decoding in the transition frame of LPD type, thereby causing a “multi-mode” coding complexity spike.
A need therefore exists for an effective transition between a transform coding or decoding and a predictive coding or decoding which does not require an increase in the complexity of the coders or decoders provided for conversational applications of audio coding exhibiting alternations of speech and of music.