In order to code speech sounds efficiently, techniques of the CELP (Code Excited Linear Prediction) type are recommended. In order to code musical sounds efficiently, transform coding techniques are preferred.
Encoders of the CELP type are predictive encoders. Their purpose is to model the production of speech based on various elements: a short-term linear prediction for modeling the vocal tract, a long-term prediction for modeling the vibration of the vocal cords during voiced periods, and an excitation derived from a fixed dictionary (white noise, algebraic excitation) in order to represent the “innovation” that could not be modeled.
The transform encoders that are most widely used (the MPEG AAC or ITU-T G.722.1 Annex C encoder for example) use critical sampling transforms in order to compact the signal in the transform domain. A “critical sampling transform” is a transform for which the number of coefficients in the transform domain is equal to the number of temporal samples analyzed.
One solution for effectively coding a signal containing these two types of content consists in selecting the best technique over time. This solution has notably been recommended by the 3GPP (3rd Generation Partnership Project) standardization organization, and a technique called AMR-WB+ has been proposed.
This technique is based on a CELP technology of the AMR-WB type, more specifically of the ACELP (for “Algebraic Code Excited Linear Prediction”) type, and transform coding based on an overlap Fourier transform in a model of the TCX (for “Transform Coded eXcitation”) type.
The ACELP coding and the TCX coding are both techniques of the linear predictive type. It should be noted that the AMR-WB+ codec has been developed for the 3GPP PSS (for “Packet Switched Streaming”), MBMS (for “Multimedia Broadcast/Multicast Service”) and MMS (for “Multimedia Messaging Service”) services, in other words for broadcasting and storage services with no strong constraints on the algorithmic delay.
This solution suffers from insufficient quality for music. This insufficiency comes particularly from the transform coding. In particular, the overlap Fourier transform is not a critical sampling transform, and it is therefore suboptimal.
Moreover, the windows used in this encoder are not optimal with respect to the concentration of energy: the frequency shapes of these virtually rectangular windows are suboptimal.
An improvement of the AMR-WB+ coding combined with the principles of MPEG AAC (for “Advanced Audio Coding”) coding is given by the MPEG USAC (for “Unified Speech Audio Coding”) codec which is still being developed at the ISO/MPEG. The applications targeted by MPEG USAC are not conversational, but correspond to broadcasting and storage services with no strong constraints on the algorithmic delay.
The initial version of the USAC codec, called RM0 (Reference Model 0), is described in the article by M. Neuendorf et al., “A Novel Scheme for Low Bitrate Unified Speech and Audio Coding – MPEG RM0”, 7-10 May 2009, 126th AES Convention. This RM0 codec alternates between several coding modes:
- For the signals of the speech type: LPD (for “Linear Predictive Domain”) modes comprising two different modes derived from AMR-WB+ coding:
    - An ACELP mode
    - A TCX mode called wLPT (for “weighted Linear Predictive Transform”) using a transform of the MDCT type (unlike the AMR-WB+ codec).
- For the signals of the music type: FD (for “Frequency Domain”) mode using MDCT (for “Modified Discrete Cosine Transform”) transform coding of the MPEG AAC (for “Advanced Audio Coding”) type on 1024 samples.
Compared with the AMR-WB+ codec, the main improvements provided by the USAC RM0 coding for the mono part are the use of a critical sampling transform of the MDCT type for the transform coding and the quantization of the MDCT spectrum by scalar quantization with arithmetic coding. It should be noted that the acoustic band coded by the various modes (LPD, FD) depends on the selected mode, which is not the case in the AMR-WB+ codec where the ACELP and TCX modes operate at the same internal sampling frequency. Moreover, the mode decision in the USAC RM0 codec is made in open loop for each frame of 1024 samples. Note that a closed-loop decision is made by executing the various coding modes in parallel and by choosing a posteriori the mode that gives the best result according to a predefined criterion. In the case of an open-loop decision, the decision is taken a priori as a function of the data and of the observations available, but without testing whether this decision is optimal or not.
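The difference between the two decision strategies can be illustrated by the following minimal sketch. It is not taken from the USAC RM0 codec: the two uniform quantizers standing in for coding modes, the squared-error criterion, and the names `quantize`, `spread_classifier`, `closed_loop_decision` and `open_loop_decision` are all hypothetical and chosen only for illustration.

```python
def quantize(frame, step):
    """Toy stand-in for a coding mode: uniform quantization with a given step."""
    return [step * round(x / step) for x in frame]

def squared_error(frame, decoded):
    """Assumed 'predefined criterion': sum of squared reconstruction errors."""
    return sum((x - y) ** 2 for x, y in zip(frame, decoded))

def closed_loop_decision(frame, coders, criterion=squared_error):
    """Run every mode, then choose a posteriori the one with the lowest cost."""
    costs = {name: criterion(frame, coder(frame)) for name, coder in coders.items()}
    return min(costs, key=costs.get)

def open_loop_decision(frame, classifier):
    """Choose a priori from observations on the frame; no mode is executed."""
    return classifier(frame)

def spread_classifier(frame):
    """Hypothetical a priori rule based on the spread of the sample values."""
    return "fine" if max(frame) - min(frame) < 1.0 else "coarse"

# Hypothetical modes and input frame.
coders = {
    "coarse": lambda f: quantize(f, 1.0),
    "fine": lambda f: quantize(f, 0.25),
}
frame = [0.3, 1.2, 2.6]
closed_choice = closed_loop_decision(frame, coders)
open_choice = open_loop_decision(frame, spread_classifier)
```

On this toy frame the closed-loop decision selects the "fine" mode, while the open-loop rule selects "coarse": the open-loop choice costs a single classification instead of one encoding per mode, but nothing verifies that it is the best choice.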
In the USAC codec, the transitions between LPD and FD modes are crucial for ensuring sufficient quality without switching artifacts, knowing that each mode (ACELP, TCX, FD) has a specific “signature” (in terms of artifacts) and that the FD and LPD modes are of different kinds: the FD mode is based on transform coding in the signal domain, while the LPD modes use linear predictive coding in the perceptually weighted domain, with filter memories that must be managed correctly. The management of intermode switching in the USAC RM0 codec is explained in detail in the article by J. Lecomte et al., “Efficient cross-fade windows for transitions between LPC-based and non-LPC based audio coding”, 7-10 May 2009, 126th AES Convention. As explained in this article, the main difficulty lies in the transitions from LPD to FD modes and vice versa. Only the case of the transitions from ACELP to FD is considered here.
In order to fully understand the operation, the principle of MDCT transform coding is recalled below through a typical exemplary embodiment.
At the encoder, the MDCT transformation is divided into three steps:
- Weighting of the signal by a window, called in this instance the “MDCT window”, with a length of 2M
- Time-domain aliasing in order to form a block of length M
- DCT (for “Discrete Cosine Transform”) transformation of length M.
The MDCT window is divided into 4 adjacent portions of equal length M/2, called “quarts”.
The signal is multiplied by the analysis window and then the aliasings are carried out: the first quart (windowed) is aliased (that is to say inverted in time and made to overlap) on the second quart and the fourth quart is aliased on the third.
More precisely, the aliasing of one quart on another is carried out in the following manner: the first sample of the first quart is added to (or subtracted from) the last sample of the second quart, the second sample of the first quart is added to (or subtracted from) the penultimate sample of the second quart, and so on to the last sample of the first quart which is added to (or subtracted from) the first sample of the second quart.
This therefore gives, on the basis of 4 quarts, 2 aliased quarts in which each sample is the result of a linear combination of 2 samples of the signal to be coded. This linear combination is called time-domain aliasing.
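The three encoder steps and the quart-wise aliasing described above can be sketched as follows. This is an illustrative sketch only, under stated assumptions: a sine window, a DCT of type IV, and one particular choice of signs (first quart subtracted, fourth quart added); the function names are hypothetical. The function `mdct_direct` evaluates the usual MDCT defining sum and serves only as a cross-check that windowing, aliasing and DCT together give the same coefficients.

```python
import math

def sine_window(m):
    """Sine window of length 2M (an assumed, commonly used MDCT window)."""
    return [math.sin(math.pi * (n + 0.5) / (2 * m)) for n in range(2 * m)]

def fold(xw):
    """Step 2: alias 4 windowed quarts (length M/2 each) into a block of length M."""
    q = len(xw) // 4
    a, b, c, d = xw[:q], xw[q:2 * q], xw[2 * q:3 * q], xw[3 * q:]
    # First quart, inverted in time, subtracted from the second quart.
    left = [b[i] - a[q - 1 - i] for i in range(q)]
    # Fourth quart, inverted in time, added to the third quart.
    right = [c[i] + d[q - 1 - i] for i in range(q)]
    return left + right

def dct_iv(v):
    """Step 3: DCT of type IV and length M (unnormalized)."""
    m = len(v)
    return [sum(v[n] * math.cos(math.pi / m * (n + 0.5) * (k + 0.5)) for n in range(m))
            for k in range(m)]

def mdct_three_steps(block, window):
    """MDCT of a 2M-sample block as windowing, aliasing, then DCT."""
    xw = [x * w for x, w in zip(block, window)]   # step 1: weighting by the window
    u = fold(xw)                                  # step 2: time-domain aliasing
    v = [-s for s in reversed(u)]                 # reordering/sign for this DCT-IV convention
    return dct_iv(v)                              # step 3: DCT of length M

def mdct_direct(block, window):
    """Reference: MDCT computed directly from its defining sum."""
    m = len(block) // 2
    return [sum(window[n] * block[n]
                * math.cos(math.pi / m * (n + 0.5 + m / 2) * (k + 0.5))
                for n in range(2 * m))
            for k in range(m)]

M = 8
block = [math.sin(0.3 * n) + 0.1 * n for n in range(2 * M)]
window = sine_window(M)
coeffs = mdct_three_steps(block, window)
reference = mdct_direct(block, window)
```

The block of 2M = 16 samples yields M = 8 coefficients, and `coeffs` agrees with `reference` to within rounding error. Other sign and ordering conventions for the aliasing lead to equivalent transforms.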
These 2 aliased quarts are then coded jointly after DCT transformation. For the following frame, the window is shifted by half its length (50% overlap), so that the third and fourth quarts of the preceding frame become the first and second quarts of the current frame. After aliasing, a second linear combination of the same pairs of samples is sent as in the preceding frame, but with different weights.
At the decoder, after inverse DCT transformation, the decoded version of these aliased signals is obtained. Two consecutive frames contain the result of 2 different aliasings of the same quarts, that is to say for each pair of samples there is the result of 2 linear combinations with different but known weights: a system of equations is therefore solved in order to obtain the decoded version of the input signal; the time-domain aliasing can therefore be removed by using 2 consecutive decoded frames.
The systems of equations mentioned are usually solved by anti-aliasing (unfolding), multiplication by a carefully chosen synthesis window and then addition-overlapping (overlap-add) of the common parts. This addition-overlapping at the same time provides the smooth transition (without discontinuity due to the quantization errors) between 2 consecutive decoded frames; specifically, this operation behaves like a cross-fade. When the window for the first quart or the fourth quart is at zero for each sample, the MDCT transformation is said to be without time-domain aliasing in this part of the window. In this case, the smooth transition is not ensured by the MDCT transformation; it must be carried out by other means such as for example an external cross-fade.
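The cancellation of the time-domain aliasing by addition-overlapping of 2 consecutive decoded frames can be checked numerically with the minimal sketch below. It is given under assumptions that are not part of the description above: no quantization, a sine window used for both analysis and synthesis (it satisfies w[n]² + w[n+M]² = 1), a scale factor of 2/M on the inverse transform, and the MDCT written directly as its defining sum rather than as aliasing followed by a DCT. The function names are hypothetical.

```python
import math

def sine_window(m):
    """Sine window of length 2M, used here for both analysis and synthesis."""
    return [math.sin(math.pi * (n + 0.5) / (2 * m)) for n in range(2 * m)]

def mdct(block, w):
    """Windowed MDCT: 2M input samples -> M coefficients."""
    m = len(block) // 2
    return [sum(w[n] * block[n]
                * math.cos(math.pi / m * (n + 0.5 + m / 2) * (k + 0.5))
                for n in range(2 * m))
            for k in range(m)]

def imdct(coeffs, w):
    """Inverse MDCT followed by the synthesis window: M coefficients -> 2M samples.
    The output still contains time-domain aliasing."""
    m = len(coeffs)
    return [w[n] * (2.0 / m)
            * sum(coeffs[k] * math.cos(math.pi / m * (n + 0.5 + m / 2) * (k + 0.5))
                  for k in range(m))
            for n in range(2 * m)]

def analysis_synthesis(signal, m):
    """Code frames of 2M samples with a hop of M and overlap-add the decoded frames."""
    w = sine_window(m)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * m + 1, m):
        decoded = imdct(mdct(signal[start:start + 2 * m], w), w)
        for n in range(2 * m):
            out[start + n] += decoded[n]   # addition-overlapping of the common parts
    return out

M = 4
signal = [math.sin(0.7 * n) + 0.05 * n for n in range(4 * M)]
out = analysis_synthesis(signal, M)
# One decoded frame taken alone, before any overlap-add.
single = imdct(mdct(signal[:2 * M], sine_window(M)), sine_window(M))
```

Samples M to 3M-1 are covered by 2 consecutive frames and are recovered to within rounding error, whereas `single`, and likewise the first and last M samples of `out`, which are covered by one frame only, remain aliased.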
It should be noted that variant embodiments of the MDCT transformation exist, in particular concerning the definition of the DCT transform and the way in which the block to be transformed is aliased in the time domain (for example, it is possible to invert the signs applied to the aliased quarts to the left and the right, or to alias the second and third quarts on respectively the first and fourth quarts), etc. These variants do not change the principle of the MDCT analysis-synthesis with the reduction of the block of samples by windowing, time-domain aliasing and then transformation and finally windowing, aliasing and addition-overlapping.
In the case of the USAC RM0 encoder described in the article by Lecomte et al., the transition between a frame coded by ACELP coding and a frame coded by FD coding takes place in the following manner:
A transition window for the FD mode is used with an overlap to the left of 128 samples, as illustrated in FIG. 1. The time-domain aliasing on this overlap zone is canceled out by introducing an “artificial” time-domain aliasing on the right of the reconstructed ACELP frame. The MDCT window used for the transition has a size of 2304 samples and the DCT transformation operates on 1152 samples while normally the frames of the FD mode are coded with a window with a size of 2048 samples and a DCT transformation of 1024 samples. Thus the MDCT transformation of the normal FD mode cannot be directly used for the transition window; the encoder must also incorporate a modified version of this transformation which complicates the implementation of the transition for the FD mode.
These coding techniques of the prior art, AMR-WB+ or USAC, have algorithmic delays of the order of 100 to 200 ms. These delays are incompatible with conversational applications, for which the coding delay is usually of the order of 20-25 ms for the speech encoders for mobile applications (e.g. GSM EFR, 3GPP AMR and AMR-WB) and of the order of 40 ms for the conversational transform encoders for videoconferencing (e.g. ITU-T G.722.1 Annex C and G.719).
There is therefore a need for coding that alternates between predictive and transform coding techniques, for applications that code sounds in which speech and music alternate, with good coding quality for both speech and music and an algorithmic delay that is compatible with conversational applications, typically of the order of 20 to 40 ms for frames of 20 ms.