The present invention relates to the field of encoding/decoding digital signals, in particular for frame loss correction.
The invention advantageously applies to the encoding/decoding of sounds that may contain alternating or combined speech and music.
To code low bit-rate speech effectively, CELP (“Code Excited Linear Prediction”) techniques are recommended. To code music effectively, transform coding techniques are recommended.
CELP encoders are predictive coders. Their aim is to model speech production using various elements: short-term linear prediction to model the vocal tract, long-term prediction to model the vibration of vocal cords during voiced periods, and an excitation derived from a fixed codebook (white noise, algebraic excitation) to represent “innovation” that could not be modeled.
Transform coders such as MPEG AAC, AAC-LD, AAC-ELD or ITU-T G.722.1 Annex C use critically sampled transforms to compress the signal in the transform domain. The term “critically sampled transform” is used to refer to a transform for which the number of coefficients in the transform domain equals the number of time domain samples in each analyzed frame.
One solution for effective coding of a signal containing combined speech/music is to select the best technique over time between at least two coding modes: one of the CELP type, the other of the transform type.
This is the case, for example, for the 3GPP AMR-WB+ and MPEG USAC ("Unified Speech Audio Coding") codecs. The target applications for AMR-WB+ and USAC are not conversational but correspond to distribution and storage services, without severe constraints on algorithmic delay.
The initial version of the USAC codec, called RM0 (Reference Model 0), is described in the article by M. Neuendorf et al., "A Novel Scheme for Low Bitrate Unified Speech and Audio Coding—MPEG RM0", 7-10 May 2009, 126th AES Convention. This RM0 codec alternates between multiple coding modes:
- For speech signals, LPD ("Linear Predictive Domain") modes, comprising two different modes derived from AMR-WB+ coding:
  - an ACELP mode;
  - a TCX ("Transform Coded Excitation") mode called wLPT ("weighted Linear Predictive Transform"), using an MDCT transform (unlike the AMR-WB+ codec, which uses an FFT transform).
- For music signals, an FD ("Frequency Domain") mode using MDCT ("Modified Discrete Cosine Transform") coding of the MPEG AAC ("Advanced Audio Coding") type over 1024 samples.
In the USAC codec, the transitions between LPD and FD modes are crucial to ensuring sufficient quality without switching artifacts, knowing that each mode (ACELP, TCX, FD) has a specific "signature" (in terms of artifacts) and that the FD and LPD modes are of different types: FD mode is based on transform coding in the signal domain, while LPD modes use linear predictive coding in the perceptually weighted domain, with filter memories that must be properly managed. Management of the switching between modes in the USAC RM0 codec is detailed in the article by J. Lecomte et al., "Efficient cross-fade windows for transitions between LPC-based and non-LPC based audio coding", 7-10 May 2009, 126th AES Convention. As explained in that article, the main difficulty lies in the transitions from LPD to FD modes and vice versa. Here we discuss only the case of transitions from ACELP to FD.
To aid understanding of what follows, we review the principle of MDCT transform coding using a typical example of its implementation.
In the encoder, an MDCT transformation is typically divided into three steps, the signal being subdivided into frames of M samples before MDCT coding:
- weighting of the signal by a window, referred to here as the "MDCT window", of length 2M;
- folding in the time domain ("time-domain aliasing") to form a block of length M;
- DCT ("Discrete Cosine Transform") transformation of length M.
The MDCT window is divided into four adjacent portions of equal lengths M/2, here called “quarters”.
The signal is multiplied by the analysis window, then the time-domain aliasing is carried out: the first quarter (windowed) is folded (in other words time-reversed and overlapped) over the second quarter and the fourth quarter is folded over the third.
More specifically, the time-domain aliasing of one quarter over another is done in the following manner: the first sample of the first quarter is added to (or subtracted from) the last sample of the second quarter, the second sample of the first quarter is added to (or subtracted from) the next-to-last sample of the second quarter, and so on, until the last sample of the first quarter, which is added to (or subtracted from) the first sample of the second quarter.
From four quarters we thus obtain two lapped quarters where each sample is the result of a linear combination of two samples of the signal to be encoded. This linear combination induces a time-domain aliasing.
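By way of illustration, the folding described above can be sketched in a few lines of Python. This is a sketch only: the signs applied to the folded quarters are one possible convention among the variants mentioned further below.

```python
import numpy as np

M = 8  # frame length (illustrative)
xw = np.arange(2 * M, dtype=float)      # a windowed block of length 2M (stand-in values)
q1, q2, q3, q4 = xw.reshape(4, M // 2)  # the four "quarters", each of length M/2

# Fold the time-reversed first quarter onto the second quarter, and the
# time-reversed fourth quarter onto the third; the signs chosen here are
# one possible convention and depend on the MDCT variant.
left = q2 - q1[::-1]
right = q3 + q4[::-1]
folded = np.concatenate([left, right])  # block of length M, ready for the DCT
```

Each folded sample is a linear combination of two samples of the windowed input, which is precisely the time-domain aliasing described above.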
The two lapped quarters are then jointly encoded after DCT transformation (type IV). For the next frame, the third and fourth quarters of the preceding frame are then shifted by half a window (50% overlap) to then become the first and second quarters of the current frame. After lapping, a second linear combination of the same pairs of samples as in the preceding frame is sent, but with different weights.
In the decoder, after inverse DCT transformation we obtain the decoded version of these lapped signals. Two consecutive frames contain the result of two different overlaps of the same quarters, meaning that for each pair of samples we have the result of two linear combinations with different but known weights: a system of equations is thus solved to obtain the decoded version of the input signal, and the time-domain aliasing can thus be eliminated by the use of two consecutive decoded frames.
Solving the abovementioned equation systems can generally be done implicitly by undoing the folding, multiplying by a judiciously chosen synthesis window, then overlap-adding the common parts. This overlap-add also ensures a smooth transition (without discontinuities due to quantization errors) between two consecutive decoded frames, effectively acting as a cross-fade. When the window for the first quarter or the fourth quarter is at zero for each sample, we have an MDCT transformation without time-domain aliasing in that portion of the window. In such case, a smooth transition is not provided by the MDCT transformation and must be done by other means, for example an external cross-fade.
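The complete analysis-synthesis chain can be sketched as follows in Python, using a direct (matrix) implementation of one common MDCT convention with a sine window. This is an illustrative sketch, not the implementation of any particular codec; it serves only to show the perfect-reconstruction property obtained by overlap-adding two consecutive decoded frames.

```python
import numpy as np

def mdct_block(x2m, w):
    """Forward MDCT of one windowed 2M-sample block -> M coefficients."""
    M = len(x2m) // 2
    n = np.arange(2 * M)
    k = np.arange(M)
    C = np.cos(np.pi / M * (n[None, :] + 0.5 + M / 2) * (k[:, None] + 0.5))
    return C @ (w * x2m)

def imdct_block(X, w):
    """Inverse MDCT -> 2M aliased samples, synthesis-windowed."""
    M = len(X)
    n = np.arange(2 * M)
    k = np.arange(M)
    C = np.cos(np.pi / M * (n[None, :] + 0.5 + M / 2) * (k[:, None] + 0.5))
    return w * ((2.0 / M) * (C.T @ X))

M = 32
# Sine window: satisfies the Princen-Bradley condition w[n]^2 + w[n+M]^2 = 1.
w = np.sin(np.pi / (2 * M) * (np.arange(2 * M) + 0.5))
rng = np.random.default_rng(0)
x = rng.standard_normal(6 * M)

# Analysis/synthesis with 50% overlap, then overlap-add of decoded blocks.
y = np.zeros_like(x)
for t in range(0, len(x) - 2 * M + 1, M):
    y[t:t + 2 * M] += imdct_block(mdct_block(x[t:t + 2 * M], w), w)

# Perfect reconstruction holds wherever two consecutive blocks overlap
# (the first and last M samples are covered by only one block).
err = np.max(np.abs(y[M:-M] - x[M:-M]))
```

The overlap-add of two consecutive inverse-transformed blocks cancels the time-domain aliasing exactly, as described above.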
It should be noted that variant implementations of the MDCT transformation exist, in particular concerning the definition of the DCT transform, the manner of folding the block to be transformed (for example, one can reverse the signs applied to the folded quarters on the left and right, or fold the second and third quarters respectively over the first and fourth quarters), etc. These variants do not change the principle of MDCT analysis-synthesis with reduction of the sample block by windowing, time-domain aliasing, then transformation and finally windowing, folding, and overlap-add.
To avoid artifacts at the transitions between CELP coding and MDCT coding, international patent application WO2012/085451, which is hereby incorporated by reference in the present application, provides a method for coding a transition frame. The transition frame is defined as a current frame encoded by transform coding that succeeds a preceding frame encoded by predictive coding. According to said method, a portion of the transition frame (for example a sub-frame of 5 ms in the case of core CELP coding at 12.8 kHz, or two additional CELP sub-frames of 4 ms each in the case of core CELP coding at 16 kHz) is encoded by a predictive coding that is more limited than the predictive coding of the preceding frame.
Limited predictive coding consists of using the stable parameters of the preceding frame encoded by predictive coding, for example the coefficients of the linear prediction filter, and coding only a few minimal parameters for the additional sub-frame in the transition frame.
As the preceding frame was not encoded with transform coding, it is impossible to undo the time-domain aliasing in the first part of the frame. The patent application WO2012/085451 cited above further proposes modifying the first half of the MDCT window to have no time-domain aliasing in the normally-folded first quarter. It also proposes integrating a portion of the overlap-add (also called "cross-fade") between the decoded CELP frame and the decoded MDCT frame while changing the coefficients of the analysis/synthesis window. Referring to FIG. 4e of said patent application, the broken lines (alternating dots and dashes) correspond to the folding lines of the MDCT encoding (top figure) and to the unfolding lines of the MDCT decoding (bottom figure). In the upper figure, bold lines separate the frames of new samples entering the encoder. The encoding of a new MDCT frame can begin once such a frame of new input samples is completely available. It is important to note that these bold lines in the encoder do not correspond to the current frame but to the block of new incoming samples for each frame: the current frame is actually delayed by 5 ms, corresponding to a lookahead. In the bottom figure, bold lines separate the decoded frames at the decoder output.
In the encoder, the transition window is zero until the folding point. Thus the coefficients of the left side of the folded window will be identical to those of the unfolded window. The portion between the folding point and the end of the CELP transition sub-frame (TR) corresponds to a sine (half-) window. In the decoder, after unfolding, the same window is applied to the signal. In the segment between the folding point and the beginning of the MDCT frame, the coefficients of the window correspond to a window of type sine. To achieve the overlap-add between the decoded CELP sub-frame and the signal from the MDCT, it is sufficient to apply a window of type cos2 to the overlap portion of the CELP sub-frame and to add the latter with the MDCT frame. The method provides a perfect reconstruction.
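The complementarity underlying this overlap-add can be checked numerically; the overlap length below is illustrative.

```python
import numpy as np

N = 80  # overlap length in samples (e.g. 5 ms at 16 kHz) -- illustrative
n = np.arange(N)

w_sin = np.sin(np.pi / (2 * N) * (n + 0.5))        # sine (half-)window
w_mdct = w_sin ** 2                                # sine applied at analysis and synthesis
w_celp = np.cos(np.pi / (2 * N) * (n + 0.5)) ** 2  # cos^2 window on the CELP sub-frame

# sin^2 + cos^2 = 1: the pair of windows sums to unity at every sample,
# so the cross-fade reconstructs the signal amplitude exactly.
total = w_mdct + w_celp
```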
However, encoded audio signal frames may be lost in the channel between the encoder and the decoder.
Existing frame-loss correction techniques are often highly dependent on the type of coding used.
In the case of speech coding based on predictive technology, such as CELP, frame loss correction is often tied to the speech model. For example, the ITU-T G.722.2 standard, in its version of July 2003, proposes replacing a lost frame by extending the long-term prediction gain while attenuating it, and extending the spectral frequency parameters (ISF, "Immittance Spectral Frequencies") representing the coefficients of the LPC filter A(z), while causing them to trend towards their respective averages. The pitch period is also repeated. The fixed codebook contribution is filled with random values. Applying such methods to transform or PCM decoders would require CELP analysis in the decoder, introducing significant added complexity. Note also that more advanced methods of frame loss correction in CELP decoding are described in the ITU-T G.718 standard, for rates of 8 and 12 kbit/s as well as for decoding rates that are interoperable with AMR-WB.
Another solution is presented in the ITU-T G.711 standard, a PCM waveform coder, for which the frame loss correction algorithm, described in Appendix I of that standard, consists of finding a pitch period in the already decoded signal and repeating it, applying an overlap-add between the already decoded signal and the repeated signal. This overlap-add erases audio artifacts but requires additional delay in the decoder (corresponding to the duration of the overlap-add) in order to implement it.
In the case of transform coding, a common technique for correcting frame loss is to repeat the last frame received. Such a technique is implemented in various standardized encoders/decoders (G.719, G.722.1, and G.722.1C in particular). For example, in the case of the G.722.1 decoder, an MLT transform ("Modulated Lapped Transform"), equivalent to an MDCT transform with 50% overlap and a sine window, ensures a sufficiently slow transition between the signal preceding the loss and the repeated frame to erase artifacts related to simple repetition of the frame.
There is little cost to such a technique, but its main deficiency is the inconsistency between the signal just before the frame loss and the repeated signal. This results in a phase discontinuity that can introduce significant audio artifacts if the duration of the overlap between the two frames is small, as is the case where the windows used for the MLT transform are low-delay windows.
In existing techniques, when a frame is missing a replacement frame is generated in the decoder using an appropriate PLC (packet loss concealment) algorithm. Note that generally a packet can contain multiple frames, so the term PLC can be ambiguous; it is used here to indicate the correction of the current lost frame. For example, after a CELP frame is correctly received and decoded, if the following frame is lost, a replacement frame based on a PLC appropriate for CELP coding is used, making use of the memory of the CELP coder. After an MDCT frame is correctly received and decoded, if the next frame is lost, a replacement frame based on a PLC appropriate for MDCT coding is generated.
In the context of the transition between CELP and MDCT frames, and considering that the transition frame is composed of a CELP sub-frame (at the same sampling frequency as the directly preceding CELP frame) and an MDCT frame comprising a modified MDCT window canceling out the "left" folding, there are situations where the existing techniques do not provide a solution.
In a first situation, a previous CELP frame has been correctly received and decoded, a current transition frame has been lost, and the next frame is an MDCT frame. In this case, after reception of the CELP frame, the PLC algorithm does not know that the lost frame is a transition frame and therefore generates a replacement CELP frame. Thus, as previously explained, the first folded portion of the next MDCT frame cannot be compensated for, and the gap between the two types of coding cannot be bridged by the CELP sub-frame contained in the transition frame (which was lost with it). No known solution addresses this situation.
In a second situation, a previous CELP frame at 12.8 kHz has been correctly received and decoded, a current CELP frame at 16 kHz has been lost, and the next frame is a transition frame. The PLC algorithm then generates a CELP frame at the frequency of the last frame received correctly, which is 12.8 kHz, and the transition CELP sub-frame (partially encoded using CELP parameters of the lost CELP frame at 16 kHz) cannot be decoded.
The present invention aims to improve this situation.
To this end, a first aspect of the invention relates to a method for decoding a digital signal encoded using predictive coding and transform coding, comprising the following steps:
- predictive decoding of a preceding frame of the digital signal, encoded by a set of predictive coding parameters;
- detecting the loss of a current frame of the encoded digital signal;
- generating, by prediction, from at least one predictive coding parameter encoding the preceding frame, a replacement frame for the current frame;
- generating, by prediction, from at least one predictive coding parameter encoding the preceding frame, an additional segment of digital signal;
- temporarily storing this additional segment of digital signal.
Thus, an additional segment of digital signal is available whenever a replacement CELP frame is generated. The predictive decoding of the preceding frame covers the predictive decoding of a correctly received CELP frame or the generation of a replacement CELP frame by a PLC algorithm suitable for CELP.
This additional segment makes a transition possible between CELP coding and transform coding, even in the case of frame loss.
Indeed, in the first situation described above, the transition to the next MDCT frame can be provided by the additional segment. As is described below, the additional segment can be added to the next MDCT frame to compensate for the first folded portion of this MDCT frame by means of a cross-fade in the region containing the time-domain aliasing that has not been undone.
In the second situation described above, decoding of the transition frame is made possible by use of the additional segment. If it is not possible to decode the transition CELP sub-frame (unavailability of CELP parameters of the preceding frame coded at 16 kHz), it is possible to replace it with the additional segment as described below.
Moreover, the calculations related to frame loss management and the transition are spread over time. The additional segment is generated and stored for each replacement CELP frame generated. The additional segment is therefore generated as soon as a frame loss is detected, without waiting for subsequent detection of a transition. The transition is thus anticipated with each frame loss, which avoids having to manage a "complexity spike" at the time when a correct new frame is received and decoded.
In one embodiment, the method further comprises the steps of:
- receiving a next frame of encoded digital signal comprising at least one segment encoded by transform; and
- decoding the next frame, comprising a sub-step of overlap-adding the additional segment of digital signal and the segment encoded by transform.

The overlap-add sub-step makes it possible to cross-fade the output signal. Such a cross-fade reduces the appearance of sound artifacts (such as "ringing noise") and ensures consistency in the signal energy.
In another embodiment, the next frame is entirely encoded by transform coding and the lost current frame is a transition frame between the preceding frame encoded by predictive coding and the next frame encoded by transform coding.
Alternatively, the preceding frame is encoded by predictive coding via a core predictive coder operating at a first frequency. In this variant, the next frame is a transition frame comprising at least one sub-frame encoded by predictive coding via a core predictive coder operating at a second frequency that is different from the first frequency. For this purpose, the next transition frame may comprise a bit indicating the frequency of the core predictive coding used.
Thus, the type of CELP coding (12.8 or 16 kHz) used in the transition CELP sub-frame can be indicated in the bit stream of the transition frame. The invention thus adds a systematic indication (one bit) to a transition frame, to enable detection of a frequency difference in the CELP encoding/decoding between the transition CELP sub-frame and the preceding CELP frame.
In another embodiment, the overlap-add is given by applying the following formula which employs linear weighting:
S(i) = B(i) · (i / (L/r)) + (1 − i / (L/r)) · T(i)

where:
r is a coefficient determining the length of the generated additional segment, which is L/r samples;
i is the index of a sample of the next frame, between 0 and L/r;
L is the length of the next frame;
S(i) is the amplitude of the next frame after addition, for sample i;
B(i) is the amplitude of the segment decoded by transform, for sample i;
T(i) is the amplitude of the additional segment of digital signal, for sample i.
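A direct transcription of this formula might look as follows, with illustrative values for L and r and placeholder signals standing in for B and T:

```python
import numpy as np

L = 640            # length of the next frame in samples (illustrative)
r = 2              # ratio: the additional segment is L/r samples long
n = L // r         # overlap length L/r

i = np.arange(n)
B = np.zeros(L)    # segment decoded by transform (placeholder values)
T = np.ones(n)     # additional segment generated by prediction (placeholder)

# S(i) = B(i) * i/(L/r) + (1 - i/(L/r)) * T(i) over the overlap region;
# beyond it the transform-decoded segment is kept unchanged.
S = B.copy()
S[:n] = B[:n] * (i / n) + (1 - i / n) * T
```

The weight on T decreases linearly from 1 towards 0 over the overlap, cross-fading from the predicted additional segment to the transform-decoded signal.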
The overlap-add can therefore be done using linear combinations and operations that are simple to implement. The time required for decoding is thus reduced while placing less load on the processor or processors used for these calculations. Alternatively, other forms of cross-fade can be implemented without changing the principle of the invention.
In one embodiment, in which the step of generating, by prediction, the replacement frame further comprises updating the internal memories of the decoder, the step of generating, by prediction, an additional segment of digital signal may comprise the following sub-steps:
- copying, to a temporary memory, the memories of the decoder that were updated during the generation by prediction of the replacement frame;
- generating the additional segment of digital signal using the temporary memory.
Thus, the internal memories of the decoder are not updated for the generation of the additional segment. As a result, the generation of the additional signal segment does not impact the decoding of the next frame, in the case where the next frame is a CELP frame.
Indeed, if the next frame is a CELP frame, the internal memories of the decoder must correspond to the states of the decoder after the replacement frame.
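This can be sketched as follows. The state fields and the synthesize function are hypothetical stand-ins; the point is only that the additional segment is generated from a copy of the memories, so the decoder's actual state is left untouched.

```python
import copy

class CelpDecoderState:
    """Hypothetical stand-in for the decoder's internal memories."""
    def __init__(self):
        self.adaptive_codebook = [0.0] * 256  # past excitation
        self.synthesis_filter = [0.0] * 16    # LPC filter memory

def generate_additional_segment(state, synthesize):
    # Run the prediction on a deep copy: the real memories keep the
    # values reached after the replacement frame, as required when the
    # next frame turns out to be a CELP frame.
    scratch = copy.deepcopy(state)
    return synthesize(scratch)

state = CelpDecoderState()

def synthesize(s):
    s.adaptive_codebook[0] = 1.0  # mutates only the scratch copy
    return [0.0] * 80             # the additional segment (placeholder)

segment = generate_additional_segment(state, synthesize)
```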
In one embodiment, the step of generating, by prediction, an additional segment of digital signal comprises the following sub-steps:
- generating, by prediction, an additional frame from at least one predictive coding parameter encoding the preceding frame;
- extracting a segment of the additional frame.
In this embodiment, the additional segment of digital signal corresponds to the first half of the additional frame. The efficiency of the method is thus further improved because the temporary calculation data used for generating the replacement CELP frame are directly available for generation of the additional CELP frame. Typically, the registers and caches in which the temporary calculation data are stored do not have to be updated, enabling direct reuse of these data for generation of the additional CELP frame.
A second aspect of the invention provides a computer program comprising instructions for implementing the method according to the first aspect of the invention, when these instructions are executed by a processor.
A third aspect of the invention provides a decoder for a digital signal encoded using predictive coding and transform coding, comprising:
- a detection unit for detecting the loss of a current frame of the digital signal;
- a predictive decoder comprising a processor arranged to carry out the following operations:
  - predictive decoding of a preceding frame of the digital signal, encoded by a set of predictive coding parameters;
  - generating, by prediction, from at least one predictive coding parameter encoding the preceding frame, a replacement frame for the current frame;
  - generating, by prediction, from at least one predictive coding parameter encoding the preceding frame, an additional segment of digital signal;
  - temporarily storing this additional segment of digital signal in a temporary memory.
In one embodiment, the decoder according to the third aspect of the invention further comprises a transform decoder comprising a processor arranged to carry out the following operations:
- receiving a next frame of encoded digital signal comprising at least one segment encoded by transform; and
- decoding the next frame, comprising a sub-step of overlap-adding the additional segment of digital signal and the segment encoded by transform.
In the encoder, the invention may comprise the insertion into the transition frame of a bit providing information about the CELP core used for coding the transition sub-frame.