In one example application, the invention may relate to coding signals that alternate between speech and music. CELP (Code-Excited Linear Prediction) techniques are generally recommended for effectively coding speech signals, alone or superposed with any sound.
CELP coders are predictive coders whose purpose is to model speech production from various elements such as:
a stochastic excitation (e.g. white noise or an algebraic excitation) modeling the flow of air emerging from the lungs in voiced and/or unvoiced periods,
a long-term prediction for modeling the vibration of the vocal cords, in a voiced period in particular, and
a short-term prediction, in the form of an LPC (Linear Predictive Coding) filter with P coefficients, for modeling changes in the vocal tract, such as the pronunciation of voiced consonants.
The number of coefficients P is chosen so as to fully model the formant structure of the speech signal. Since the speech signal generally has four formants in the frequency band 0 to 4 kHz, ten filter coefficients correctly model this structure (two coefficients are needed for modeling each formant).
For a broadband signal sampled at 16 kHz, an LPC order of 16 coefficients is typically used.
The spectrum of a speech signal is shown in FIG. 1 (as a solid line) onto which is superimposed (as a dotted line) the frequency response of an LPC filter modeling its spectral envelope.
A sampled speech signal sn, filtered through such an LPC filter, has a residual signal rn such that:
$$r_n = s_n - \sum_{i=1}^{P} a_i \, s_{n-i},$$
$a_i$ being the coefficients of the filter.
With a judicious choice of the coefficients ai, the power of the residual signal rn can be made low and its spectrum flat.
The residual signal is then simpler to code than the signal sn itself. It can easily be modeled by a harmonic, highly periodic, signal, as shown in FIG. 2, where X(f) is the spectrum of the original signal s (black line) and E(f) is the spectrum of the residual signal r (gray line).
The coefficients ai are typically calculated by measuring the autocorrelation of the signal sn and by applying a Levinson-Durbin type algorithm to solve the Wiener-Hopf equations.
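As an illustration of this calculation, the following is a minimal Python sketch, assuming NumPy and the plain autocorrelation method. The function names (`autocorrelation`, `levinson_durbin`, `lpc_residual`) and the synthetic test signal are hypothetical and chosen for this example only; a practical coder would typically add windowing and coefficient quantization, and would guard against a silent frame, for which the first autocorrelation value is zero.

```python
import numpy as np


def autocorrelation(s, order):
    """Return the autocorrelation values r[0..order] of the signal s."""
    s = np.asarray(s, dtype=float)
    n = len(s)
    return np.array([np.dot(s[: n - k], s[k:]) for k in range(order + 1)])


def levinson_durbin(r, order):
    """Solve sum_i a_i * r[|j - i|] = r[j], j = 1..order, by recursion.

    Returns (a, err): a[i - 1] holds a_i in the convention of the text
    (s_n is predicted as sum_i a_i * s_{n-i}); err is the residual energy
    predicted by the recursion.
    """
    a = np.zeros(order)
    err = float(r[0])
    for m in range(order):
        # Reflection coefficient for prediction order m + 1.
        k = (r[m + 1] - np.dot(a[:m], r[m:0:-1])) / err
        prev = a[:m].copy()
        a[:m] = prev - k * prev[::-1]
        a[m] = k
        err *= 1.0 - k * k
    return a, err


def lpc_residual(s, a):
    """Compute r_n = s_n - sum_i a_i * s_{n-i}, taking s_n = 0 for n < 0."""
    s = np.asarray(s, dtype=float)
    res = s.copy()
    for i, ai in enumerate(a, start=1):
        res[i:] -= ai * s[:-i]
    return res


if __name__ == "__main__":
    # Synthetic signal (an assumption for the demo): a second-order
    # autoregressive process driven by white noise.
    rng = np.random.default_rng(0)
    e = rng.standard_normal(4000)
    s = np.zeros_like(e)
    for n in range(2, len(s)):
        s[n] = 1.3 * s[n - 1] - 0.6 * s[n - 2] + e[n]

    a, _ = levinson_durbin(autocorrelation(s, 10), 10)
    res = lpc_residual(s, a)
    print("estimated a_1, a_2:", a[:2])
    print("prediction gain (dB):", 10 * np.log10(np.sum(s**2) / np.sum(res**2)))
```

The recursion costs on the order of P² operations, against P³ for a general linear solver, which is one reason it is the usual choice for inverting the Toeplitz system.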
Thus there are two main component elements of CELP codecs:
a modeling of the vocal tract, via short-term prediction that models the spectral envelope in the form of an LPC filter, and
a modeling of the excitation passing through the vocal tract, whether it is voiced or not.
These two parametric elements, even though they model voice signals correctly, are not intended to faithfully reproduce musical audio or mixed signals (superpositions of speech and musical sounds). In particular, the LPC filter modeling the spectral envelope, which suits a simple voice signal, is no longer adequate, and the excitation no longer fits the voiced/unvoiced model.
Notably, in the 3GPP AMR-WB+ coder, coding of mixed speech/audio signals has been provided: coding via the LPC envelope is preserved, while the excitation coding is improved.
In addition to modeling the excitation by long-term prediction and a stochastic excitation, transform coding may be added in cases where sounds do not fit the speech production model. This is termed ‘CELP+TCX’ (Transform Coded eXcitation). One such technique consists of the following steps:
estimation and coding of the LPC envelope of the signal to be coded, with a fixed number of coefficients,
selection of the excitation model (voiced/unvoiced parametric model or transform coding), and
transmission of the selected mode, the coded excitation and the LPC envelope.
Thanks to this choice of excitation coding, the quality of AMR-WB+ coding is satisfactory for audio signals consisting of speech mixed with background noise or background music, and therefore typically for signals where speech dominates in energy. Indeed, for these signals, the envelope transmitted in LPC form is a relevant parameter, since the signal is mainly composed of speech, which is well described by an LPC envelope of a given order. The envelope describes the formants (associated with the resonant frequencies of the vocal tract) to an extent that depends on the number of coefficients selected.
However, for signals with a low speech component, or even for signals not composed mainly of voice, the estimated LPC envelope transmitted to the decoder is no longer sufficient. The audio signal is then often too complex to be limited, for example, to five formants, and its evolution over time means that a fixed number of coefficients is not suitable.
Thus, for coding a complex sound, due to the limitation in coding the envelope, the coding effort is transferred to coding the excitation and the coder then loses its effectiveness.
One solution would consist in adapting the number of LPC coefficients transmitted over time, for the portions of the audio signal that require high accuracy for the envelope. This approach is, however, not viable since, in a low bitrate coding system, more accurate coding on the envelope would take away from the bitrate available for coding the excitation, and the quality would then not be improved as much.
Another solution would consist in performing a linear prediction with a ‘backward’ analysis, such that the estimation of the LPC envelope no longer applies to the signal to be coded but to the previously decoded signal, this ‘preceding’ signal being identically available to the coder and the decoder. A saving can then be made on the transmission of the LPC envelope, since the decoder can reconstruct it without any information being transmitted; the bitrate saved can be put to better use, for example in modeling the excitation. With regard to the coding of musical sounds, this linear prediction with ‘backward’ analysis can potentially be used to increase the number of filter coefficients modeling the envelope. Typically, an order of 50 can be used for fully modeling a musical signal and enabling easy coding of the residual excitation signal.
On the other hand, the use of past information does not allow changes in the audio signal to be anticipated. A backward predictor is relevant for a stationary signal: the spectrum estimated on a given frame models the following frame accurately only if the statistical properties of the signal, notably its spectral properties, remain stable. Otherwise, the estimated LPC filter is not relevant for the frame considered and the residual signal remains difficult to encode. The backward predictor then loses all its attraction.
A solution recommended in the prior art is therefore to switch between a ‘forward’ prediction filter, calculated on the current frame, and a ‘backward’ prediction filter, calculated on the previously received signal. The encoder analyzes the signal and decides whether or not it is stationary. If the signal is stationary, the backward filter is used. Otherwise, a forward filter with few coefficients is transmitted to the decoder. Such an embodiment can be used for accurate control over the quality of the residual signal to be encoded. It is implemented in ITU-T Recommendation G.729 Annex E (G.729-E), in which a decision on the stationarity of the signal results in either a ‘backward’ estimated filter with 30 coefficients or a ‘forward’ estimated filter with 10 coefficients.
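The switching principle described above can be sketched as follows, as a rough Python illustration assuming NumPy. This is not the decision procedure of G.729 Annex E: the stationarity test used here (comparing the prediction gains of the two filters on the current frame), the margin `margin_db`, the slight diagonal loading of the autocorrelation matrix and the function names are all assumptions made for the example. The previous frame passed in stands for the previously decoded signal and must contain at least 30 samples.

```python
import numpy as np

BACKWARD_ORDER = 30  # filter orders quoted in the text
FORWARD_ORDER = 10


def lpc(frame, order):
    """Autocorrelation-method LPC: s_n is predicted as sum_i a[i-1] * s_{n-i}."""
    x = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    r = np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(order + 1)])
    r[0] = r[0] * 1.0001 + 1e-9  # slight diagonal loading (assumption)
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])


def prediction_gain_db(frame, a, history):
    """Prediction gain of filter a on frame; history supplies the filter memory."""
    frame = np.asarray(frame, dtype=float)
    order = len(a)
    x = np.concatenate([np.asarray(history, dtype=float)[-order:], frame])
    res = frame.copy()
    for i, ai in enumerate(a, start=1):
        res -= ai * x[order - i : len(x) - i]
    return 10.0 * np.log10(np.sum(frame**2) / (np.sum(res**2) + 1e-12))


def select_predictor(previous_decoded, current, margin_db=3.0):
    """Return ("backward" or "forward", coefficients) for the current frame.

    The backward filter costs no bits, so it is kept whenever its gain on
    the current frame is within margin_db of the forward filter's gain
    (hypothetical criterion standing in for a stationarity decision).
    """
    a_bwd = lpc(previous_decoded, BACKWARD_ORDER)
    a_fwd = lpc(current, FORWARD_ORDER)
    g_bwd = prediction_gain_db(current, a_bwd, previous_decoded)
    g_fwd = prediction_gain_db(current, a_fwd, previous_decoded)
    if g_bwd >= g_fwd - margin_db:
        return "backward", a_bwd
    return "forward", a_fwd
```

In an actual coder the forward filter would be chosen only at the encoder, and the selected mode would be signaled to the decoder together with the quantized forward coefficients when that mode is used.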
The drawback of this technique lies mainly in the way these two estimation techniques are combined. A discontinuous choice must be made, depending on the stationarity of the signal. In the case of a ‘slight’ non-stationarity, such as the entry of an instrument in a musical ensemble, the new event should be taken into account and a new forward filter should therefore be sent. However, the signal may equally be considered sufficiently stable for the backward filter to remain appropriate. Faced with such a dilemma, the coding system tends to change configuration frequently over time, in a relatively unpredictable way, which causes distortion. Changing the processing too often is not effective, and the configuration adopted is not necessarily the best.
In summary, the prior art recommends:
a fixed forward predictor, with few filter coefficients coarsely modeling the envelope,
a fixed backward predictor with a large number of coefficients, but which cannot model the signal variations from one frame to another,
alternating between the two types of predictors, which sometimes generates troublesome discontinuities.