Compression coders exploit properties of the digital audio signal such as its local stationarity, which is used by short-term prediction filters, and its harmonic structure, which is used by long-term prediction (LTP) filters. Typically, the voiced sounds of a speech signal (such as vowels) exhibit a long-term correlation due to the vibration of the vocal cords. This long-term correlation is modeled by an LTP filter denoted P(z), which makes it possible to restore the harmonic structure by using a synthesis filter of the type:
$$H_{LT}(z) = \frac{1}{1 - P(z)}$$
The simplest form of the long-term prediction filter is the filter P(z) with a single coefficient β (also called the gain) and an integer delay T, such that $P(z) = \beta z^{-T}$. The delay T is also called the “pitch” period, or more simply the “pitch”.
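As an illustration of the single-coefficient case, the synthesis filter 1/(1 − βz^(−T)) amounts to the recursion y[n] = x[n] + β·y[n − T]. The following is a minimal sketch in Python; the function name `ltp_synthesis`, the zero initial state and the numerical values of β and T are assumptions chosen for the example, not values taken from any coder.

```python
import numpy as np


def ltp_synthesis(excitation, beta, delay):
    """Filter `excitation` through 1 / (1 - beta * z^-delay).

    Assumes a zero initial state, i.e. y[n] = 0 for n < 0.
    """
    excitation = np.asarray(excitation, dtype=float)
    output = np.zeros(len(excitation))
    for n in range(len(excitation)):
        past = output[n - delay] if n >= delay else 0.0
        output[n] = excitation[n] + beta * past
    return output


# A single unit pulse yields a decaying pulse train of period `delay`,
# which is the harmonic structure restored by the synthesis filter.
pulse = np.zeros(200)
pulse[0] = 1.0
response = ltp_synthesis(pulse, beta=0.8, delay=40)
```

The pulses of the response are spaced by T samples and scaled by successive powers of β, which is why β is called the gain and T the pitch period.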
Currently, more elaborate modelings are aimed at:
modeling with several coefficients (termed “multitap”):
$$P(z) = \sum_{i=-k}^{k} \beta_i z^{-T-i},$$
or else modeling with multiple delays:
$$P(z) = \sum_{i=-1}^{k} \beta_i z^{-iT};$$
or else modeling with a fractional delay, which uses over- and under-sampling with interpolation filters:
$$P(z) = \beta \sum_{i=0}^{2I-1} p_l(i) z^{-(T-l+i)},$$
where, for a delay (T + l/D) of resolution 1/D, the coefficients $p_l(i)$ are given by $p_l(i) = h_{inter}(iD - l)$, $0 \le l \le D-1$, $h_{inter}$ being an interpolation filter of length 2ID+1.
The parameters of the filter (delay and gain or gains) vary according to the signals to be coded and, for one and the same signal, over time. For example, in speech coding, the span of pitch periods is chosen to cover the range of fundamental frequencies of the human voice (from low voices to high voices). For one and the same talker, this frequency also varies over time. Likewise, the coefficient or coefficients of the filter also evolve over time.
On coding, the parameters of P(z) are determined either by an open-loop analysis, or by a closed-loop analysis, or usually by a combination of both. The open-loop analysis is performed by minimizing the prediction error in the signal to be modeled. The closed-loop analysis (termed “analysis by synthesis”) minimizes the quadratic error, usually weighted, between the voice signal to be modeled and the synthesis signal. Usually, an open-loop search is performed first so as to determine a first estimate of the pitch, called the “open-loop pitch”. Then, a search based on analysis by synthesis over a restricted neighborhood around this anchoring value makes it possible to obtain a more accurate value of the pitch. These analyses are performed on blocks of samples. The lengths of the open-loop and closed-loop analysis blocks are not necessarily equal. Often, a single open-loop analysis is performed for several closed-loop analyses.
For any LTP model (monotap or multitap), the determination of the LTP parameters is very expensive in terms of computational complexity. It generally consists of an open loop over a large block of samples followed by closed loops over several sub-blocks of samples (also called subframes). In particular, the open-loop search for the harmonic lag is a very expensive operation on coding. Usually, it requires the calculation of an autocorrelation function of the signal for numerous values (in fact, over the whole span of variation of the delays). In the coder according to the ITU-T G.723.1 standard, this span comprises 125 integer delays (from 18 to 142) and the open-loop delay is estimated every 15 ms (that is, for blocks of 120 samples). In the coder according to the 8 kbit/s ITU-T G.729 standard, the open-loop analysis is performed every 10 ms (at each block of 80 samples) and explores a span of 124 integer delays (from 20 to 143). This operation constitutes nearly 70% of the complexity of the LTP analysis for this type of coding.
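The cost of the open loop comes from evaluating one correlation per candidate delay. The following minimal Python sketch illustrates this using the span of delays (18 to 142) and the block length (120 samples) quoted above. The scoring criterion (correlation normalized by the energy of the delayed segment), the function name `open_loop_pitch` and the synthetic test signal are assumptions made for the example; the standards cited specify their own criteria and pre-processing, which are not reproduced here.

```python
import numpy as np

T_MIN, T_MAX = 18, 142   # span of integer delays quoted above (125 values)
BLOCK = 120              # open-loop block length quoted above


def open_loop_pitch(signal, t_min=T_MIN, t_max=T_MAX):
    """Return the integer delay whose delayed copy best matches the block.

    `signal` holds t_max samples of history followed by the analysis block.
    The score used here is an assumed, simplified criterion.
    """
    signal = np.asarray(signal, dtype=float)
    block = signal[t_max:]
    best_delay, best_score = t_min, -np.inf
    for delay in range(t_min, t_max + 1):
        # Segment of the same length as the block, shifted back by `delay`.
        past = signal[t_max - delay:len(signal) - delay]
        energy = np.dot(past, past)
        if energy == 0.0:
            continue
        score = np.dot(block, past) / np.sqrt(energy)
        if score > best_score:
            best_delay, best_score = delay, score
    return best_delay


# Synthetic periodic signal with a period of 100 samples (illustrative only).
n = np.arange(100)
period = np.sin(2 * np.pi * n / 100) + 0.5 * np.sin(4 * np.pi * n / 100)
signal = np.tile(period, 3)[:T_MAX + BLOCK]
estimated = open_loop_pitch(signal)
```

On this synthetic signal the search should recover the 100-sample period. Each call evaluates 125 inner products of 120 samples, which gives an idea of why the open loop dominates the cost of the LTP analysis.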
Even though it is focused around the delay obtained in open loop, the closed loop is also extremely expensive in terms of calculations and, consequently, resources. It requires the generation of adaptive excitations and their filtering. For example, in the G.723.1 coding, which uses a multitap LTP model, the closed-loop analysis jointly determines the vector of gains (βi) and a lag λ (as candidate pitch) of each subframe by exploring a dictionary of gain vectors for several candidate pitch values. This analysis constitutes nearly half the total complexity of the 5.3 kbit/s G.723.1 coder.
The complexity of the LTP analysis is especially critical when several codings must be performed by one and the same processing unit, such as a gateway responsible for managing numerous communications in parallel or a server distributing numerous multimedia contents. The problem of complexity is further increased by the multiplicity of compression formats which circulate on the networks. Several codings are then envisaged, either in cascade (“transcoding”) or in parallel (multi-format coding or multi-mode coding). Transcoding is typically used when, in a transmission chain, a compressed signal frame sent by a coder can no longer continue along its path in its current format. Transcoding makes it possible to convert this frame into another format compatible with the rest of the transmission chain. The most elementary solution (and the commonest at present) is to place a decoder and a coder end to end. The compressed frame, arriving in a first format, is decompressed. This decompressed signal is then re-compressed into a second format accepted by the rest of the communication chain. This cascading of a decoder and a coder is called a “tandem”. Nevertheless, this solution is very expensive in terms of complexity (essentially because of the recoding) and degrades the quality, since the second coding is performed on a decoded signal which is a degraded version of the original signal. Additionally, a frame may encounter several tandems before arriving at its destination, thereby further increasing the computational cost and the loss of quality. Furthermore, the delays related to each tandem operation accumulate and may be detrimental to the interactivity of the communications.
As regards multi-format compression systems, where one and the same content is compressed in several formats (typically in the case of content servers which broadcast one and the same content in several formats suited to the access conditions, networks and terminals of the various end users), the multi-coding operation becomes extremely complex as the number of desired formats increases, and this may rapidly saturate the resources of the systems. Another case of multiple coding in parallel is multi-mode compression with a posteriori decision, according to which, for each signal segment to be coded, several compression modes are executed and the mode which optimizes a given criterion or obtains the best rate/distortion trade-off is selected. Here again, the complexity of each of the compression modes limits their number and/or leads to a very restricted number of modes being selected a priori.
Currently, most multiple coding operations do not yet take full account of the similarities between coding formats, although doing so could reduce the complexity and the algorithmic delay while limiting the degradation introduced. For one and the same coding parameter, the differences between coders reside in the modeling, the procedure and/or the frequency of calculation, or else the quantization.
Generally, the solutions proposed today endeavor to limit the number of values explored for the parameters of a second LTP model by using the parameters chosen by the first format, to reduce the complexity of the LTP search for the second format.
Transcoding between two monotap LTP models is the simplest case. Most of the currently proposed procedures relate to transcoding between delays, the transcoding of the LTP gain usually being performed at the actual signal level (one speaks of “partial” tandem). When the two models are identical (same dictionary of delays and same subframe length), a simple copy of the binary fields of the delays from one bit stream to the other is sufficient. When the dictionaries differ in their resolution (integer or fractional ⅓, ⅙, etc.) and/or in their spans of values, a transcoding in the binary or parameter domain, with a possible transformation, is used. The transformation may be a quantization, a truncation, a doubling or a splitting. When the subframe lengths of the two formats are different, an interpolation of the delays may be provided. For example, the delays of a first format overlapping an output subframe are interpolated. The interpolated delay may then be used only when it is close to the delay obtained at the previous subframe; otherwise a conventional search is conducted. Another, more direct procedure, without interpolation, consists in selecting one delay from among these delays of the first format. This selection may be made according to several criteria: the last subframe, the subframe having the most samples in common with the subframe of the second format, or the subframe which maximizes a criterion depending on the LTP gain. The delay determined is an anchoring value for the search for the delay of the second format. It may be used as the open-loop delay of the second format, around which a conventional or restricted closed-loop search is performed, or as a first estimate of it, or as the anchor of a delay trajectory.
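The selection criterion based on the number of samples in common can be sketched as follows in Python. The function name `select_delay_by_overlap`, the subframe lengths (60 and 40 samples), the delay values and the tie-breaking rule (the earlier subframe wins) are all assumptions chosen for illustration; they are not taken from a particular pair of formats.

```python
def select_delay_by_overlap(src_delays, src_len, dst_start, dst_len):
    """Pick the delay of the source subframe sharing the most samples
    with the destination subframe [dst_start, dst_start + dst_len).

    Source subframe k is assumed to cover [k * src_len, (k + 1) * src_len).
    On a tie, the earlier source subframe is kept (assumed rule).
    """
    dst_end = dst_start + dst_len
    best_delay, best_overlap = None, 0
    for k, delay in enumerate(src_delays):
        start, end = k * src_len, (k + 1) * src_len
        overlap = max(0, min(end, dst_end) - max(start, dst_start))
        if overlap > best_overlap:
            best_delay, best_overlap = delay, overlap
    return best_delay


# Four source subframes of 60 samples, destination subframes of 40 samples.
source_delays = [40, 42, 55, 57]
anchor_a = select_delay_by_overlap(source_delays, 60, dst_start=80, dst_len=40)
anchor_b = select_delay_by_overlap(source_delays, 60, dst_start=40, dst_len=40)
```

Here `anchor_a` is the delay of the second source subframe, which fully contains the destination subframe, while `anchor_b` falls on a tie between the first two source subframes. The value returned would then serve as the anchoring value around which the search of the second format is conducted.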
In the case of a transcoding between a monotap LTP modeling and a multitap LTP modeling, the only implementation that is provided for at present is simply in the signal domain, owing to the dissimilarity of the modelings. Most of the existing transcoding techniques limit themselves to reducing the complexity of the open loop of the second format by selecting one of the delays of the first format or an interpolation of these delays as open-loop delay. However, a few techniques have been proposed for also reducing the complexity of the closed loop.
In document WO-03058407, the fractional delay λ′ of a monotap model is determined on the basis of the vector of coefficients (βi) of a multitap model by calculating the expression:
$$\lambda' = \lambda - \frac{\sum_{j=-2}^{2} j\,\beta_j^2}{\sum_{j=-2}^{2} \beta_j^2}$$
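The expression above is a delay correction weighted by the squared gains. A direct Python transcription, given as a sketch, could read as follows; the function name `fractional_delay`, the handling of an all-zero gain vector and the numerical values are assumptions made for the example.

```python
def fractional_delay(lag, betas):
    """Evaluate lag - sum(j * beta_j^2) / sum(beta_j^2) for j = -2..2.

    `betas` lists the five gains in the order beta_-2, ..., beta_2.
    If all gains are zero, the lag is returned unchanged (assumed rule).
    """
    if len(betas) != 5:
        raise ValueError("expected five gains, for j = -2..2")
    indices = range(-2, 3)
    numerator = sum(j * b * b for j, b in zip(indices, betas))
    denominator = sum(b * b for b in betas)
    if denominator == 0:
        return float(lag)
    return lag - numerator / denominator


# All the energy on the tap j = +1 shifts the delay by one sample.
shifted = fractional_delay(50, [0.0, 0.0, 0.0, 1.0, 0.0])

# A symmetric gain vector leaves the delay (almost exactly) unchanged.
centered = fractional_delay(50, [0.1, 0.2, 0.5, 0.2, 0.1])
```

Because the result depends only on the gain vector, the same correction is applied whatever the lag λ.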
In the document referenced [1]:
“An Efficient Transcoding Algorithm for G.723.1 and G.729A Speech Coders”, Sung-Wan Yoon, Sung-Kyo Jung, Young-Cheol Park, and Dae-Hee Youn, Proc. Eurospeech 2001, pp. 2499-2502,
the closed-loop search for the vector of gains of a multitap model is restricted to a subset of the dictionary of multitap gains, which is determined by the gain of the monotap model of the first format. This determination, as well as the composition of the subsets are performed as follows: the global gain of each vector of the dictionary of gains is calculated; next, on the basis of 170 global gains corresponding to the 170 vectors of the dictionary, 8 subsets are constructed and a single one of these subsets is selected depending on the LTP gain of the first monotap model.
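The principle of restricting the closed-loop search to a subset of the dictionary can be sketched as follows in Python. The text above does not specify how the global gain is defined, how the 8 subsets are composed, or how one of them is selected. The sketch therefore assumes, purely for illustration, that the global gain is the sum of the coefficients of a vector, that the subsets are groups of similar size obtained by sorting on this global gain, and that the subset whose mean global gain is closest to the monotap gain is retained. The dictionary itself is random synthetic data, not the dictionary of any coder.

```python
import numpy as np


def build_subsets(dictionary, n_subsets=8):
    """Sort the gain vectors by global gain and split them into groups.

    The global gain is assumed here to be the sum of the coefficients.
    Returns the global gains and a list of index arrays, one per subset.
    """
    global_gains = dictionary.sum(axis=1)
    order = np.argsort(global_gains)
    return global_gains, np.array_split(order, n_subsets)


def select_subset(mono_gain, global_gains, subsets):
    """Return the indices of the subset whose mean global gain is
    closest to the monotap gain of the first format (assumed rule)."""
    centers = [global_gains[idx].mean() for idx in subsets]
    nearest = int(np.argmin([abs(mono_gain - c) for c in centers]))
    return subsets[nearest]


# Synthetic dictionary of 170 five-coefficient vectors (illustrative only).
rng = np.random.default_rng(0)
dictionary = rng.uniform(-0.2, 0.6, size=(170, 5))
global_gains, subsets = build_subsets(dictionary)
candidates = select_subset(0.9, global_gains, subsets)
```

With these assumptions, the closed-loop search would explore only the 21 or 22 vectors indexed by `candidates` instead of the 170 vectors of the dictionary.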
In a variant according to the document referenced [2]:
“Transcoding algorithm for G723.1 and AMR Speech Coders: for Interoperability between VoIP and Mobile Networks”, Sung-Wan Yoon et al., Proc. Eurospeech 2003, pp. 1101-1104,
the subsets are built up by learning as follows: the span of variation of the monotap gain of an NB-AMR coder is divided into 8 subsections, then, for each subsection, a statistical study on an NB-AMR tandem makes it possible to determine M vectors of gains of the dictionaries of a coder according to the G.723.1 standard. These gain vectors are statistically the most probable. The number M is taken equal to 40 for the dictionary comprising 85 vectors and to 85 for the dictionary comprising 170 vectors. During the search for the optimal vector of gains, the exploration of the dictionary is limited to the subset associated with the subsection to which the gain of the NB-AMR coder belongs.
To the knowledge of the inventors, there is at present no technique for transcoding between two multitap LTP modelings. As was seen above, most of the current solutions relate only to monotap LTP models. Certain techniques propose a transcoding between a multitap model and a monotap model but limit themselves to reducing the complexity of the search for the open-loop delay of the second format.
Among the few approaches proposed for reducing the complexity of the closed loop, some are based on approximating a multitap LTP filter by a monotap LTP filter (fractional or otherwise). For example, in the case of an approximation of a multitap filter:
$$P_{multi}(z) = \sum_{i=-k}^{k} \beta_i z^{-T-i}$$
by a nonfractional monotap filter $P_{mono}(z) = \beta z^{-(T-\delta)}$, a gain β and a delay jitter δ are estimated such that $P_{mono}(z) \approx P_{multi}(z)$ for all the integer delays T considered.
The approximation of a multitap LTP model by a monotap LTP model has already been used in the ITU-T G.723.1 standard, specifically to estimate the adaptive prefilter and also to control the instability of the LTP filter. The studies conducted during the design of the coder according to the G.723.1 standard have shown that it is not always possible to satisfactorily approximate a multitap LTP filter by a monotap LTP filter, over a wide span of delays, with the same gain β and the same jitter δ in the delay. For one and the same vector of gains (βi), the estimate of the optimal pair (β, δ) may vary greatly as a function of the delay T. In the coder according to the G.723.1 standard, it has been possible to overcome this difficulty since the stability control procedure picks out the maximum gain from among the estimated gains (which may then be very dissimilar) and the adaptive prefilter is disabled for any vector of gains of the multitap model when, over the relevant span of delays, the estimated gains are too different or the jitters in the delay are too dissimilar or too large. While it is possible to overcome the difficulty of estimation without degrading performance for the adaptive prefiltering and instability control modules of the long-term prediction filter, this is more difficult to achieve for the LTP analysis module itself, which plays a crucial role with regard to quality. Thus, according to the vector of gains and/or the delay considered, the 170 global gains calculated for the 170 vectors of the dictionary, as seen in the prior art [1] above, may be very far from the optimal gains. Likewise, according to the vector of gains (βi) and/or the delay λ, the calculation of the fractional delay λ′, as seen in the prior art WO-03058407 above, may lead to a poor determination of the fractional delay.
Whether the approach is analytical or statistical, the approximation, over a wide range of delays, of a multitap LTP filter by a single monotap LTP filter (or the inverse approximation) is too inaccurate. To take account of the variation of the gain β and/or of the jitter δ according to the delay T, it would be possible to store a pair (β, δ) for each delay T. However, this solution would be too expensive in terms of storage, since it would require the storage of a pair for each gain vector and for each delay of the span. In the case of the approximation of the multitap LTP filters of the G.723.1 coder, which comprises two multitap dictionaries of 170 and 85 vectors, with a span of 125 delays, it would be necessary to store 31875 (= 125 × (170 + 85)) pairs. Moreover, this solution would not solve the cases where the approximation of a multitap filter by a monotap filter is really too inaccurate, or even erroneous. It will be noted that, conversely, several pairs (β, δ) may also constitute good approximations of one and the same multitap LTP filter.
The present invention intends to improve the situation.