The present invention relates to encoders for encoding an audio signal, in particular a speech related audio signal. The present invention also relates to decoders and methods for decoding an encoded audio signal. The present invention further relates to encoded audio signals and to advanced coding of unvoiced speech at low bitrates.
At low bitrates, speech coding can benefit from special handling of unvoiced frames in order to maintain the speech quality while reducing the bitrate. Unvoiced frames can be perceptually modeled as a random excitation which is shaped in both the frequency and the time domain. As the waveform and the excitation look and sound almost the same as Gaussian white noise, the waveform coding can be relaxed and replaced by synthetically generated white noise. The coding then consists of coding the time and frequency domain shapes of the signal.
FIG. 16 shows a schematic block diagram of a parametric unvoiced coding scheme. A synthesis filter 1202 is configured for modeling the vocal tract and is parameterized by LPC (Linear Predictive Coding) parameters. From the derived LPC filter comprising a filter function A(z), a perceptually weighted filter can be derived by weighting the LPC coefficients. The perceptual filter fw(n) usually has a transfer function of the form:
$$F_{fw}(z) = \frac{A(z)}{A(z/w)}$$
wherein w is lower than 1. The gain parameter gn is computed so that the synthesized energy matches the original energy in the perceptual domain, according to:
$$g_n = \sqrt{\frac{\sum_{n=0}^{L_s} s_w^2(n)}{\sum_{n=0}^{L_s} n_w^2(n)}}$$
where sw(n) and nw(n) are the input signal and the generated noise, respectively, filtered by the perceptual filter fw(n). The gain gn is computed for each subframe of size Ls. For example, an audio signal may be divided into frames with a length of 20 ms. Each frame may be subdivided into subframes, for example, into four subframes, each having a length of 5 ms.
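The gain computation described above can be illustrated with a minimal pure-Python sketch. It assumes that A(z) = 1 + a1 z^-1 + ... (so that A(z/w) has coefficients a_k w^k), that the gain scales amplitudes and is therefore the square root of the energy ratio, and that each subframe holds exactly ls samples. The function names `weight_lpc`, `filter_iir` and `subframe_gains` are hypothetical and chosen for illustration only.

```python
import math

def weight_lpc(a, w):
    """Coefficients of A(z/w): a_k is scaled by w**k (a[0] is assumed to be 1)."""
    return [ak * (w ** k) for k, ak in enumerate(a)]

def filter_iir(b, a, x):
    """Filter x with the transfer function B(z)/A(z), assuming a[0] == 1."""
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc)
    return y

def subframe_gains(s, noise, a, w, ls):
    """One gain per subframe: sqrt of the energy ratio in the perceptual domain."""
    aw = weight_lpc(a, w)
    sw = filter_iir(a, aw, s)       # input signal through A(z)/A(z/w)
    nw = filter_iir(a, aw, noise)   # generated noise through the same filter
    gains = []
    for start in range(0, len(s) - ls + 1, ls):
        es = sum(v * v for v in sw[start:start + ls])
        en = sum(v * v for v in nw[start:start + ls])
        gains.append(math.sqrt(es / en) if en > 0.0 else 0.0)
    return gains
```

As a usage example, for a 20 ms frame at an assumed sampling rate of 16 kHz the frame holds 320 samples, and calling `subframe_gains(s, noise, a, w, 80)` would return four gains, one per 5 ms subframe.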
The Code-Excited Linear Prediction (CELP) coding scheme is widely used in speech communications and is a very efficient way of coding speech. It gives a more natural speech quality than parametric coding but it also requires higher bitrates. CELP synthesizes an audio signal by feeding the sum of two excitations to a linear predictive filter, called the LPC synthesis filter, which may have the form 1/A(z). One excitation comes from the decoded past and is called the adaptive codebook. The other contribution comes from an innovative codebook populated by fixed codes. However, at low bitrates the innovative codebook is not sufficiently populated to model efficiently the fine structure of the speech or the noise-like excitation of unvoiced speech. Therefore, the perceptual quality is degraded, especially for unvoiced frames, which then sound crispy and unnatural.
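The CELP synthesis step described above may be sketched as follows. This is a simplified illustration, not a codec implementation: the two contributions are assumed to be already decoded vectors of equal length, the gains `g_p` and `g_c` are hypothetical names for the scaling applied to each contribution, and A(z) is assumed to be 1 + a1 z^-1 + ... with zero initial filter memory.

```python
def celp_synthesize(adaptive, innovative, g_p, g_c, a):
    """Sum the two excitations and pass the result through 1/A(z).

    adaptive   -- contribution taken from the decoded past (adaptive codebook)
    innovative -- fixed code taken from the innovative codebook
    g_p, g_c   -- assumed gains for the two contributions
    a          -- LPC coefficients [1, a1, a2, ...] of A(z)
    """
    excitation = [g_p * v + g_c * c for v, c in zip(adaptive, innovative)]
    out = []
    for n, e in enumerate(excitation):
        # All-pole recursion: y(n) = e(n) - sum_k a_k * y(n - k)
        acc = e - sum(a[k] * out[n - k] for k in range(1, len(a)) if n - k >= 0)
        out.append(acc)
    return excitation, out
```

In a real coder the filter memory would be carried over from the previous subframe; it is reset here only to keep the sketch short.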
To mitigate the coding artifacts at low bitrates, different solutions have already been proposed. In G.718 [1] and in [2], the codes of the innovative codebook are adaptively and spectrally shaped by enhancing the spectral regions corresponding to the formants of the current frame. The formant positions and shapes can be deduced directly from the LPC coefficients, which are already available at both the encoder and decoder sides. The formant enhancement of the codes c(n) is done by a simple filtering according to:
$$c(n) * f_e(n)$$
wherein * denotes the convolution operator and wherein fe(n) is the impulse response of the filter with transfer function:
$$F_{fe}(z) = \frac{A(z/w_1)}{A(z/w_2)}$$
where w1 and w2 are the two weighting constants that emphasize the formant structure of the transfer function Ffe(z) to a greater or lesser degree. The resulting shaped codes inherit a characteristic of the speech signal, and the synthesized signal sounds cleaner.
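The formant enhancement of a code may be sketched as a convolution of c(n) with the impulse response fe(n) of A(z/w1)/A(z/w2). The sketch below assumes A(z) = 1 + a1 z^-1 + ..., truncates fe(n) to the length of the code, and uses the hypothetical helper names `weight_lpc`, `impulse_response` and `shape_code`. No particular values of w1 and w2 are implied by the text; any values passed in are assumptions of the caller.

```python
def weight_lpc(a, w):
    """Coefficients of A(z/w): a_k is scaled by w**k (a[0] is assumed to be 1)."""
    return [ak * (w ** k) for k, ak in enumerate(a)]

def impulse_response(b, a, length):
    """First `length` samples of the impulse response of B(z)/A(z), a[0] == 1."""
    h = []
    for n in range(length):
        acc = b[n] if n < len(b) else 0.0
        acc -= sum(a[k] * h[n - k] for k in range(1, len(a)) if n - k >= 0)
        h.append(acc)
    return h

def shape_code(code, a, w1, w2):
    """Return c(n) * fe(n), truncated to the length of the code."""
    fe = impulse_response(weight_lpc(a, w1), weight_lpc(a, w2), len(code))
    return [sum(code[k] * fe[n - k] for k in range(n + 1))
            for n in range(len(code))]
```

When w1 equals w2 the numerator and denominator cancel, fe(n) reduces to a unit impulse, and the code is returned unchanged, which is a convenient sanity check.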
In CELP it is also usual to add a spectral tilt to the decoder of the innovative codebook. It is done by filtering the codes with the following filter:
$$F_t(z) = 1 - \beta z^{-1}$$
The factor β is usually related to the voicing of the previous frame and therefore varies. The voicing can be estimated from the energy contribution from the adaptive codebook. If the previous frame is voiced, it is expected that the current frame will also be voiced and that the codes should have more energy in the low frequencies, i.e., should show a negative tilt. Conversely, the added spectral tilt will be positive for unvoiced frames, and more energy will be distributed towards the high frequencies.
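The effect of the tilt filter Ft(z) = 1 − βz^-1 on a code can be illustrated with a short sketch. The function names `apply_tilt` and `tilt_gain` are hypothetical. How β is derived from the voicing is not specified here, so the sketch only takes β as an input and assumes zero filter memory at the start of the code.

```python
def apply_tilt(code, beta):
    """Apply Ft(z) = 1 - beta * z^-1, i.e. y(n) = c(n) - beta * c(n - 1)."""
    return [c - beta * (code[n - 1] if n > 0 else 0.0)
            for n, c in enumerate(code)]

def tilt_gain(beta):
    """Magnitude of Ft(z) at DC (z = 1) and at the Nyquist frequency (z = -1)."""
    return abs(1.0 - beta), abs(1.0 + beta)
```

With this filter form, a positive β attenuates the low frequencies relative to the high frequencies, while a negative β does the opposite, which follows directly from the two magnitudes returned by `tilt_gain`.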
The use of spectral shaping for speech enhancement and noise reduction of the output of the decoder is common practice. A so-called formant enhancement as post-filtering consists of an adaptive post-filtering for which the coefficients are derived from the LPC parameters of the decoder. The post-filter looks similar to the one (fe(n)) used for shaping the innovative excitation in certain CELP coders as discussed above. However, in that case, the post-filtering is only applied at the end of the decoder process and not at the encoder side.
In conventional CELP (Code-Excited Linear Prediction), the frequency shape is modeled by the LP (Linear Prediction) synthesis filter, while the time domain shape can be approximated by the excitation gain sent for every subframe. However, the Long-Term Prediction (LTP) and the innovative codebook are usually not suited for modeling the noise-like excitation of unvoiced frames. CELP therefore needs a relatively high bitrate to reach a good quality for unvoiced speech.
A voiced or unvoiced characterization may relate to segmenting speech into portions and associating each of them with a different source model of speech. The source models as used in the CELP speech coding scheme rely on an adaptive harmonic excitation simulating the air flow coming out of the glottis and a resonant filter modeling the vocal tract excited by the produced air flow. Such models may provide good results for phonemes like vowels, but may result in incorrect modeling for speech portions that are not generated by the glottis, in particular when the vocal cords are not vibrating, such as for the unvoiced phonemes “s” or “f”.
Parametric speech coders, on the other hand, are also called vocoders and adopt a single source model for unvoiced frames. They can reach very low bitrates while achieving a so-called synthetic quality, which is not as natural as the quality delivered by CELP coding schemes at much higher rates.
Thus, there is a need for enhancing audio signals.
An object of the present invention is to increase sound quality at low bitrates and/or to reduce bitrates for good sound quality.