1. Field of the Invention
The present invention relates to a method and device for post-processing a decoded sound signal in order to enhance the perceived quality of this decoded sound signal.
This post-processing method and device can be applied, in particular but not exclusively, to digital encoding of sound (including speech) signals. They can also be applied to the more general case of signal enhancement, where the noise source can be from any medium or system and is not necessarily related to encoding or quantization noise.
2. Brief Description of the Current Technology
2.1 Speech Encoders
Speech encoders are widely used in digital communication systems to efficiently transmit and/or store speech signals. In digital systems, the analog input speech signal is first sampled at an appropriate sampling rate, and the successive speech samples are further processed in the digital domain. In particular, a speech encoder receives the speech samples as an input, and generates a compressed output bit stream to be transmitted through a channel or stored on an appropriate storage medium. At the receiver, a speech decoder receives the bit stream as an input, and produces an output reconstructed speech signal.
To be useful, a speech encoder must produce a compressed bit stream with a bit rate lower than the bit rate of the digital, sampled input speech signal. State-of-the-art speech encoders typically achieve a compression ratio of at least 16 to 1 while still enabling the decoding of high-quality speech. Many of these state-of-the-art speech encoders are based on the CELP (Code-Excited Linear Predictive) model, of which many variants exist.
In CELP encoding, the digital speech signal is processed in successive blocks of speech samples called frames. For each frame, the encoder extracts from the digital speech samples a number of parameters that are digitally encoded, and then transmitted and/or stored. The decoder is designed to process the received parameters to reconstruct, or synthesize, the given frame of speech signal. Typically, the following parameters are extracted from the digital speech samples by a CELP encoder:
- Linear Prediction coefficients (LP coefficients), transmitted in a transformed domain such as the Line Spectral Frequencies (LSF) or Immittance Spectral Frequencies (ISF);
- pitch parameters, including a pitch delay (or lag) and a pitch gain; and
- innovative excitation parameters (fixed codebook index and gain).

The pitch parameters and the innovative excitation parameters together describe what is called the excitation signal. This excitation signal is supplied as an input to a Linear Prediction (LP) filter described by the LP coefficients. The LP filter can be viewed as a model of the vocal tract, whereas the excitation signal can be viewed as the output of the glottis. The LP or LSF coefficients are typically calculated and transmitted every frame, whereas the pitch and innovative excitation parameters are calculated and transmitted several times per frame. More specifically, each frame is divided into several signal blocks called subframes, and the pitch parameters and the innovative excitation parameters are calculated and transmitted every subframe. A frame typically has a duration of 10 to 30 milliseconds, whereas a subframe typically has a duration of 5 milliseconds.
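The decoder-side reconstruction described above can be illustrated with a minimal sketch. The function name, the simplified adaptive-codebook handling (assuming the pitch delay is at least one subframe long), and the LP sign convention A(z) = 1 - sum a_k z^-k are illustrative assumptions, not taken from any particular standard:

```python
import numpy as np

def celp_synthesize_subframe(lp_coeffs, pitch_delay, pitch_gain,
                             fixed_codevector, fixed_gain, past_excitation):
    """Sketch of one CELP subframe synthesis: the excitation is the sum
    of an adaptive (pitch) contribution and a scaled fixed codevector,
    then filtered through the all-pole LP synthesis filter 1/A(z) with
    A(z) = 1 - sum_k a_k z^-k  (sign convention assumed here).
    Assumes pitch_delay >= subframe length for simplicity."""
    n = len(fixed_codevector)
    # Adaptive contribution: repeat the past excitation at the pitch lag.
    adaptive = np.array([past_excitation[-pitch_delay + i] for i in range(n)])
    excitation = pitch_gain * adaptive + fixed_gain * fixed_codevector
    # All-pole synthesis: y[n] = e[n] + sum_k a_k * y[n-k]
    synth = np.zeros(n)
    memory = np.zeros(len(lp_coeffs))  # past synthesized samples
    for i in range(n):
        synth[i] = excitation[i] + np.dot(lp_coeffs, memory)
        memory = np.concatenate(([synth[i]], memory[:-1]))
    return synth, excitation
```

With an impulse as fixed codevector and a single LP coefficient of 0.5, the output decays geometrically, as expected of a first-order all-pole filter.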
Several speech encoding standards are based on the Algebraic CELP (ACELP) model, and more precisely on the ACELP algorithm. One of the main features of ACELP is the use of algebraic codebooks to encode the innovative excitation at each subframe. An algebraic codebook divides a subframe into a set of tracks of interleaved pulse positions. Only a few non-zero-amplitude pulses per track are allowed, and each non-zero-amplitude pulse is restricted to the positions of the corresponding track. The encoder uses fast search algorithms to find the optimal positions and amplitudes of the pulses of each subframe. A description of the ACELP algorithm can be found in the article by R. SALAMI et al., "Design and description of CS-ACELP: a toll quality 8 kb/s speech coder", IEEE Trans. on Speech and Audio Proc., Vol. 6, No. 2, pp. 116-130, March 1998, herein incorporated by reference, which describes the ITU-T G.729 CS-ACELP narrowband speech encoding algorithm at 8 kbits/second. It should be noted that there are several variations of the ACELP innovation codebook search, depending on the standard concerned. The present invention does not depend on these variations, since it applies only to post-processing of the decoded (synthesized) speech signal.
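The track-interleaving structure of an algebraic codebook can be sketched as follows. The track layout (track t holding positions t, t + n_tracks, t + 2*n_tracks, ...) and unit pulse amplitudes are a common illustrative arrangement, not the exact layout of any specific standard:

```python
import numpy as np

def acelp_codevector(subframe_len, n_tracks, pulses):
    """Build an algebraic codevector from a list of (track, position_index,
    sign) pulses. Track t contains the interleaved positions
    t, t + n_tracks, t + 2*n_tracks, ...  Each pulse is restricted to the
    positions of its own track and has unit amplitude (+1 or -1)."""
    code = np.zeros(subframe_len)
    for track, pos_idx, sign in pulses:
        position = track + pos_idx * n_tracks  # stay inside the track
        code[position] += sign
    return code

# Illustrative example: 64-sample subframe, 4 tracks, one pulse per track.
cv = acelp_codevector(64, 4, [(0, 3, +1), (1, 0, -1), (2, 7, +1), (3, 15, -1)])
```

Because only a few pulse positions and signs need to be transmitted, such codebooks can be searched and encoded very efficiently.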
A recent standard based on the ACELP algorithm is the ETSI/3GPP AMR-WB speech encoding algorithm, which was also adopted by the ITU-T (Telecommunication Standardization Sector of the International Telecommunication Union) as Recommendation G.722.2 [ITU-T Recommendation G.722.2, "Wideband coding of speech at around 16 kbit/s using Adaptive Multi-Rate Wideband (AMR-WB)", Geneva, 2002], [3GPP TS 26.190, "AMR Wideband Speech Codec: Transcoding Functions," 3GPP Technical Specification]. AMR-WB is a multi-rate algorithm designed to operate at nine different bit rates between 6.6 and 23.85 kbits/second. Those of ordinary skill in the art know that the quality of the decoded speech generally increases with the bit rate. AMR-WB has been designed to allow cellular communication systems to reduce the bit rate of the speech encoder in the case of bad channel conditions; the bit rate thus saved is reallocated to channel encoding bits to increase the protection of the transmitted bits. In this manner, the overall quality of the transmitted speech can be kept higher than in the case where the speech encoder operates at a single fixed bit rate.
FIG. 7 is a schematic block diagram showing the principle of the AMR-WB decoder. More specifically, FIG. 7 is a high-level representation of the decoder, emphasizing the fact that the received bitstream encodes the speech signal only up to 6.4 kHz (12.8 kHz sampling frequency), and the frequencies higher than 6.4 kHz are synthesized at the decoder from the lower-band parameters. This implies that, in the encoder, the original wideband, 16 kHz-sampled speech signal was first down-sampled to the 12.8 kHz sampling frequency, using multi-rate conversion techniques well known to those of ordinary skill in the art. The parameter decoder 701 and the speech decoder 702 of FIG. 7 are analogous to the parameter decoder 106 and the source decoder 107 of FIG. 1. The received bitstream 709 is first decoded by the parameter decoder 701 to recover parameters 710 supplied to the speech decoder 702 to resynthesize the speech signal. In the specific case of the AMR-WB decoder, these parameters are:
- ISF coefficients for every 20-millisecond frame;
- an integer pitch delay T0, a fractional pitch value T0_frac around T0, and a pitch gain for every 5-millisecond subframe; and
- an algebraic codebook shape (pulse positions and signs) and gain for every 5-millisecond subframe.

From the parameters 710, the speech decoder 702 is designed to synthesize a given frame of speech signal for the frequencies equal to and lower than 6.4 kHz, and thereby produce a low-band synthesized speech signal 712 at the 12.8 kHz sampling frequency. To recover the full-band signal corresponding to the 16 kHz sampling frequency, the AMR-WB decoder comprises a high-band resynthesis processor 707 responsive to the decoded parameters 710 from the parameter decoder 701 to resynthesize a high-band signal 711 at the sampling frequency of 16 kHz.
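The final recombination step implied here (bringing the 12.8 kHz low-band synthesis up to 16 kHz and summing it with the resynthesized high band) can be sketched as below. The 5/4 resampling ratio follows from the two sampling frequencies; the function name and the use of SciPy's polyphase resampler are illustrative choices:

```python
import numpy as np
from scipy.signal import resample_poly

def combine_bands(low_band_12k8, high_band_16k):
    """Up-sample the low-band synthesis from 12.8 kHz to 16 kHz
    (rational ratio 16/12.8 = 5/4) and add the high-band signal,
    which is already at 16 kHz with its energy above 6.4 kHz."""
    low_band_16k = resample_poly(low_band_12k8, up=5, down=4)
    n = min(len(low_band_16k), len(high_band_16k))
    return low_band_16k[:n] + high_band_16k[:n]
```

A 20-millisecond frame of 256 samples at 12.8 kHz becomes 320 samples at 16 kHz, matching the high-band frame length.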
The details of the high-band signal resynthesis processor 707 can be found in the following publications, which are herein incorporated by reference:
- ITU-T Recommendation G.722.2, "Wideband coding of speech at around 16 kbit/s using Adaptive Multi-Rate Wideband (AMR-WB)", Geneva, 2002; and
- 3GPP TS 26.190, "AMR Wideband Speech Codec: Transcoding Functions," 3GPP Technical Specification.

The output of the high-band resynthesis processor 707, referred to as the high-band signal 711 of FIG. 7, is a signal at the 16 kHz sampling frequency whose energy is concentrated above 6.4 kHz. The processor 708 sums the high-band signal 711 with the 16-kHz up-sampled low-band speech signal 713 to form the complete decoded speech signal 714 of the AMR-WB decoder at the 16 kHz sampling frequency.

2.2 Need for Post-Processing
Whenever a speech encoder is used in a communication system, the synthesized or decoded speech signal is never identical to the original speech signal, even in the absence of transmission errors. The higher the compression ratio, the higher the distortion introduced by the encoder. This distortion can be made subjectively small using different approaches. A first approach is to condition the signal at the encoder to better describe, or encode, subjectively relevant information in the speech signal. The use of a formant weighting filter, often represented as W(z), is a widely used example of this first approach [B. Kleijn and K. Paliwal, editors, "Speech Coding and Synthesis," Elsevier, 1995]. This filter W(z) is typically made adaptive, and is computed in such a way that it reduces the signal energy near the spectral formants, thereby increasing the relative energy of lower-energy bands. The encoder can then better quantize the lower-energy bands, which would otherwise be masked by encoding noise, increasing the perceived distortion. Another example of signal conditioning at the encoder is the so-called pitch sharpening filter, which enhances the harmonic structure of the excitation signal at the encoder. Pitch sharpening aims at ensuring that the inter-harmonic noise level is kept low enough in the perceptual sense.
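A common construction of such a weighting filter, which the sketch below assumes, is W(z) = A(z/gamma1) / A(z/gamma2), obtained by bandwidth-expanding the LP analysis polynomial A(z) with two factors gamma1 > gamma2. The function name and the gamma values are illustrative, not taken from any particular codec:

```python
import numpy as np

def weighting_filter_coeffs(a, gamma1=0.92, gamma2=0.68):
    """Return numerator and denominator coefficients of the perceptual
    weighting filter W(z) = A(z/gamma1) / A(z/gamma2), where
    a = [1, a_1, ..., a_p] are the LP analysis coefficients of
    A(z) = 1 + sum_k a_k z^-k.  Replacing z by z/gamma scales the
    k-th coefficient by gamma**k, expanding the formant bandwidths."""
    k = np.arange(len(a))
    num = a * gamma1 ** k   # A(z/gamma1)
    den = a * gamma2 ** k   # A(z/gamma2)
    return num, den
```

Because gamma2 < gamma1, the resulting frequency response dips near the formants, which is the de-emphasis behaviour described above.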
A second approach to minimize the perceived distortion introduced by a speech encoder is to apply a so-called post-processing algorithm. Post-processing is applied at the decoder, as shown in FIG. 1. In FIG. 1, the speech encoder 101 and the speech decoder 105 are each broken down into two modules. In the case of the speech encoder 101, a source encoder 102 produces a series of speech encoding parameters 109 to be transmitted or stored. These parameters 109 are then binary encoded by the parameter encoder 103 using a specific encoding method, depending on the speech encoding algorithm and on the parameters to encode. The encoded speech signal (binary encoded parameters) 110 is then transmitted to the decoder through a communication channel 104. At the decoder, the received bit stream 111 is first analyzed by a parameter decoder 106 to recover the encoded sound signal parameters, which are then used by the source decoder 107 to generate the synthesized speech signal 112. The aim of post-processing (see post-processor 108 of FIG. 1) is to enhance the perceptually relevant information in the synthesized speech signal, or equivalently to reduce or remove the perceptually annoying information. Two commonly used forms of post-processing are formant post-processing and pitch post-processing. In the first case, the formant structure of the synthesized speech signal is amplified by the use of an adaptive filter with a frequency response correlated to the speech formants. The spectral peaks of the synthesized speech signal are then accentuated at the expense of spectral valleys, whose relative energy becomes smaller. In the case of pitch post-processing, an adaptive filter is also applied to the synthesized speech signal. However, in this case, the filter's frequency response is correlated to the fine spectral structure, namely the harmonics. A pitch post-filter then accentuates the harmonics at the expense of inter-harmonic energy, which becomes relatively smaller.
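A pitch post-filter of the kind described above can be sketched, in its simplest first-order comb form, as follows. The fixed gain, the normalization, and the function name are illustrative assumptions; actual codecs adapt the gain to the signal:

```python
import numpy as np

def pitch_postfilter(x, pitch_lag, g=0.5):
    """Minimal sketch of a pitch post-filter
        y[n] = (x[n] + g * x[n - T]) / (1 + g)
    with T = pitch_lag.  Components periodic at the pitch period add
    coherently (harmonics are reinforced), while inter-harmonic
    components add incoherently and are attenuated.  The first T
    samples are passed through unfiltered for simplicity."""
    y = np.asarray(x, dtype=float).copy()
    y[pitch_lag:] = (x[pitch_lag:] + g * x[:-pitch_lag]) / (1.0 + g)
    return y
```

For a perfectly periodic input with period equal to the pitch lag, x[n - T] = x[n] and the filter leaves the signal unchanged, while non-harmonic components are reduced by up to a factor (1 - g)/(1 + g).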
Note that the frequency response of a pitch post-filter typically covers the whole frequency range. As a result, a harmonic structure is imposed on the post-processed speech even in frequency bands that did not exhibit a harmonic structure in the decoded speech. This is not a perceptually optimal approach for wideband speech (speech sampled at 16 kHz), which rarely exhibits a periodic structure over the whole frequency range.