In a communication network that transmits speech or audio, the original speech 100 or audio is encoded by an encoder 101 at the transmitter, and an encoded bitstream 102 is transmitted to the receiver, as illustrated by FIG. 3. At the receiver, the encoded bitstream 102 is decoded by a decoder 103, which reconstructs the original speech or audio signal as a reconstructed speech (or audio) signal 104. Speech and audio coding introduces quantization noise that impairs the quality of the reconstructed speech. Postfilter algorithms 105 are therefore introduced. State-of-the-art postfilter algorithms 105 shape the quantization noise so that it becomes less audible. The existing postfilters thus improve the perceived quality of the speech signal reconstructed by the decoder, so that an enhanced speech signal 106 is provided. An overview of postfilter techniques can be found in J. H. Chen and A. Gersho, “Adaptive postfiltering for quality enhancement of coded speech”, IEEE Trans. Speech Audio Process, vol. 3, pp. 58-71, 1985.
All existing postfilters exploit the concept of signal masking, an important phenomenon in the human auditory system: a sound is inaudible in the presence of a stronger sound. In general, the masking threshold has a peak at the frequency of the masking tone and decreases monotonically on both sides of the peak. Noise components near the tone frequency (speech formants) may therefore have higher intensities than noise components that are farther away (spectral valleys). That is why existing postfilters adapt on a frame basis to the formant and/or pitch structures in the speech, in the form of autoregressive (AR) coefficients and/or pitch period.
The most popular postfilters are the formant (short-term) postfilter and the pitch (long-term) postfilter. A formant postfilter reduces the effect of quantization noise by emphasizing the formant frequencies and de-emphasizing the spectral valleys. This is illustrated in FIG. 1, where the continuous line shows the autoregressive envelope of a signal before postfiltering and the dashed line shows the autoregressive envelope of the signal after postfiltering. The pitch postfilter emphasizes frequency components at the pitch harmonic peaks, as illustrated in FIG. 2. The continuous line of FIG. 2 shows the spectrum of a signal before postfiltering, while the dashed line shows the spectrum of the signal after postfiltering. The plots of FIGS. 1 and 2 concern 30 ms blocks from a narrowband signal. It should also be noted that the plots of FIGS. 1 and 2 do not represent actual postfilter parameters, but only the concept of postfiltering.
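The two postfilter types can be illustrated with a minimal Python sketch. It is not the method of any particular codec: it assumes the common textbook form of the formant postfilter, H(z) = A(z/γn)/A(z/γd) with 0 < γn < γd < 1, and a simple feed-forward comb filter as a stand-in for the pitch postfilter. The function names, the default values of `gamma_n`, `gamma_d`, and `gain`, and the convention A(z) = 1 + a[1]z⁻¹ + ... are assumptions made for illustration only.

```python
def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma): each a[k] is scaled by gamma**k."""
    return [c * gamma ** k for k, c in enumerate(a)]


def formant_postfilter(x, a, gamma_n=0.5, gamma_d=0.8):
    """Filter x through H(z) = A(z/gamma_n) / A(z/gamma_d).

    a holds the AR polynomial A(z) = a[0] + a[1] z^-1 + ... with a[0] == 1.
    gamma_n and gamma_d are illustrative values, not taken from the text.
    """
    num = bandwidth_expand(a, gamma_n)
    den = bandwidth_expand(a, gamma_d)
    y = []
    for n in range(len(x)):
        # Feed-forward part: zeros from A(z/gamma_n).
        acc = sum(num[k] * x[n - k] for k in range(len(num)) if n >= k)
        # Feedback part: poles from 1 / A(z/gamma_d).
        acc -= sum(den[k] * y[n - k] for k in range(1, len(den)) if n >= k)
        y.append(acc)
    return y


def pitch_postfilter(x, period, gain=0.5):
    """Feed-forward comb: y[n] = (x[n] + gain * x[n - period]) / (1 + gain).

    Samples before the start of x are treated as zero.
    """
    return [
        (x[n] + gain * (x[n - period] if n >= period else 0.0)) / (1.0 + gain)
        for n in range(len(x))
    ]
```

For example, with `a = [1.0, -0.9]` the AR envelope has a single peak at zero frequency, and the sketched formant postfilter has a gain of about 1.96 at that peak and about 0.84 at the opposite end of the band, so the peak is emphasized relative to the valley. The comb filter passes components that repeat with the given `period` at unity gain and attenuates components between the harmonics.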
The formants and/or the pitch indicate how the energy is distributed within one frame, and thereby indicate which parts of the signal are masked (less audible or completely inaudible). Hence, the existing postfilter parameter adaptation exploits the signal-masking concept and therefore adapts to speech structures such as formant frequencies and pitch harmonic peaks. These are all in-frame features (such as the pitch period, which gives the pitch harmonic peaks, and the autoregressive coefficients, which determine the formants), calculated under the assumption that speech is stationary over the current frame (e.g., 20 ms of speech).
In addition to signal masking, an important psychoacoustic phenomenon is that distortion is less objectionable when the signal dynamics are high; that is, noise is aurally masked by rapid changes in the speech signal. This concept is already used for speech coding in H. Knagenhjelm and W. B. Kleijn, “Spectral dynamics is more important than spectral distortion”, ICASSP, vol. 1, pp. 732-735, 1995, and for enhancement in T. Quatieri and R. Dunn, “Speech enhancement based on auditory spectral change”, ICASSP, vol. 1, pp. 257-260, 2002. In Knagenhjelm and Kleijn, adaptation to spectral dynamics is used in line spectral frequency (LSF) quantization. In Quatieri and Dunn, adaptation to spectral dynamics is used in a pre-processor for background noise attenuation.
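As a rough illustration of what "signal dynamics" can mean in practice, the following Python sketch computes a frame-to-frame change measure as the Euclidean distance between the spectral parameter vectors of consecutive frames. This is a hypothetical measure chosen for simplicity; it is not claimed to be the measure used in either of the cited papers, and the function name `spectral_dynamics` and the choice of feature vectors (for example LSFs) are assumptions.

```python
import math


def spectral_dynamics(frames):
    """Distance between each frame's parameter vector and the previous one.

    frames is a list of equal-length vectors (e.g. one LSF vector per frame).
    Returns one value per frame transition; larger means faster change.
    """
    return [
        math.sqrt(sum((c - p) ** 2 for c, p in zip(cur, prev)))
        for prev, cur in zip(frames, frames[1:])
    ]
```

A large value marks a rapid spectral change between two frames, which is where the text above says noise is more strongly masked; a value of zero marks two frames with identical parameters.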