As speech and communication devices have become ubiquitous and are often used in adverse conditions, the demand for speech enhancement methods that can cope with such environments has increased. Consequently, in mobile phones, for example, it is by now common to use noise attenuation as a pre-processing step before all subsequent speech processing, such as speech coding. Various approaches incorporate speech enhancement into speech coders [1, 2, 3, 4]. While such designs do improve the quality of the transmitted speech, cascaded processing does not allow joint perceptual optimization of quality, or at least makes joint minimization of quantization noise and interference difficult.
The goal of speech codecs is to allow transmission of high-quality speech with a minimum amount of transmitted data. Reaching this goal requires an efficient representation of the signal, such as modelling the spectral envelope of the speech signal by linear prediction, the fundamental frequency by a long-term predictor and the remainder with a noise codebook. This representation is the basis of speech codecs using the code excited linear prediction (CELP) paradigm, which is used in major speech coding standards such as Adaptive Multi-Rate (AMR), AMR-Wideband (AMR-WB), Unified Speech and Audio Coding (USAC) and Enhanced Voice Services (EVS) [5, 6, 7, 8, 9, 10, 11].
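The linear-prediction part of this representation can be illustrated with a minimal sketch, assuming the textbook autocorrelation method solved by the Levinson-Durbin recursion. The function names `autocorrelation`, `levinson_durbin` and `lp_residual` are hypothetical and chosen for illustration only; they are not taken from any of the cited standards, and the long-term predictor and noise codebook are left out.

```python
def autocorrelation(x, max_lag):
    """Biased autocorrelation r[0..max_lag] of one (already windowed) frame."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for the predictor A(z) = 1 + sum_k a[k] z^-k.

    Returns (a, err), where a[0] == 1 and err is the prediction error energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        # Reflection coefficient for stage m.
        acc = r[m] + sum(a[k] * r[m - k] for k in range(1, m))
        k_m = -acc / err
        prev = a[:]
        for k in range(1, m):
            a[k] = prev[k] + k_m * prev[m - k]
        a[m] = k_m
        err *= 1.0 - k_m * k_m
    return a, err


def lp_residual(x, a):
    """Filter x with A(z); samples before the frame start are taken as zero."""
    order = len(a) - 1
    return [
        sum(a[k] * x[n - k] for k in range(order + 1) if n - k >= 0)
        for n in range(len(x))
    ]


# Toy frame (an assumption for illustration): a decaying exponential,
# which a first-order predictor models almost exactly.
frame = [0.9 ** n for n in range(200)]
coeffs, error_energy = levinson_durbin(autocorrelation(frame, 1), 1)
residual = lp_residual(frame, coeffs)
```

In a CELP-style codec the envelope would be carried by the coefficients, while the residual is what the long-term predictor and the noise codebook would go on to model.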
For natural speech communication, speakers often use devices in hands-free mode. In such scenarios the microphone is usually far from the mouth, so the speech signal can easily become distorted by interference such as reverberation or background noise. The degradation affects not only the perceived speech quality but also the intelligibility of the speech signal, and can therefore severely impede the naturalness of the conversation. To improve the communication experience, it is then beneficial to apply speech enhancement methods to attenuate noise and reduce the effects of reverberation. The field of speech enhancement is mature and plenty of methods are readily available [12]. However, a majority of existing algorithms are based on transforms such as the short-time Fourier transform (STFT), which apply overlap-add based windowing schemes, whereas CELP codecs model the signal with a linear predictive filter and apply windowing only to the residual. Such fundamental differences make it difficult to merge enhancement and coding methods. Yet it is clear that joint optimization of enhancement and coding can potentially improve quality and reduce both delay and computational complexity.
Therefore, there is a need for an improved approach.