In general, audio coding, and specifically speech coding, performs a mapping from an analog input audio or speech signal to a digital representation in a coding domain and back to an analog output audio or speech signal. The digital representation goes along with the quantization or discretization of the values or parameters representing the audio or speech. The quantization or discretization can be regarded as perturbing the true values or parameters with coding noise. The art of audio or speech coding lies in performing the encoding such that the effect of the coding noise in the decoded speech at a given bit rate is as small as possible. However, the given bit rate at which the speech is encoded defines a theoretical limit below which the coding noise cannot be reduced. The goal is at least to make the coding noise as inaudible as possible.
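As a minimal sketch of this view, a uniform quantizer can be written as rounding each sample value to the nearest multiple of a step size; the coding noise is then simply the difference between the quantized and the original values. The function name, the step size, and the sample values below are illustrative choices, not taken from any particular codec:

```python
import numpy as np

def uniform_quantize(x, step):
    """Round each sample to the nearest multiple of `step` (uniform quantizer)."""
    return step * np.round(np.asarray(x, dtype=float) / step)

# The coding noise is the perturbation the quantizer adds to the true values;
# for a uniform quantizer its magnitude is bounded by half the step size.
x = np.array([0.12, -0.49, 0.77, 0.301])
noise = uniform_quantize(x, 0.25) - x
```

A coarser step size corresponds to a lower bit rate and a larger noise bound, which is the trade-off the text describes.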
Scalable or embedded coding is a coding paradigm in which the coding is done in layers. The base or core layer encodes the signal at a low bit rate, while additional layers, each built on top of the previous one, provide an enhancement relative to the coding achieved with all layers from the core up to the respective previous layer. Each layer adds some additional bit rate. The generated bit stream is embedded, meaning that the bit stream of a lower-layer encoding is embedded into the bit streams of higher layers. This property makes it possible, anywhere in the transmission chain or in the receiver, to drop the bits belonging to higher layers. Such a stripped bit stream can still be decoded up to the highest layer whose bits are retained.
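The embedding property can be sketched with a toy bit-stream layout in which each layer's payload is prefixed by its length; the framing syntax below is purely illustrative and does not correspond to any standardized embedded-codec format:

```python
def pack_embedded(layers):
    """Concatenate per-layer payloads into an embedded bit stream.

    `layers` is a list of bytes objects, layer 0 being the core layer.
    Each layer is prefixed with a 2-byte big-endian length (illustrative framing).
    """
    out = bytearray()
    for payload in layers:
        out += len(payload).to_bytes(2, "big") + payload
    return bytes(out)

def strip_to_layer(stream, max_layer):
    """Drop the bits of all layers above `max_layer`.

    Because lower layers are a prefix of the embedded stream, the stripped
    stream is still decodable up to the highest retained layer.
    """
    out = bytearray()
    pos = 0
    layer = 0
    while pos < len(stream) and layer <= max_layer:
        n = int.from_bytes(stream[pos:pos + 2], "big")
        out += stream[pos:pos + 2 + n]
        pos += 2 + n
        layer += 1
    return bytes(out)
```

Stripping can thus be performed by any intermediate network node without re-encoding, which is the point of the embedded structure.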
A suitable view on the coding noise is to regard it as additive white or colored noise. There is a class of enhancement methods which, after decoding of the audio or speech signal at the decoder, modify the coding noise such that it becomes less audible, thereby improving the audio or speech quality. Such technology is usually called ‘postfiltering’, which means that the enhanced audio or speech signal is derived in some post-processing step after the actual decoder. There are many publications on speech enhancement with postfilters; some of the most fundamental papers are [1]-[4].
Relevant in the context of the invention are pitch or fine-structure postfilters. Their basic working principle is to remove at least parts of the (coding) noise that floods the spectral valleys between the harmonics of voiced speech. This is in general achieved by a weighted superposition of the decoded speech signal with time-shifted versions of it, where the time shift corresponds to the pitch lag or period of the speech. Preferably, time-shifted versions extending into future speech signal samples are also included.
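The weighted superposition described above can be sketched as a symmetric long-term filter of the form y[n] = (x[n] + g·x[n−T] + g·x[n+T]) / (1 + 2g), where T is the pitch lag. The gain values and the normalization below are illustrative assumptions; practical codecs adapt the gains to the voicing strength and re-estimate the lag per frame:

```python
import numpy as np

def pitch_postfilter(x, lag, g_past=0.25, g_future=0.25):
    """Symmetric pitch (long-term) postfilter sketch.

    Adds weighted copies of the signal shifted by one pitch period into the
    past and into the future, then normalizes, so that components repeating
    at the pitch period (the harmonics) pass through while uncorrelated
    noise in the spectral valleys is attenuated.
    """
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    n = len(x)
    for i in range(n):
        acc = x[i]
        if i - lag >= 0:          # past pitch cycle, available from history
            acc += g_past * x[i - lag]
        if i + lag < n:           # future pitch cycle, requires lookahead
            acc += g_future * x[i + lag]
        y[i] = acc / (1.0 + g_past + g_future)
    return y
```

Note that for a perfectly periodic input with period equal to `lag`, the filter leaves the interior samples unchanged, while additive noise that is uncorrelated across pitch periods is averaged down.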
One problem with pitch postfilters that evaluate future speech signal samples is that they require access to one future pitch period of the decoded audio or speech signal. Making this future signal available for the postfilter is generally possible by buffering the decoded audio or speech signal. In conversational applications of the audio or speech codec this is, however, undesirable, since it increases the algorithmic delay of the codec and hence would affect the communication quality and particularly the interactivity.