Coding systems are known from the state of the art. They can be used for instance for coding audio or video signals for transmission or storage.
FIG. 1 shows the basic structure of an audio coding system, which is employed for transmission of audio signals. The audio coding system comprises an encoder 10 at a transmitting side and a decoder 11 at a receiving side. An audio signal that is to be transmitted is provided to the encoder 10. The encoder is responsible for adapting the incoming audio data rate to a bitrate level at which the bandwidth conditions in the transmission channel are not violated. Ideally, the encoder 10 discards only irrelevant information from the audio signal in this encoding process. The encoded audio signal is then transmitted by the transmitting side of the audio coding system and received at the receiving side of the audio coding system. The decoder 11 at the receiving side reverses the encoding process to obtain a decoded audio signal with little or no audible degradation.
Alternatively, the audio coding system of FIG. 1 could be employed for archiving audio data. In that case, the encoded audio data provided by the encoder 10 is stored in some storage unit, and the decoder 11 decodes audio data retrieved from this storage unit. In this alternative, the target is for the encoder to achieve a bitrate which is as low as possible, in order to save storage space.
Depending on the available bitrate, different coding schemes can be applied to an audio or video signal, the term coding being employed for both encoding and decoding.
Speech signals have traditionally been coded at low bitrates and sampling rates, since very powerful speech production models exist for speech waveforms, e.g. Linear Prediction (LP) coding models. A good example of a speech coder is an Adaptive Multi-Rate Wideband (AMR-WB) coder. Music signals, on the other hand, have traditionally been coded at relatively high bitrates and sampling rates due to different user expectations. For coding music signals, typically transformation techniques and principles of psychoacoustics are applied. Good examples of music coders are generic Moving Picture Experts Group (MPEG) Layer III (MP3) and Advanced Audio Coding (AAC) audio coders. Such coders usually employ a Modified Discrete Cosine Transform (MDCT) for transforming received excitation signals into the frequency domain.
In recent years, it has been an aim to develop coding systems which can handle both speech and music at competitive bitrates and qualities, e.g. at bitrates of 20 to 48 kbps and sampling rates of 16 kHz to 24 kHz. It is well known, however, that speech coders handle music segments quite poorly, whereas generic audio coders are not able to handle speech at low bitrates. Therefore, a combination of two different coding schemes might provide a solution for filling in the gap between low bitrate speech coders and high bitrate, high quality generic audio coders. The combination of a speech coder and a transform coder is commonly known as a hybrid audio coder. A mode switching decision indicating which coder should be used for the current frame is made on a frame-by-frame basis.
In a hybrid coder, one of the main challenges is to achieve a smooth transition between the two enabled coding schemes. Abrupt changes at the frame boundaries when switching from one coder to another should be minimized, since any discontinuity will result in audible degradation in the output signal.
A smooth transition is particularly difficult to achieve when switching from a first coder, e.g. a speech coder, to an MDCT based coder.
MDCT based encoders apply an MDCT to coding frames which overlap by 50% to obtain the spectral representation of the excitation signal. For illustration, FIG. 2 shows four MDCT windows over time samples of an input signal, each MDCT window being associated with a respective one of consecutive, overlapping coding frames. As can be seen, the overlapping portion of the windows of two consecutive coding frames n, n+1 corresponds to half of the length of a coding frame.
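The 50% overlap of FIG. 2 can be illustrated numerically. The following sketch assumes the commonly used sine window (one of several windows satisfying the Princen-Bradley condition; the frame hop N is an arbitrary illustrative value, not taken from the source) and checks that the squared window values of two consecutive frames sum to one across their overlap region, which is what later overlap-add reconstruction relies on:

```python
import numpy as np

N = 512  # frame hop; each MDCT window spans 2*N samples (illustrative value)
n = np.arange(2 * N)

# Sine window, a common MDCT window satisfying the Princen-Bradley
# condition w[n]^2 + w[n + N]^2 = 1
w = np.sin(np.pi / (2 * N) * (n + 0.5))

# Consecutive windows are shifted by N samples (50% overlap), so the tail
# of the window of frame n overlaps the head of the window of frame n+1
overlap_sum = w[N:] ** 2 + w[:N] ** 2
print(np.allclose(overlap_sum, 1.0))  # True
```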
FIG. 3 illustrates how discontinuities are caused when switching from an AMR-WB speech coder to an MDCT coder. Each frame of a signal can be encoded either by an AMR-WB encoder or by an MDCT transform encoder. At the decoder, first an inverse MDCT (IMDCT) is applied to all frames which were encoded by the MDCT based transform encoder, and then the original signal is reconstructed by adding the first half of a current frame to the latter half of the preceding frame. In case a first frame n was encoded by the AMR-WB encoder and the following frame n+1 by the MDCT based transform encoder, discontinuities will be present at the decoder side at frame n+1, since the overlap component from the preceding frame n is missing.
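A minimal numerical sketch of this overlap-add reconstruction, assuming a textbook MDCT/IMDCT definition and a sine analysis/synthesis window (frame length, signal, and scaling convention are illustrative choices, not taken from the source). When both frames are MDCT coded, adding the latter half of frame n to the first half of frame n+1 recovers the overlap segment exactly; when the overlap component of frame n is missing, the first half of frame n+1 alone does not match the original signal:

```python
import numpy as np

def mdct(frame, w):
    """Forward MDCT: window a 2N-sample frame, return N spectral coefficients."""
    N = len(frame) // 2
    n = np.arange(2 * N)[:, None]
    k = np.arange(N)[None, :]
    C = np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
    return (w * frame) @ C

def imdct(X, w):
    """Inverse MDCT: N coefficients -> synthesis-windowed, time-aliased 2N samples."""
    N = len(X)
    n = np.arange(2 * N)[:, None]
    k = np.arange(N)[None, :]
    C = np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
    return w * (C @ X) * (2.0 / N)

N = 64
w = np.sin(np.pi / (2 * N) * (np.arange(2 * N) + 0.5))  # Princen-Bradley window
rng = np.random.default_rng(0)
x = rng.standard_normal(3 * N)  # three half-frame segments A, B, C

y_n  = imdct(mdct(x[:2 * N], w), w)  # frame n   covers (A, B)
y_n1 = imdct(mdct(x[N:], w), w)      # frame n+1 covers (B, C), 50% overlap

# Overlap-add: latter half of frame n + first half of frame n+1
# recovers the middle segment B exactly (the time aliases cancel)
B_rec = y_n[N:] + y_n1[:N]
print(np.allclose(B_rec, x[N:2 * N]))  # True

# If frame n was encoded by another coder (e.g. AMR-WB), its overlap
# component y_n[N:] is missing and the alias term in frame n+1 remains
print(np.allclose(y_n1[:N], x[N:2 * N]))  # False
```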
The overlap component is important for the reconstruction, since it contains the original windowed signal and in addition the time aliased version of the windowed signal.
As described by Y. Wang, M. Vilermo, et al. in “Restructured audio encoder for improved computational efficiency”, 108th AES Convention, Paris 2000, Preprint 5103, the MDCT works such that a signal sequence of 2N samples contains the following components: between time samples 0 and N−1, the original windowed signal plus the mirrored and inverted original windowed signal; between time samples N and 2N−1, the original windowed signal plus the mirrored original windowed signal. The mirrored components are time aliases and will be canceled in the overlap-add operation.
In case the overlap component from the preceding frame is missing, the alias term cannot be canceled from the current frame n+1. This will result in audible degradation in the output signal.
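The frame structure described above can be verified directly: applying a textbook MDCT and then a raw inverse MDCT (without the synthesis window) to one windowed frame yields, in the first half, the windowed signal minus its mirrored (inverted) copy, and in the second half, the windowed signal plus its mirrored copy. A sketch, with an illustrative frame length and scaling convention not taken from the source:

```python
import numpy as np

def mdct(xw):
    """MDCT of an already windowed 2N-sample sequence -> N coefficients."""
    N = len(xw) // 2
    n = np.arange(2 * N)[:, None]
    k = np.arange(N)[None, :]
    C = np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
    return xw @ C

def imdct_raw(X):
    """Inverse MDCT without synthesis window: N coefficients -> 2N aliased samples."""
    N = len(X)
    n = np.arange(2 * N)[:, None]
    k = np.arange(N)[None, :]
    C = np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
    return (C @ X) * (2.0 / N)

N = 64
w = np.sin(np.pi / (2 * N) * (np.arange(2 * N) + 0.5))  # sine analysis window
rng = np.random.default_rng(1)
xw = w * rng.standard_normal(2 * N)  # original windowed signal

y = imdct_raw(mdct(xw))

# Samples 0..N-1: windowed signal plus its mirrored, inverted copy
print(np.allclose(y[:N], xw[:N] - xw[:N][::-1]))  # True
# Samples N..2N-1: windowed signal plus its mirrored copy
print(np.allclose(y[N:], xw[N:] + xw[N:][::-1]))  # True
```

Without the matching alias contribution from the preceding frame, these mirrored terms remain in the decoded output, which is the source of the audible degradation described above.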
In document “High-level description for the ITU-T wideband (7 kHz) ATCELP speech coding algorithm of Deutsche Telekom, Aachen University of Technology (RWTH) and France Telecom (CNET)”, ITU-T SG16 delayed contribution D.130, February 1998, by Deutsche Telekom and France Telecom, it is proposed to use a special transition window and an extrapolation when switching from a Code Excited Linear Prediction (CELP) coder to an Adaptive Transform Coder (ATC). The transition window enables the ATC to decode the last samples of a frame. The first samples are obtained by extrapolating the samples from the previous frames via an LP filter. Such an extrapolation, however, might introduce discontinuities and artifacts, especially in the case where the frame boundaries are at the onset of a transient signal segment.