General-purpose perceptual audio coders achieve relatively high coding gains by using transforms such as the Modified Discrete Cosine Transform (MDCT) with block sizes that cover several tens of milliseconds (e.g. 20 ms). Examples of such transform-based audio codec systems are Advanced Audio Coding (AAC) and High Efficiency (HE)-AAC. However, when such transform-based audio codec systems are used for voice signals, the quality of the voice signals degrades faster than that of musical signals as the bitrate is lowered, especially in the case of dry (non-reverberant) speech signals. Hence, transform-based audio codec systems are not inherently well suited for the coding of voice signals or for the coding of audio signals comprising a voice component. In other words, transform-based audio codec systems exhibit an asymmetry with regard to the coding gain achieved for musical signals compared to the coding gain achieved for voice signals. This asymmetry may be addressed by providing add-ons to transform-based coding, wherein the add-ons aim at improved spectral shaping or signal matching. Examples of such add-ons are pre/post shaping, Temporal Noise Shaping (TNS) and Time Warped MDCT. Furthermore, this asymmetry may be addressed by incorporating a classical time domain speech coder based on short-term prediction filtering (linear predictive coding, LPC) and long-term prediction (LTP).
It can be shown that the improvements obtained by providing add-ons to transform-based coding are typically not sufficient to even out the performance gap between the coding of music signals and the coding of speech signals. On the other hand, the incorporation of a classical time domain speech coder closes the performance gap, but to such an extent that the performance asymmetry is reversed, i.e. speech signals are then coded better than music signals. This is due to the fact that classical time domain speech coders model the human speech production system and have been optimized for the coding of speech signals.
In view of the above, a transform-based audio codec may be used in combination with a classical time domain speech codec, wherein the classical time domain speech codec is used for speech segments of an audio signal and wherein the transform-based codec is used for the remaining segments of the audio signal. However, the coexistence of a time domain codec and a transform domain codec in a single audio codec system requires reliable tools for switching between the different codecs, based on the properties of the audio signal. In addition, the actual switching between a time domain codec (for speech content) and a transform domain codec (for the remaining content) may be difficult to implement. In particular, it may be difficult to ensure a smooth transition between the time domain codec and the transform domain codec (and vice versa). Furthermore, modifications to the time domain codec may be required in order to make the time domain codec more robust to the unavoidable occasional encoding of non-speech signals, for example the encoding of a singing voice with instrumental background.
The present document addresses the above-mentioned technical problems of audio codec systems. In particular, the present document describes an audio codec system which carries over only the critical features of a speech codec and thereby achieves an even performance for speech and music, while staying within the transform-based codec architecture. In other words, the present document describes a transform-based audio codec which is particularly well suited for the encoding of speech or voice signals.