A stereo signal includes at least two channels, that is a left channel and a right channel. In addition, stereo signals may also comprise a left and a right surround channel. There is also the possibility that a stereo signal comprises five different channels, that is a front left channel, a front center channel and a front right channel as well as a left back channel and a right back channel.
For a data-reduced coding of stereo signals, similarities between at least two channels can be exploited to reduce the number of bits required to code the stereo signal.
A well-known method for processing stereo signals to obtain an efficient coding is the center/side method (M/S method). In the M/S method, the first channel and the second channel are combined with each other to give a center channel and a side channel. For reasons of clarity, the two channels are referred to herein not as a first and a second channel, but as a left channel (L channel) and a right channel (R channel). It is known that the center channel equals the sum of the left channel L and the right channel R, multiplied by a factor of 0.5, while the side channel is the difference of the left channel L and the right channel R, multiplied by a factor of, for example, 0.5 (other factors are also possible). Expressed as equations, this means:
M = 0.5 · (L + R)
S = 0.5 · (L − R)
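The two equations and their inversion can be illustrated by a minimal sketch. The function names `ms_encode` and `ms_decode` and the sample values are assumptions chosen for illustration only; the inversion L = M + S, R = M − S follows directly from the equations when the factor 0.5 is used.

```python
def ms_encode(left, right):
    """Return (mid, side) using the 0.5 scaling given in the text."""
    mid = [0.5 * (l + r) for l, r in zip(left, right)]
    side = [0.5 * (l - r) for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid, side):
    """Invert ms_encode: L = M + S, R = M - S."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


# Illustrative sample values (not taken from any real recording).
left = [1.0, 0.5, -0.25, 0.0]
right = [1.0, 0.25, -0.25, 0.5]

mid, side = ms_encode(left, right)
rec_left, rec_right = ms_decode(mid, side)
```

Without quantizing, the round trip reproduces the left and the right channel exactly, which shows that the M/S processing itself loses no information.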
If the left channel L and the right channel R are relatively similar to each other, an M/S processing brings a considerable saving in the bit quantity required for coding, since the side channel will have considerably less energy than L or R. In the borderline case in which the left channel L and the right channel R are identical, the center channel will equal the left channel L or the right channel R, while the side channel equals 0. Since the side channel equals 0, a theoretical maximum bit rate saving of 50% is obtained when coding, because only the center channel has to be coded, while not a single bit has to be devoted to the side channel.
Thus, as a general rule, the more similar the right and left channels are, the lower in energy the side channel will be and the fewer bits will be required for coding the side channel.
A listener perceives the similarity of the left and the right channel in that, in the case of identical channels, a speaker or an orchestra is perceived exactly in the middle between the two loudspeakers. Dissimilar channels, on the other hand, are perceived as a pronounced stereo effect, that is, a speaker, an orchestra or individual instruments of an orchestra can be localized precisely on the left and/or the right.
Consider the case in which the left channel comprises a high amount of energy and the right channel comprises only little energy, for example, when a single instrument is arranged on the very left side of the recording room and is audible only in the left channel, while there is solely some noise on the right channel. After an M/S processing, the center channel will then approximately equal the left channel, apart from the scaling factor, and so will the side channel. In this case, both the center channel and the side channel contain approximately the same amount of energy and both have to be coded by a relatively large number of bits. Compared to an L/R coding, the bit quantity required for coding this signal constellation has not decreased due to the M/S coding but, in the borderline case in which the left channel L includes a certain amount of energy while the right channel R equals 0, has even doubled. In this case, it would have been considerably more advantageous not to perform an M/S processing, but solely an L/R processing. The effect on the number of bits required for coding a stereo signal thus extends from a saving of 50% in one extreme case to a doubling of the bits required for coding in the other extreme case. Thus, when the M/S method is applied, it has to be checked whether the signal is suitable for an M/S processing or not.
In the case in which a section of a stereo signal (for example, a section of 20 ms, also called a frame) is not suitable for an M/S processing, the M/S processing is dispensed with for reasons of bit efficiency and both the left and the right channel are coded individually. This “normal” case is also called L/R processing.
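How such a check could look is not specified here. The following is merely a sketch of one conceivable heuristic, assuming that the ratio of side energy to center energy is compared to a threshold; the function name `prefer_ms` and the threshold value 0.25 are hypothetical and are not taken from any standard or from the method described.

```python
def energy(x):
    """Sum of squared sample values."""
    return sum(v * v for v in x)


def prefer_ms(left, right, ratio_threshold=0.25):
    """Hypothetical check: choose M/S when the side channel carries
    little energy compared to the center channel."""
    mid = [0.5 * (l + r) for l, r in zip(left, right)]
    side = [0.5 * (l - r) for l, r in zip(left, right)]
    e_mid, e_side = energy(mid), energy(side)
    if e_mid == 0.0:
        # No center energy to compare against; this sketch falls back to L/R.
        return False
    return e_side / e_mid < ratio_threshold


# Borderline case 1: identical channels, so S = 0.
identical = prefer_ms([1.0, -1.0, 0.5], [1.0, -1.0, 0.5])

# Borderline case 2: right channel is 0, so M = S = 0.5 * L.
hard_left = prefer_ms([1.0, -1.0, 0.5], [0.0, 0.0, 0.0])
```

For identical channels the sketch selects M/S, whereas for a signal present only in the left channel it selects L/R, in line with the two extreme cases described above.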
Conventional audio coding methods, as are used, for example, to code audio signals which are decoded according to one of the MPEG standards, are generally divided into several steps. At first, an audio signal, present, for example, in the form of PCM sample values as output by a CD player, is transformed into a spectral representation by means of a time-frequency transform or a filter bank. Typically, a block with a certain number of sample values, also called a “frame”, is used to generate a block of complex spectral values forming a short-time spectrum of the frame of audio sample values (“samples”). The block formation is obtained using transform windows which are, for example, 1024 sample values long. If, for example, overlapping windows with an overlapping region of 50% are used for transforming, 1024 spectral values are formed from 1024 sample values. These spectral values are then quantized by means of a well-known iteration process, whereupon the quantized spectral values are subjected to an entropy coding, for example, using a plurality of fixed Huffman code tables, to finally obtain a bit stream which, on the one hand, contains the coded quantized spectral values and, on the other hand, comprises side information relating to the windows, to the scale factors calculated when quantizing and to further information required for decoding the bit stream.
A center/side processing can be performed prior to the transform into the spectral range, that is using the digital time-discrete sample values. Alternatively, a center/side processing can be performed after the transform, that is using the complex spectral values. The latter alternative offers the additional advantage that a center/side processing can be used not only for the whole spectrum, as is the case in the time domain, but also selectively for certain frequency bands, so that certain spectral values are subjected to a center/side processing and others are not.
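The band-selective alternative can be sketched as follows. The function name `ms_per_band`, the band boundaries and the per-band flags are assumptions made for illustration; real spectral values are used for brevity, although the same arithmetic applies to complex values.

```python
def ms_per_band(left_spec, right_spec, band_edges, use_ms):
    """Apply M/S only in bands whose flag in use_ms is True.

    Band b covers the indices band_edges[b] .. band_edges[b + 1] - 1.
    Returns two lists: the first holds M (or L), the second S (or R),
    depending on the flag of the band each index belongs to.
    """
    ch0, ch1 = list(left_spec), list(right_spec)
    for b, flag in enumerate(use_ms):
        if not flag:
            continue  # this band stays in L/R form
        for k in range(band_edges[b], band_edges[b + 1]):
            l, r = left_spec[k], right_spec[k]
            ch0[k] = 0.5 * (l + r)
            ch1[k] = 0.5 * (l - r)
    return ch0, ch1


# Two bands of two spectral values each; only the lower band uses M/S.
left_spec = [4.0, 2.0, 1.0, 0.5]
right_spec = [4.0, 1.0, 0.0, 0.25]
ch0, ch1 = ms_per_band(left_spec, right_spec,
                       band_edges=[0, 2, 4], use_ms=[True, False])
```

In this sketch, the per-band flags would have to be transmitted as side information so that a decoder knows which bands to convert back.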
Usually, audio coders are designed in such a way that they provide a constant bit rate, that is a certain number of bits per second. Another boundary condition is that the quantizing noise introduced by quantizing is, if possible, selected in such a way that its energy is under the psychoacoustic masking threshold or listening threshold of the audio signal. The fundamental method of setting the quantizing noise in the frequency range consists in “shaping” the noise using the scale factors. For this purpose, the spectrum is divided, as is well-known, into several groups of spectral coefficients called scale factor bands, each of which is associated with an individual scale factor. A scale factor represents a multiplication value used to change the amplitude of all spectral coefficients in its scale factor band. This mechanism is used to set the allocation of the quantizing noise generated by the quantizer in the spectral range in such a way that the energy of the quantizing noise in each scale factor band is under the psychoacoustic masking threshold in this scale factor band. It can be seen that neither the quantizing nor the entropy coding is a process favouring a constant bit rate. On the contrary, both processes favour a variable bit rate. For transmission applications, however, it is often required that the coder provides a constant bit rate at its output. In order to provide a constant bit rate, a so-called bit reservoir is usually used. If the audio signal is such that temporarily fewer bits are required than are preset by the outer bit rate at the output of the coder, the unused bits are added to the bit reservoir so that more bits can be spent on a later audio signal section requiring more bits for coding, by which the bit reservoir is emptied again.
It is to be noted that one boundary condition of such a coder is, as has been mentioned, the constant output bit rate and that the other boundary condition is that the quantizing noise be smaller than or equal to the psychoacoustic masking threshold, so that it is masked or covered by the audio signal.
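The behaviour of a bit reservoir can be illustrated by a simplified simulation. The function name `run_bit_reservoir`, the frame budget, the reservoir size and the bit demands are hypothetical values chosen for illustration; actual coders define their reservoir rules in considerably more detail.

```python
def run_bit_reservoir(frame_demands, bits_per_frame, reservoir_max):
    """Track the reservoir fill level for a constant outer bit rate.

    frame_demands: bits each frame would like to spend (inner demand).
    Returns a list of (bits_granted, reservoir_after) per frame.
    """
    reservoir = 0
    history = []
    for demand in frame_demands:
        available = bits_per_frame + reservoir
        granted = min(demand, available)
        # Unused bits are saved, but only up to the reservoir capacity.
        # In this sketch any surplus beyond the capacity is simply dropped.
        reservoir = min(available - granted, reservoir_max)
        history.append((granted, reservoir))
    return history


history = run_bit_reservoir([600, 700, 1500, 1200],
                            bits_per_frame=1000, reservoir_max=500)
```

In this example, the first two frames fill the reservoir, the third frame empties it completely, and the fourth frame is granted only 1000 of the 1200 bits it demands, so that the coder would have to reduce its bit demand by other means.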
The following deals with what has to be done when the “inner bit rate” of the coder differs from the outer constant output bit rate. If the inner bit rate is so low that, for example, the bit reservoir is filled to its maximum value, there is, of course, no problem, since the quantizer can be controlled in such a way that it quantizes even finer than required, by which more bits are required for quantizing. This is performed until the “outer” constant bit rate is reached.
More critical, however, is the case in which the “inner bit rate” of the coder is higher than the constant bit rate required at the output. This case can arise when the audio signal is difficult to code, that is when the coder has to devote many bits to coding the audio signal, which, in an illustrative way, can also be called a “high load” of the coder. For transform coding, there is the maxim that tonal pieces can be coded relatively efficiently, whereas noisy signals comprising relatively high amounts of energy and, in addition, a relatively complicated spectrum, such as voice or percussion or drum music, can be compressed to a relatively low degree only. Transient signals, that is signals comprising an irregular time characteristic, can likewise only be coded with relatively great effort when no coding artefacts are to be produced. In the case of transient signals, the windowing is switched from long windows to shorter windows to obtain a better time resolution, so that the quantizing noise “blurs” over a smaller number of audio sample values only. In the case of short windows, there is considerably more side information.
A coder which determines that the output bit rate is not sufficient and which has also “emptied” the bit reservoir has several possibilities to forcibly reduce its inner bit rate to meet the criterion of the constant output bit rate. One possibility is to dispense with switching to short windows. This, however, results in audible coding artefacts.
A further possibility is to deliberately violate the psychoacoustic masking threshold when quantizing, that is to quantize in a coarser way than required, to obtain a lower bit rate. This also results in audible disturbances.
A further possibility is to lower the audio bandwidth, that is to no longer code the whole audio bandwidth, but to set spectral values above a certain threshold frequency depending on the output bit rate to 0 to reduce the output bit rate. This method does not result in audible quantizing disturbances but leads to a loss in higher frequencies in the audio signal. This loss, however, is often not perceived as strongly as an audible quantizing noise.
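The bandwidth limitation can be sketched as follows, assuming that the spectral values are spread evenly between 0 Hz and half the sampling rate. The function name `limit_bandwidth`, the sampling rate and the threshold frequency are hypothetical and serve only as an illustration.

```python
def limit_bandwidth(spectrum, sample_rate_hz, cutoff_hz):
    """Set spectral values whose centre frequency exceeds cutoff_hz to 0.

    Assumes len(spectrum) bins spread evenly from 0 Hz to sample_rate_hz / 2.
    """
    n = len(spectrum)
    bin_width = (sample_rate_hz / 2) / n
    return [v if (k + 0.5) * bin_width <= cutoff_hz else 0.0
            for k, v in enumerate(spectrum)]


# Eight bins of 3000 Hz each at a 48 kHz sampling rate; keep up to 12 kHz.
spectrum = [1.0] * 8
limited = limit_bandwidth(spectrum, sample_rate_hz=48000, cutoff_hz=12000)
```

Runs of zero-valued spectral values can typically be represented with very few bits by the subsequent entropy coding, which is where the bit rate reduction comes from.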
A special problem in decoding stereo signals is an effect called “Stereo Unmasking”, which is explained briefly in the following. If a normal L/R coding is used, both the left channel and the right channel are transformed, quantized and coded individually, so that the quantizing noise introduced into the left channel and into the right channel for a data reduction is independent of the respective other channel. This means that the quantizing noise in the left channel and the quantizing noise in the right channel are not correlated. If the left and the right channel are relatively similar to each other, a listener will perceive the decoded signal in such a way that, for example, a speaker is in the center. The “Stereo Unmasking” effect is that, since the quantizing noise in the two channels is not correlated, the quantizing noise of the left channel is perceived on the left-hand side and the quantizing noise of the right channel is perceived on the right-hand side. A high masking of the noise, however, only takes place in the center, where the useful signal is, but not on the left-hand and the right-hand side.
M/S coding, apart from its data rate reducing effect, also has the advantage for certain signals that the quantizing noise in the left channel is correlated with the quantizing noise in the right channel, so that the quantizing noise is also perceived in the center, where it is masked by the useful signal either entirely or at least significantly better than in the uncorrelated case. The case in which the left and the right channel are relatively dissimilar is different. If an M/S coding is used in this case, the useful signal, due to the stereo effects, will be either on the left-hand side or on the right-hand side, while the quantizing noise is correlated due to the M/S coding and is perceived rather in the center. In this case, a stereo unmasking takes place as well.
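Why the quantizing noise becomes correlated can be made plausible by a short derivation. The symbols n_M and n_S for the quantizing noise added to the center and the side channel, and the hat notation for the decoded signals, are introduced here for illustration only; the reconstruction L = M + S, R = M − S is assumed, as results from the factor 0.5 in the M/S equations.

```latex
\begin{align*}
\hat{M} &= M + n_M, \qquad \hat{S} = S + n_S \\
\hat{L} &= \hat{M} + \hat{S} = L + (n_M + n_S) \\
\hat{R} &= \hat{M} - \hat{S} = R + (n_M - n_S)
\end{align*}
```

If the channels are similar, the side channel is small, and if its noise n_S is negligible as well, both decoded channels contain essentially the same noise n_M, which is therefore perceived in the center together with the useful signal.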
Lately, scalable audio coders have increasingly been examined. Scalable audio coders are arranged in such a way that their output bit stream comprises at least a first and a second scaling layer. A decoder of simple design takes only the first scaling layer from the scaled bit stream, this layer comprising, for example, a coded audio signal with a reduced bandwidth or an audio signal coded by a simple coding algorithm. A decoder of full design takes both the first scaling layer and the second scaling layer from the bit stream, decodes the first scaling layer by a first decoder and then decodes the second scaling layer as well, the latter, alone or together with the decoded first scaling layer, providing an audio signal with the full bandwidth.
Scalable coders are especially desired in the field of stereo signals, since in this case a mono signal, that is the center channel, can be used as the first scaling layer, while the side channel, for example, can be taken as the second scaling layer. A simple decoder or a decoder designed for quick operation will only provide the mono signal, while a better decoder, or a decoder in which the transmission speed is not the decisive criterion, will take the side layer in addition to the mono or center layer to generate a full stereo signal at the output of the decoder.
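The mono/stereo scalability can be illustrated by a minimal sketch in which the two scaling layers are represented as entries of a dictionary. The function name `decode_scalable`, the layer names and the values are hypothetical, and quantizing and entropy coding are omitted entirely.

```python
def decode_scalable(layers, want_stereo):
    """Decode a two-layer stream: 'mid' is the first scaling layer,
    'side' the optional second scaling layer."""
    mid = layers["mid"]
    if not want_stereo or "side" not in layers:
        # Simple decoder: only the first layer is used, giving mono.
        return {"mono": list(mid)}
    side = layers["side"]
    # Full decoder: L = M + S, R = M - S.
    return {"left": [m + s for m, s in zip(mid, side)],
            "right": [m - s for m, s in zip(mid, side)]}


stream = {"mid": [1.0, 0.375], "side": [0.0, 0.125]}
mono_out = decode_scalable(stream, want_stereo=False)
stereo_out = decode_scalable(stream, want_stereo=True)
```

The simple decoder never has to read the second layer, which is why this arrangement depends on the center channel being available, that is on M/S processing being used.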
There are various possibilities for the architecture of the scaling layers. The first scaling layer can differ from the second scaling layer or from any number of further scaling layers in the audio coding method itself, in the audio bandwidth, in the audio quality, in being mono or stereo, or in a combination of the named quality criteria or other conceivable criteria. For a high coding efficiency, the aim is that the second scaling layer comprises the smallest possible number of bits, or that a decoder decoding the second scaling layer also uses the first scaling layer as extensively as possible. When a scalable coder for stereo signals is considered which provides the center signal, that is the mono signal, as a first scaling layer and the side channel as a second scaling layer, it can be seen that its overall efficiency is the better, the more often the M/S coding is used. With certain stereo signals, however, namely stereo signals comprising a high stereo channel separation, this requirement conflicts with the bit efficiency. On the other hand, the M/S processing provides a certain “natural” scalability and results in a correlation of the quantizing noise in the left channel and in the right channel.
The problems mentioned relating to the M/S coding are all the more true, the more an audio signal to be coded suddenly changes its features relating to the M/S coding. If an audio signal to be coded suddenly no longer has the feature that the left channel is similar to the right channel, the M/S coding gain no longer applies. An increase in the quantizing disturbance possibly exceeding the psychoacoustic hearing threshold and/or a reduction of the audio bandwidth depending on the specific implementation of the coder will be the consequences.
This problem becomes noticeable in particular, though not exclusively, in scalable audio coding, and especially where the so-called mono/stereo scalability detailed above is used.