1. Field of the Invention
The present invention relates to audio signal processing and particularly to multi-channel processing techniques based on generating a multi-channel reconstruction of an original multi-channel signal on the basis of at least one base channel and/or downmix channel and multi-channel additional information.
2. Description of the Related Art
Technologies currently in development allow ever more efficient transmission of audio signals by data reduction, but also an increase in listening pleasure through extensions, such as the use of multi-channel technology. Examples of such an extension of the common transmission techniques have recently become known under the names binaural cue coding (BCC) and “Spatial Audio Coding”, as described in J. Herre, C. Faller, S. Disch, C. Ertel, J. Hilbert, A. Hoelzer, K. Linzmeier, C. Sprenger, P. Kroon: “Spatial Audio Coding: Next Generation Efficient and Compatible Coding of Multi-Channel Audio”, 117th AES Convention, San Francisco 2004, Preprint 6186.
The following will discuss various techniques for reducing the data amount needed for the transmission of a multi-channel audio signal in more detail.
Such techniques are called joint stereo techniques. For this purpose, see FIG. 3 showing a joint stereo device 60. This device may be a device implementing, for example, the intensity stereo (IS) technique or the binaural cue coding technique (BCC). Such a device usually receives at least two channels CH1, CH2, . . . CHn as input signal and outputs a single carrier channel and parametric multi-channel information. The parametric data are defined so that an approximation of an original channel (CH1, CH2, . . . CHn) may be calculated in a decoder.
Normally, the carrier channel will include subband samples, spectral coefficients, time domain samples, etc., which provide a relatively fine representation of the underlying signal, while the parametric data do not include any such samples or spectral coefficients, but control parameters for controlling a determined reconstruction algorithm, such as weighting by multiplying, by time shifting, by frequency shifting, etc. The parametric multi-channel information thus includes a relatively rough representation of the signal or the associated channel. Expressed in numbers, the amount of data needed by a carrier channel is an amount of about 60 to 70 kbit/s, while the amount of data needed by parametric side information for a channel is in the range from 1.5 to 2.5 kbit/s. It is to be noted that the above numbers apply to compressed data. Of course, an uncompressed CD channel necessitates data rates in the order of about 10 times as much. An example of parametric data are the known scale factors, intensity stereo information or BCC parameters, as will be described below.
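The order of magnitude of these figures can be checked with a short calculation. The following is only an illustrative sketch; it assumes the standard CD parameters of 44.1 kHz sampling rate and 16 bits per sample, which the text above does not state explicitly, and all names are hypothetical.

```python
# Illustrative arithmetic only. The CD parameters below are an assumption
# (standard 44.1 kHz / 16 bit PCM); the text only says "about 10 times".
CD_SAMPLE_RATE_HZ = 44_100
CD_BITS_PER_SAMPLE = 16


def pcm_rate_kbits(sample_rate_hz, bits_per_sample):
    """Data rate of one uncompressed PCM channel in kbit/s."""
    return sample_rate_hz * bits_per_sample / 1000.0


cd_rate_kbits = pcm_rate_kbits(CD_SAMPLE_RATE_HZ, CD_BITS_PER_SAMPLE)

# Compressed carrier channel range quoted in the text (kbit/s).
carrier_low, carrier_high = 60.0, 70.0

# Ratio of the uncompressed rate to the compressed carrier rate.
ratio_low = cd_rate_kbits / carrier_high
ratio_high = cd_rate_kbits / carrier_low
```

Under these assumptions the uncompressed channel needs roughly 10 to 12 times the data rate of the compressed carrier channel, in line with the figure given above.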
The technique of intensity stereo coding is described in the AES preprint 3799 “Intensity Stereo Coding”, J. Herre, K. H. Brandenburg, D. Lederer, February 1994, Amsterdam. In general, the concept of intensity stereo is based on a main axis transform which is to be performed on data of both stereophonic audio channels. If most data points are concentrated around the first main axis, a coding gain may be achieved by rotating both signals by a determined angle prior to the coding. However, this does not apply to real stereophonic reproduction techniques. The technique is therefore modified in that the second orthogonal component is excluded from the transmission in the bit stream. The reconstructed signals for the left and the right channel thus consist of differently weighted or scaled versions of the same transmitted signal. Consequently, the reconstructed signals differ in amplitude but are identical with respect to their phase information. The energy-time envelopes of both original audio channels, however, are maintained by means of the selective scaling operation, which typically operates in a frequency-selective fashion. This corresponds to the human perception of sound at high frequencies, where the dominant spatial information is determined by the energy envelopes.
In addition, in practical implementations the transmitted signal, i.e. the carrier channel, is generated from the sum signal of the left channel and the right channel instead of the rotation of both components. Furthermore, this processing, i.e. the generation of intensity stereo parameters for performing the scaling operations, is performed in a frequency-selective way, i.e. independently for each scale factor band, i.e. for each encoder frequency partition. Advantageously, both channels are combined to form a combined or “carrier” channel, and the intensity stereo information is provided in addition to the combined channel. The intensity stereo information depends on the energy of the first channel, the energy of the second channel or the energy of the combined channel.
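The sum-based intensity stereo principle described above may be sketched as follows. This is a minimal illustration rather than the method of the cited preprint: it assumes that each scale factor band is given as a list of real-valued spectral values and that the scale factors are chosen as the square root of the ratio of channel energy to carrier energy. The function names are hypothetical.

```python
import math


def band_energy(values):
    """Energy of one scale factor band."""
    return sum(v * v for v in values)


def intensity_stereo_encode(left_bands, right_bands):
    """Form a sum carrier and one (left, right) scale pair per band."""
    carrier, scales = [], []
    for left, right in zip(left_bands, right_bands):
        combined = [a + b for a, b in zip(left, right)]
        e_c = band_energy(combined)
        if e_c > 0.0:
            # Assumed rule: match the energy of each original channel.
            g_left = math.sqrt(band_energy(left) / e_c)
            g_right = math.sqrt(band_energy(right) / e_c)
        else:
            # Carrier cancelled out (e.g. left == -right): nothing to scale.
            g_left = g_right = 0.0
        carrier.append(combined)
        scales.append((g_left, g_right))
    return carrier, scales


def intensity_stereo_decode(carrier, scales):
    """Rebuild both channels as scaled copies of the same carrier."""
    left = [[g_l * v for v in band] for band, (g_l, _) in zip(carrier, scales)]
    right = [[g_r * v for v in band] for band, (_, g_r) in zip(carrier, scales)]
    return left, right
```

In this sketch the two decoded channels share the phase of the carrier and differ only in their band-wise amplitude, while the energy of each original band is preserved.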
The BCC technique is described in the AES convention paper 5574 “Binaural Cue Coding applied to stereo and multi-channel audio compression”, C. Faller, F. Baumgarte, May 2002, Munich. In BCC coding, a number of audio input channels is converted to a spectral representation, namely using a DFT-based transform with overlapping windows. The resulting spectrum is divided into non-overlapping partitions, each of which has an index. Each partition has a bandwidth proportional to the equivalent rectangular bandwidth (ERB). The inter-channel level differences (ICLD) and the inter-channel time differences (ICTD) are determined for each partition and for each frame k. The ICLD and ICTD are quantized and coded and finally enter a BCC bit stream as side information. The inter-channel level differences and the inter-channel time differences are given for each channel relative to a reference channel. The parameters are calculated according to predetermined formulae depending on the particular partitions of the signal to be processed.
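The two cues may be illustrated for a single partition and a single frame as follows. The sketch does not reproduce the predetermined formulae of the cited paper; it assumes a level difference expressed in dB as a power ratio relative to the reference channel, and a time difference estimated as the lag of the cross-correlation maximum. All names are hypothetical.

```python
import math


def icld_db(ref_power, ch_power, eps=1e-12):
    """Level difference of a channel relative to the reference, in dB.

    eps is a small assumed floor that avoids log(0) for silent partitions.
    """
    return 10.0 * math.log10((ch_power + eps) / (ref_power + eps))


def ictd_samples(ref, ch, max_lag):
    """Lag in samples at which ch correlates best with ref.

    A positive result means ch is delayed relative to ref.
    """
    best_lag, best_val = 0, float("-inf")
    n = len(ref)
    for lag in range(-max_lag, max_lag + 1):
        acc = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                acc += ref[i] * ch[j]
        if acc > best_val:
            best_lag, best_val = lag, acc
    return best_lag
```

An encoder following the description above would evaluate such cues for every partition and every frame before quantizing and coding them.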
On the decoder side, the decoder normally receives a mono signal and the BCC bit stream. The mono signal is transformed to the frequency domain and input into a spatial synthesis block also receiving decoded ICLD and ICTD values. In the spatial synthesis block, the BCC parameters (ICLD and ICTD) are used to perform a weighting operation of the mono signal to synthesize the multi-channel signals which, after a frequency/time conversion, represent a reconstruction of the original multi-channel audio signal.
In the case of BCC, the joint stereo module 60 operates to output the channel side information so that the parametric channel data are quantized and coded ICLD or ICTD parameters, wherein one of the original channels is used as reference channel for coding the channel side information.
Normally, the carrier signal is formed of the sum of the participating original channels.
Of course, the above techniques only provide a mono representation for a decoder which is only able to process the carrier channel, but which is not capable of processing the parametric data for generating one or more approximations of more than one input channel.
The BCC technique is also described in the US patent publications US 2003/0219130 A1, US 2003/0026441 A1 and US 2003/0035553 A1. In addition, see the specialist publication “Binaural Cue Coding. Part II: Schemes and Applications”, C. Faller and F. Baumgarte, IEEE Trans. On Audio and Speech Proc., vol. 11, no. 6, November 2003.
In the following, a typical BCC scheme for multi-channel audio coding will be presented in more detail with reference to FIGS. 4 to 6.
FIG. 5 shows such a BCC scheme for coding/transmission of multi-channel audio signals. The multi-channel audio input signal at an input 110 of a BCC encoder 112 is mixed down in a so called downmix block 114. In this example, the original multi-channel signal at the input 110 is a 5 channel surround signal having a front left channel, a front right channel, a left surround channel, a right surround channel, and a center channel. In the embodiment of the present invention, the downmix block 114 generates a sum signal by simple addition of these five channels into a mono signal.
Other downmixing schemes are known in the art, so that a downmix channel with a single channel is obtained using a multi-channel input signal.
This single channel is output on a sum signal line 115. Side information obtained by the BCC analysis block 116 is output on a side information line 117.
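The simple additive downmix mentioned above may be sketched as follows. The sketch assumes that all channels of one block are available as equally long lists of time domain samples; the function name is hypothetical, and no gain normalization is applied because the text does not specify one.

```python
def downmix_to_mono(channels):
    """Sum all input channels sample by sample into one mono block."""
    if not channels:
        raise ValueError("at least one channel is required")
    if len({len(c) for c in channels}) != 1:
        raise ValueError("all channels must have the same block length")
    return [sum(samples) for samples in zip(*channels)]
```

For a 5 channel surround block, the list would hold the front left, front right, left surround, right surround and center channels.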
In the BCC analysis block, inter-channel level differences (ICLD) and inter-channel time differences (ICTD) are calculated as described above. Recently, the BCC analysis block 116 has also become capable of calculating inter-channel correlation values (ICC values). The sum signal and the side information are transmitted to a BCC decoder 120 in a quantized and coded format. The BCC decoder splits the transmitted sum signal into a number of subbands and performs scalings, delays and other processing steps to provide the subbands of the multi-channel audio channels to be output. This processing is performed so that the ICLD, ICTD and ICC parameters (cues) of a reconstructed multi-channel signal at output 121 match the corresponding cues for the original multi-channel signal at input 110 in the BCC encoder 112. For this purpose, the BCC decoder 120 includes a BCC synthesis block 122 and a side information processing block 123.
The following will illustrate the internal structure of the BCC synthesis block 122 with respect to FIG. 6. The sum signal on the line 115 is fed to a time/frequency conversion unit or filter bank FB 125. At the output of block 125, there is a number N of subband signals or, in an extreme case, a block of spectral coefficients, if the audio filter bank 125 performs a 1:1 transform, i.e. a transform generating N spectral coefficients from N time domain samples.
The BCC synthesis block 122 further includes a delay stage 126, a level modification stage 127, a correlation processing stage 128, and an inverse filter bank stage IFB 129. At the output of stage 129, the reconstructed multi-channel audio signal having, for example, five channels in the case of a 5 channel surround system may be output to a set of loudspeakers 124, as illustrated in FIG. 5 or FIG. 4.
The input signal sn is converted to the frequency domain or the filter bank domain by means of element 125. The signal output by element 125 is copied such that several versions of the same signal are obtained, as illustrated by the copy node 130. The number of versions of the original signal is equal to the number of output channels in the output signal. Then each version of the original signal at the node 130 is subjected to a determined delay d1, d2, . . . , di, . . . , dN in the delay stage 126. The delay parameters are calculated by the side information processing block 123 in FIG. 5 and derived from the inter-channel time differences as they were calculated by the BCC analysis block 116 of FIG. 5.
The same applies to the multiplication parameters a1, a2, . . . ai, . . . , aN, which are also calculated by the side information processing block 123 based on the inter-channel level differences as calculated by the BCC analysis block 116.
The ICC parameters calculated by the BCC analysis block 116 are used for controlling the functionality of block 128 so that determined correlations between the delayed and level-manipulated signals are obtained at the outputs of block 128. It is to be noted that the order of the stages 126, 127, 128 may be different from the order shown in FIG. 6.
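The copy, delay and level modification steps may be sketched for one subband signal as follows. This is a simplified illustration under stated assumptions: the delays are whole numbers of subband samples, the correlation processing of block 128 is omitted, and the function names are hypothetical.

```python
def delay_signal(signal, d):
    """Delay a signal by d >= 0 samples while keeping its length."""
    if d < 0:
        raise ValueError("delay must be non-negative")
    return ([0.0] * d + list(signal))[:len(signal)]


def synthesize_subband(subband, delays, gains):
    """Create one delayed and scaled copy of the subband per output channel."""
    if len(delays) != len(gains):
        raise ValueError("one delay and one gain per output channel required")
    return [
        [a * v for v in delay_signal(subband, d)]
        for d, a in zip(delays, gains)
    ]
```

Since the order of the stages may differ, delay and scaling could equally be applied in the opposite order in this sketch without changing the result.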
It is to be noted that, in a framewise processing of the audio signal, the BCC analysis is also performed framewise, i.e. variable in time, and that there is further obtained a frequency-wise BCC analysis, as apparent from the filter bank division of FIG. 6. This means that the BCC parameters are obtained for each spectral band. This means further that, in the case in which the audio filter bank 125 splits the input signal into, for example, 32 bandpass signals, the BCC analysis block obtains a set of BCC parameters for each of the 32 bands. Of course, the BCC synthesis block 122 of FIG. 5, illustrated in detail in FIG. 6, performs a reconstruction also based on the 32 bands given by way of example.
With reference to FIG. 4, the following will present a scenario used to determine individual BCC parameters. Normally, the ICLD, ICTD and ICC parameters may be defined between channel pairs. However, it is advantageous to determine the ICLD and ICTD parameters between a reference channel and each other channel. This is illustrated in FIG. 4A.
ICC parameters may be defined in various ways. Generally speaking, ICC parameters may be determined in the encoder between any channel pairs, as illustrated in FIG. 4B. However, there has been the suggestion to calculate only ICC parameters between the strongest two channels at one time, as illustrated in FIG. 4C, which shows an example in which, at one time, an ICC parameter between the channels 1 and 2 is calculated, and at another time, an ICC parameter between the channels 1 and 5 is calculated. The decoder then synthesizes the inter-channel correlation between the strongest channels in the decoder and uses certain heuristic rules for calculating and synthesizing the inter-channel coherence for the remaining channel pairs.
With respect to the calculation of, for example, the multiplication parameters a1, . . . , aN based on the transmitted ICLD parameters, reference is made to the AES convention paper no. 5574. The ICLD parameters represent an energy distribution of an original multi-channel signal. Without loss of generality, it is advantageous, as shown in FIG. 4A, to take four ICLD parameters representing the energy difference between the respective channels and the front left channel. In the side information processing block 123, the multiplication parameters a1, . . . , aN are derived from the ICLD parameters so that the total energy of all reconstructed output channels is the same as (or proportional to) the energy of the transmitted sum signal.
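One possible way of deriving energy-preserving multiplication parameters from level differences is sketched below. It is not taken from the cited paper: it assumes that the ICLD values are given in dB relative to a reference channel and that all output channels are scaled copies of the same sum signal, so that the squared gains have to add up to one. The function name is hypothetical.

```python
import math


def gains_from_icld(icld_db_values):
    """Derive gains a1..aN from N-1 level differences given in dB.

    The first returned gain belongs to the reference channel. The gains
    are normalized so that the sum of their squares equals 1.
    """
    # Relative power of each channel; the reference channel has power 1.
    rel_power = [1.0] + [10.0 ** (d / 10.0) for d in icld_db_values]
    total = sum(rel_power)
    return [math.sqrt(p / total) for p in rel_power]
```

With four ICLD values, as in the five-channel example, the function returns five gains; scaling the sum signal by these gains leaves the total output energy equal to the energy of the sum signal.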
Generally, at least one base channel and the side information are generated in such parametric multi-channel coding schemes, as apparent from FIG. 5. Typically, block-based schemes are used in which, as also apparent from FIG. 5, the original multi-channel signal at input 110 is subjected to a block processing by a block stage 111 such that the downmix signal and/or sum signal and/or the at least one base channel for this block is formed from a block of, for example, 1152 samples, while, at the same time, the corresponding multi-channel parameters are generated for this block by the BCC analysis. After the downmix, the sum signal is typically coded again with a block-based encoder, such as an MP3 encoder or an AAC encoder, to obtain a further data rate reduction. Likewise, the parameter data are coded, for example by difference coding, scaling/quantizing and entropy coding. Generally, the fingerprint generator is formed to perform a quantization and entropy coding of fingerprint values to obtain the fingerprint information.
Then, at the output of the entire encoder, including the BCC encoder 112 and a downstream base channel encoder, a common data stream is written in which a block of the at least one base channel follows a previous block of the at least one base channel, and in which the coded multi-channel additional information are also inserted, for example by a bit stream multiplexer.
This insertion is done so that the data stream of base channel data and multi-channel additional information includes a block of base channel data and includes a block of multi-channel additional data in association with this block, which then form, for example, a common transmission frame. This transmission frame is then sent to a decoder via a transmission path.
On the input side, the decoder again includes a data stream demultiplexer to split a frame of the data stream into a block of base channel data and a block of associated multi-channel additional information. Then the block of base data is decoded, for example by an MP3 decoder or an AAC decoder. This block of decoded base data is then supplied to the BCC decoder 120 together with the block of multi-channel additional information, which may also be decoded.
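The association of a base channel block with its multi-channel additional information in a common transmission frame may be sketched as follows. The frame layout used here (two 16-bit big-endian length fields followed by the two blocks) is purely hypothetical and is not the format of any actual codec or of the scheme described above.

```python
import struct

HEADER = struct.Struct(">HH")  # assumed layout: base length, side length


def mux_frame(base_block: bytes, side_block: bytes) -> bytes:
    """Write one frame holding a base block and its side information."""
    if len(base_block) > 0xFFFF or len(side_block) > 0xFFFF:
        raise ValueError("block too large for the assumed 16-bit length field")
    return HEADER.pack(len(base_block), len(side_block)) + base_block + side_block


def demux_frame(frame: bytes):
    """Split one frame back into (base_block, side_block)."""
    base_len, side_len = HEADER.unpack(frame[:HEADER.size])
    start = HEADER.size
    if len(frame) != start + base_len + side_len:
        raise ValueError("frame length does not match its header")
    return frame[start:start + base_len], frame[start + base_len:]
```

Because both blocks travel in the same frame, a decoder using such a demultiplexer obtains the side information belonging to a base block without any separate synchronization step.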
In that way, the time association of the additional information with the base channel data is set automatically due to the common transmission of base channel data and additional information and may readily be recovered by a decoder operating in a framewise fashion. The decoder thus automatically finds, as it were, the additional information associated with a block of base channel data due to the common transmission of the two data types in a single data stream so that a high quality multi-channel reconstruction is possible. Thus, there will be no problem of the multi-channel additional information having a time offset with respect to the base channel data. If, however, there were such an offset, this would result in a significant quality loss of the multi-channel reconstruction, because in that case a block of base channel data would be processed together with multi-channel additional data that do not belong to this block of base data, but, for example, to a previous or later block.
Such a scenario in which the association between multi-channel additional data and base channel data is no longer given will occur when no common data stream is written, but when there is a distinct data stream with the base channel data and there is another data stream separate therefrom with the multi-channel additional information. Such a situation may occur, for example, in a transmission system operating sequentially, such as radio or internet. Here, the audio program to be transmitted is divided into audio base data (mono or stereo downmix audio signal) and extension data (multi-channel additional information) which are emitted individually or in a combined fashion. Even if the two data streams are sent out by a transmitter still synchronous in time, a lot of “surprises” may be lurking on the transmission path to the receiver which result in the data stream with the multi-channel additional data, which is substantially more compact with respect to the number of bits, being transmitted, for example, faster to a receiver than the data stream with the base channel data.
Furthermore, it is advantageous to use encoders/decoders with non-constant output data rate to achieve a particularly good bit efficiency. Here, it cannot be predicted how long the decoding of a block of base channel data will take. Furthermore, this processing also depends on the hardware components actually used for decoding, as they have to be present, for example, in a PC or digital receiver. Furthermore, there are also uncertainties inherent in the system and/or the algorithm, because, particularly in the bit reservoir technique, a constant output data rate is generated on average, but, locally speaking, bits not needed for a particularly well codable block are saved to be withdrawn from the bit reservoir for another block that is particularly difficult to code, because the audio signal is, for example, particularly transient.
On the other hand, the separation of the above described common data stream into two individual data streams has special advantages. For example, a classic receiver, i.e. for example a pure mono or stereo receiver, is capable of receiving and reproducing the audio base data at any time independent of content and version of the multi-channel additional information. The division into separate data streams thus ensures the backward compatibility of the whole concept.
In contrast, a receiver of the newer generation may evaluate these multi-channel additional data and combine them with the audio base data so that the complete extension, here the multi-channel sound, is provided to the user.
A particularly interesting application scenario of the separate transmission of audio base data and extension data exists in digital radio. Here, the multi-channel additional information helps to extend the stereo audio signal emitted up to now to a multi-channel format, such as 5.1, by little additional transmission effort. Here, the program provider generates the multi-channel additional information on the transmitter side from multi-channel sound sources, as they are to be found, for example, on DVD audio/video. Subsequently, this multi-channel additional information is transmitted in parallel to the audio stereo signal emitted as usual, which, however, now is not simply a stereo signal, but includes two base channels that have been derived from the multi-channel signal by some downmix. For the listener, however, the stereo signal of the two base channels sounds like a usual stereo signal, because, in the multi-channel analysis, there are finally taken steps similar to those having been taken by a sound master that mixed a stereo signal from several tracks.
A great advantage of the separation consists in the compatibility with the already existing digital radio transmission systems. A classic receiver that is not able to evaluate this additional information will be able to receive and reproduce the two-channel sound signal as usual without any qualitative restrictions. A receiver of newer design, however, may evaluate this multi-channel information in addition to the stereo sound signal previously received, decode it and reconstruct the original 5.1 multi-channel signal therefrom.
In order to allow the simultaneous transmission of the multi-channel additional information as a supplement to the stereo signal previously used, it is possible, as already mentioned, to combine the multi-channel additional information with the coded downmix audio signal for a digital radio system, i.e. that there is a single data stream which is then scalable, if necessary, and may also be read by an existing receiver which, however, ignores the additional data with respect to the multi-channel additional information.
The receiver thus also only sees a (valid) audio data stream and, if it is a receiver of newer design, may further extract the multi-channel sound additional information from the data stream via a corresponding upstream data distributor again synchronously to the associated audio data block, decode it and output it as 5.1 multi-channel sound.
The disadvantage of this approach, however, is that the existing infrastructure and/or the existing data paths have to be extended so that they may transport the data signals composed of downmix signals and extension data instead of only the stereo audio signals as previously. So, if the standard transmission format for stereo data is abandoned, the synchronism may be guaranteed by the common data stream also in radio transmissions.
However, it is a big obstacle to a breakthrough on the market if existing radio infrastructures have to be changed, i.e. if the problem does not only exist on the side of the decoder, but also on the side of the radio transmitters and the standardized transmission protocols. This concept is thus very disadvantageous due to the difficulty of changing a system once it has been standardized and implemented.
The other alternative is not to couple the multi-channel additional information to the audio coding system used and thus not to insert it into the actual audio data stream. In this case, the transmission is done via a distinct parallel digital additional channel, which, however, is not necessarily synchronized in time. This situation may occur when the downmix data are passed in unreduced form through a usual audio distribution infrastructure existing in studios, for example as PCM data in the AES/EBU data format. These infrastructures are designed to digitally distribute audio signals between diverse sources. For this purpose, functional units known as “cross rails” are usually used. Alternatively or additionally, audio signals are also processed in the PCM format for reasons of sound regulation and dynamic compression. All these steps result in incalculable delays on a path from the transmitter to the receiver.
On the other hand, the separate transmission of base channel data and multi-channel additional information is particularly interesting because existing stereo infrastructures do not have to be changed, i.e. the disadvantages of non-conformity with the standards described with respect to the first possibility do not apply here. A radio system only has to transmit an additional channel, but does not have to change the infrastructure for the already existing stereo channel. The additional effort is thus carried only, as it were, on the side of the receivers, but in a way that there is backward compatibility, i.e. that a user having a new receiver gets better sound quality than a user having an old receiver.
As already discussed, the order of magnitude of the time shift cannot be determined any more from the received audio signal and the additional information. Thus a reconstruction and association of the multi-channel signal that are correct in time are no longer guaranteed in the receiver. A further example of such a delay problem is when an already running two-channel transmission system is to be extended to multi-channel transmission, for example in a receiver of a digital radio. Here, it is often the case that the decoding of the downmix signal is done by means of a two-channel audio decoder already present in the receiver, whose delay time is not known and thus cannot be compensated. In an extreme case, the downmix audio signal may even reach the multi-channel reconstruction audio decoder via a transmission chain containing analog parts, i.e. that a digital/analog conversion is done at one point and that, after further storage/transmission, there is again an analog/digital conversion. Something like that occurs in radio transmission. Also, initially no clues are available as to how a suitable delay compensation of the downmix signal may be performed relative to the multi-channel additional data. Also, if the sample frequency for the A/D conversion and the sample frequency for the D/A conversion differ slightly from each other, there will be a slow time drift of the necessary compensation delay corresponding to the ratio of the two sample rates to each other.
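The slow time drift caused by slightly different sample rates can be estimated with a short calculation. The sketch below is illustrative only; the sample rates used in the example are assumed values that do not appear in the text, and the function name is hypothetical.

```python
def drift_seconds(elapsed_s, f_da_hz, f_ad_hz):
    """Offset accumulated after elapsed_s seconds of playback.

    The signal is played out at f_da_hz and captured again at f_ad_hz.
    The offset is expressed in seconds at the D/A sample rate; a positive
    value means the A/D side produces more samples than were played out.
    """
    return elapsed_s * (f_ad_hz / f_da_hz - 1.0)


# Assumed example: nominal 48 kHz with the A/D clock running 1 Hz fast.
one_hour_drift = drift_seconds(3600.0, 48_000.0, 48_001.0)
```

With these assumed values the compensation delay changes by about 75 ms per hour, so a delay that has been determined once would not remain valid without tracking.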
For the synchronization of the additional data to the base data, various techniques may be used that are known by the term “time synchronization methods”. They are based on inserting time stamps into both data streams such that, based on these time stamps, a correct association of the data associated with each other may be achieved in the receiver. The insertion of time stamps, however, already results in a change of the normal stereo infrastructure.
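The association by time stamps may be sketched as follows. The sketch assumes that each block of both data streams carries an identical time stamp value when the blocks belong together; the data layout and the function name are hypothetical. As noted above, such a scheme presupposes that time stamps can be inserted into the base data stream at all.

```python
def associate_by_timestamp(base_blocks, side_blocks):
    """Pair base blocks with side information carrying the same time stamp.

    Both arguments are iterables of (timestamp, payload) tuples that may
    arrive in different order. Base blocks without matching side
    information are skipped in this sketch.
    """
    side_by_ts = {ts: payload for ts, payload in side_blocks}
    return [
        (ts, base, side_by_ts[ts])
        for ts, base in base_blocks
        if ts in side_by_ts
    ]
```

A receiver could use the resulting triples to feed each base block together with its matching additional information into the multi-channel reconstruction.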