The present invention relates to audio encoding and audio decoding, in particular to an encoding and decoding scheme that selectively extracts and/or transmits phase information when reconstruction of such information is perceptually relevant.
Recent parametric multi-channel coding schemes such as binaural cue coding (BCC), parametric stereo (PS) or MPEG Surround (MPS) use a compact parametric representation of the human auditory system's cues for spatial perception. This allows for a rate-efficient representation of an audio signal having two or more audio channels. To this end, an encoder performs a down-mix from M input channels to N output channels and transmits the extracted cues together with the down-mix signal. The cues are furthermore quantized according to the principles of human perception; that is, information which is not audible or distinguishable by the human auditory system may be deleted or coarsely quantized.
As the downmix signal is a “generic” audio signal, the bandwidth consumed by such an encoded representation of an original audio signal may be further decreased by compressing the downmix signal, or the channels of the downmix signal, using single-channel audio compressors. The various types of such single-channel audio compressors are collectively referred to as core coders in the following paragraphs.
Typical cues used to describe the spatial interrelation between two or more audio channels are interchannel level differences (ILD), parametrizing level relations between input channels; interchannel cross correlations/coherences (ICC), parametrizing the statistical dependency between input channels; and interchannel time/phase differences (ITD or IPD), parametrizing the time or phase difference between similar signal segments of input channels.
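For illustration only, the three cue types may be estimated for one frequency band of a two-channel signal roughly as sketched below. The function name `spatial_cues`, the use of complex subband samples as input, and the particular ICC definition (magnitude of the normalized cross-spectrum) are assumptions made for this sketch; actual coding schemes differ in their exact definitions.

```python
import cmath
import math


def spatial_cues(left, right, eps=1e-12):
    """Estimate (ILD in dB, ICC, IPD in radians) for one frequency band.

    `left` and `right` are sequences of complex subband samples of the
    two channels for the same band and time segment (an assumption of
    this sketch). `eps` guards against division by zero for silent bands.
    """
    energy_left = sum(abs(x) ** 2 for x in left)
    energy_right = sum(abs(x) ** 2 for x in right)
    # Cross-spectrum: left times the complex conjugate of right.
    cross = sum(l * r.conjugate() for l, r in zip(left, right))

    # ILD: energy ratio of the two channels on a logarithmic scale.
    ild_db = 10.0 * math.log10((energy_left + eps) / (energy_right + eps))
    # ICC: normalized cross-spectrum magnitude, between 0 and 1.
    icc = abs(cross) / math.sqrt((energy_left + eps) * (energy_right + eps))
    # IPD: phase angle of the cross-spectrum.
    ipd = cmath.phase(cross)
    return ild_db, icc, ipd


if __name__ == "__main__":
    # Right channel: half the amplitude of the left, lagging by 45 degrees.
    left = [cmath.exp(1j * 0.1 * n) for n in range(64)]
    right = [0.5 * x * cmath.exp(-1j * math.pi / 4) for x in left]
    print(spatial_cues(left, right))
```

In this usage example the right channel is a scaled and phase-shifted copy of the left channel, so the channels are fully coherent and the estimated IPD equals the applied phase shift.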
To maintain a high perceptual quality of the signals represented by a down-mix and the previously described cues, individual cues are normally calculated for different frequency bands. That is, for a given time segment of the signal, multiple cues parametrizing the same property are transmitted, each cue-parameter representing a predetermined frequency band of the signal.
The cues may be calculated in a time- and frequency-dependent manner on a scale close to the frequency resolution of human hearing. Whenever multi-channel audio signals are represented, a corresponding decoder performs an upmix from N to M channels based on the transmitted spatial cues and the transmitted downmix signals (the transmitted downmix therefore often being called the carrier signal).
Generally, a resulting upmix channel may be described as a level- and phase-weighted version of the transmitted downmix. The decorrelation measured while encoding the signals may be synthesized by mixing and weighting the transmitted downmix signal (the “dry” signal) with a decorrelated signal (the “wet” signal) derived from the downmix signal, as indicated by the transmitted correlation parameters (ICC). The upmixed channels then have a correlation with respect to each other similar to that of the original channels. A decorrelated signal (i.e. a signal having a cross correlation coefficient close to zero when cross-correlated with the transmitted signal) may be produced by feeding the downmix to a chain of filters, such as all-pass filters and delay lines. However, further ways of deriving a decorrelated signal may be used.
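As a minimal sketch of such a dry/wet synthesis, assuming a wet signal that is uncorrelated with the dry signal and has the same energy, the two signals may be mixed with a rotation angle derived from the ICC. The function names `upmix_dry_wet` and `correlation` and the angle mapping are assumptions of this sketch rather than the method of any particular standard; level weighting is omitted, and a cosine stands in for the output of a real decorrelator.

```python
import math


def upmix_dry_wet(dry, wet, icc):
    """Create two output channels whose mutual correlation is about `icc`.

    Assumes `wet` is uncorrelated with `dry` and has the same energy.
    The wet signal is added with opposite signs to the two outputs.
    """
    icc = max(-1.0, min(1.0, icc))   # keep acos within its domain
    alpha = 0.5 * math.acos(icc)     # mixing angle: cos(2 * alpha) == icc
    c, s = math.cos(alpha), math.sin(alpha)
    left = [c * d + s * w for d, w in zip(dry, wet)]
    right = [c * d - s * w for d, w in zip(dry, wet)]
    return left, right


def correlation(a, b):
    """Normalized cross-correlation coefficient at lag zero."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den


if __name__ == "__main__":
    n = 1000
    # Sine as the dry signal; a cosine of the same frequency is orthogonal
    # to it over whole periods and stands in for a decorrelator output.
    dry = [math.sin(2 * math.pi * 5 * k / n) for k in range(n)]
    wet = [math.cos(2 * math.pi * 5 * k / n) for k in range(n)]
    left, right = upmix_dry_wet(dry, wet, 0.6)
    print(correlation(left, right))
```

Under the stated assumptions the cross terms between dry and wet cancel, so the correlation of the two outputs equals cos(2·alpha), which is the requested ICC, while the energy of each output equals that of the dry signal.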
Evidently, in any particular implementation of the above encoding/decoding scheme, a trade-off has to be made between the transmitted bitrate (ideally as low as possible) and the achievable quality (ideally as high as possible) of the encoded signal.
It may, therefore, be decided not to transmit a full set of spatial cues, but to omit transmission of one particular parameter. This decision may additionally be influenced by the selection of an appropriate upmix. An appropriate upmix could, for example, reproduce on average a spatial cue that is not transmitted. That is, at least for a long-term segment of the full-bandwidth signal, the average spatial property is preserved.
In particular, not all parametric multi-channel schemes make use of interchannel time or interchannel phase differences, thus avoiding the respective calculation and synthesis. Schemes like MPEG Surround rely on synthesis of ILDs and ICCs only. The interchannel phase differences are implicitly approximated by the decorrelation synthesis, which mixes two representations of the decorrelated signal into the transmitted downmix signal, wherein the two representations have a relative phase shift of 180°. Transmission of IPDs is omitted, which reduces the amount of parametric information at the cost of a degradation in reproduction quality.