1. Field of the Invention
This invention relates generally to lossy, multi-channel audio compression and decompression, and more specifically to compression and decompression of downmixed, multi-channel audio signals in a manner that facilitates upmix of the received and decompressed multi-channel audio signals.
2. Description of the Related Art
Audio and audio-visual entertainment systems have progressed from humble beginnings, when they were capable only of reproducing monaural audio through a single speaker. Modern surround-sound systems are capable of recording, transmitting, and reproducing a plurality of channels through a plurality of speakers in a listener environment (which may be a public theater or a more private “home theater”). A variety of surround sound speaker arrangements are available: these go by such designations as “5.1 surround,” “7.1 surround,” and even “20.2 surround” (where the numeral to the right of the decimal point indicates the number of low frequency effects channels). For each such configuration, various physical arrangements of speakers are possible; but in general the best results will be realized if the rendering geometry is similar to the geometry presumed by the audio engineers who mix and master the recorded channels.
Because various rendering environments and geometries are possible beyond the prediction of the mixing engineers, and because the same content may be played back in diverse listening configurations or environments, the multiplicity of surround sound configurations presents numerous challenges to the engineer or artist wishing to deliver a faithful listening experience. Either a “channel-based” or (more recently) an “object-based” approach may be employed to deliver the surround sound listening experience.
In a channel-based approach, each channel is recorded with the intention that it should be rendered during playback on a corresponding speaker. The physical arrangement of the intended speakers is predetermined or at least approximately assumed during mixing. In contrast, in an object-based approach a plurality of independent audio objects are recorded, stored, and transmitted separately, preserving their synchronous relationship, but independent of any presumptions about the configuration or geometry of the intended playback speakers or environment. Examples of audio objects would be a single musical instrument, an ensemble section such as a viola section considered as a unitary musical voice, a human voice, or a sound effect. In order to preserve spatial relationships, the digital data representing the audio objects includes for each object certain data (“metadata”) symbolizing information associated with the particular sound source: for example, the vector direction, proximity, loudness, motion, and extent of the sound source can be symbolically encoded (preferably in a manner capable of time variation) and this information is transmitted or recorded along with the particular sound signal. The combination of an independent sound source waveform and the associated metadata constitutes an audio object (stored as an audio object file). This approach has the advantage that it can be rendered flexibly, in many different configurations; however, the burden is placed on the rendering processor (“engine”) to calculate the proper mix based on the geometry and configuration of the playback speakers and environment.
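As an illustration only, the pairing of a waveform with time-varying metadata described above might be represented as in the following Python sketch. The class names, field names, and units (`MetadataFrame`, `AudioObject`, `loudness_db`, and so on) are hypothetical and are not taken from any particular object-audio format; motion is assumed to be expressed by successive frames with changing direction and proximity.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class MetadataFrame:
    """One time-stamped description of a sound source (hypothetical fields)."""
    time_s: float                           # time at which this frame takes effect
    direction: Tuple[float, float, float]   # vector pointing toward the source
    proximity: float                        # distance of the source from the listener
    loudness_db: float                      # intended loudness
    extent: float                           # apparent size of the source


@dataclass
class AudioObject:
    """An independent waveform plus its associated metadata."""
    name: str
    sample_rate_hz: int
    samples: List[float]
    # Frames are assumed to be sorted by time_s.
    metadata: List[MetadataFrame] = field(default_factory=list)

    def metadata_at(self, time_s: float) -> Optional[MetadataFrame]:
        """Return the latest frame at or before time_s, or None if there is none."""
        current = None
        for frame in self.metadata:
            if frame.time_s <= time_s:
                current = frame
            else:
                break
        return current


# A source that moves from the left to the right over one second.
viola = AudioObject(
    name="viola section",
    sample_rate_hz=48000,
    samples=[0.0] * 48000,
    metadata=[
        MetadataFrame(0.0, (-1.0, 0.0, 0.0), 2.0, -12.0, 0.3),
        MetadataFrame(1.0, (1.0, 0.0, 0.0), 2.0, -12.0, 0.3),
    ],
)
frame = viola.metadata_at(0.5)
```

A rendering engine would consult such frames, together with the actual speaker layout, to compute the mix for each output; that rendering step is not shown here.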
In both channel-based and object-based approaches to audio, it is frequently desirable to transmit a downmixed signal (A+B) in such a way that the two independent channels or objects (A and B) may be separated (“upmixed”) during playback. One motivation to transmit a downmix might be to maintain backward compatibility, so that a downmixed program can be played on monaural, conventional two-channel stereo, or (more generally) on a system with fewer speakers than the number of channels or objects in the recorded program. In order to recover the higher plurality of channels or objects, an upmixing process is employed. For example, if one transmits the sum C=A+B of signals A and B, and if one also transmits B, then the receiver can easily reconstruct A, since (A+B)−B=A. Alternatively, one may transmit composite signals (A+B) and (A−B), then recover A and B by taking linear combinations of the transmitted composite signals. Many prior systems use variations of this “matrix mixing” approach. These are somewhat successful at recovering discrete channels or objects. However, when large numbers of channels or especially objects are summed, it becomes difficult to adequately reproduce individual discrete objects or channels without either artifacts or impractically high bandwidth requirements. Because object-based audio often involves very high numbers of independent audio objects, great difficulties are involved in effective upmixing to recover discrete objects from downmixed signals, particularly where data-rate (or more generally, bandwidth) is constrained.
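The sum-and-difference case mentioned above can be sketched in a few lines of Python. This is a minimal illustration under the assumption that the composite signals arrive unaltered (no lossy coding); the function names are hypothetical.

```python
def downmix_sum_diff(a, b):
    """Form the composite signals (A+B) and (A-B), sample by sample."""
    total = [x + y for x, y in zip(a, b)]
    diff = [x - y for x, y in zip(a, b)]
    return total, diff


def upmix_sum_diff(total, diff):
    """Recover A and B as linear combinations of the composites:
    A = ((A+B) + (A-B)) / 2 and B = ((A+B) - (A-B)) / 2.
    """
    a = [(s + d) / 2 for s, d in zip(total, diff)]
    b = [(s - d) / 2 for s, d in zip(total, diff)]
    return a, b


# Sample values chosen to be exactly representable in binary floating point.
a = [0.5, -0.25, 0.125, 0.0]
b = [0.25, 0.5, -0.75, 1.0]

total, diff = downmix_sum_diff(a, b)
a_rec, b_rec = upmix_sum_diff(total, diff)
```

A legacy monaural receiver would simply play `total` and ignore `diff`, while a two-channel receiver would apply `upmix_sum_diff` to both.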
In most practical systems for transmission or recording of digital audio, some method of data compression will be highly desirable. Data rate is always subject to some constraint, and it is always desired to transmit audio more efficiently. This consideration becomes increasingly important when a large number of channels are employed, either as discrete channels or upmixed. In the present application the term “compression” refers to methods of reducing the amount of data required to transmit or record audio signals, whether the result is data-rate reduction or file size reduction. (This definition should not be confused with dynamic range compression, which is also sometimes referred to as “compression” in other audio contexts not relevant here.)
Prior approaches to compressing downmixed signals generally adopt one of two methods: lossless coding or redundant description. Either can facilitate upmix after decompression, but both have drawbacks.
Lossless and Lossy Coding:
Assume A, B1, B2, . . . , Bm are independent signals (objects), which are encoded in a code stream and sent to a renderer. Distinguished object A will be referred to as the base object, while B=B1, B2, . . . , Bm will be referred to as regular objects. In an object-based audio system, we are interested in rendering objects simultaneously but independently, so that, for example, each object could be rendered at a different spatial location.
Backward compatibility is desirable: in other words, we require that the coded stream be interpretable by legacy systems that are neither object-based nor object-aware, or which are capable of fewer channels. Such systems can only render the composite object or channel C=A+B1+B2+ . . . +Bm from an encoded (compressed) version, E(C), of C. Therefore, we require that the code stream include E(C), followed by descriptions of the individual objects, which are ignored by the legacy systems. Thus, the code stream may consist of E(C) followed by descriptions E(B1), E(B2), . . . , E(Bm) of the regular objects. The base object A is then recovered by decoding these descriptions and setting A=C−B1−B2− . . . −Bm. It should be noted, however, that most audio codecs used in practice are lossy, meaning that the decoded version Q(X)=D(E(X)) of a coded object E(X) is only an approximation of X, and thus not necessarily identical to it. The accuracy of the approximation generally depends on the choice of codec and on the bandwidth (or storage space) available for the code stream. While a lossless encoding is possible, i.e. Q(X)=X, it usually requires significantly larger bandwidth or storage space than a lossy encoding. The latter, on the other hand, can still provide a high quality reproduction that may be perceptually indistinguishable from the original.
Redundant Description:
An alternative approach is to include an explicit encoding of certain privileged objects A in the code stream, which would therefore consist of E(C), E(A), E(B1), E(B2), . . . , E(Bm). Assuming E is lossy, this approach is likely to be more economical than using a lossless encoding, but is still not an efficient use of bandwidth. The approach is redundant, since E(C) is obviously correlated with the individually encoded objects E(A), E(B1), E(B2), . . . , E(Bm).
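The reason lossy coding complicates recovery by subtraction, as discussed under Lossless and Lossy Coding above, can be illustrated with a small Python sketch. Here a uniform quantizer stands in for the lossy round trip Q(X)=D(E(X)); this is an assumption made purely for illustration, since practical audio codecs are far more elaborate, and the signal values and function names are hypothetical. With the recovered base object taken as Q(C)−Q(B1)− . . . −Q(Bm), the per-sample error is the sum of m+1 individual coding errors, so with this quantizer it can grow to as much as (m+1) times half the step size.

```python
def quantize(signal, step):
    """Stand-in for a lossy codec round trip Q(X) = D(E(X)).

    Each sample is rounded to the nearest multiple of `step`, so the
    per-sample error is at most step / 2.
    """
    return [step * round(x / step) for x in signal]


def mix(signals):
    """Sum several equal-length signals sample by sample."""
    return [sum(samples) for samples in zip(*signals)]


def recover_base(decoded_c, decoded_regulars):
    """Estimate the base object as C - B1 - ... - Bm from decoded signals."""
    regular_sum = mix(decoded_regulars)
    return [c - b for c, b in zip(decoded_c, regular_sum)]


def max_abs_error(x, y):
    """Largest absolute per-sample difference between two signals."""
    return max(abs(p - q) for p, q in zip(x, y))


# Hypothetical base object A and regular objects B1, B2.
a = [0.30, -0.10, 0.55, 0.20]
b1 = [0.12, 0.40, -0.33, 0.05]
b2 = [-0.21, 0.07, 0.18, -0.44]
step = 0.25

c = mix([a, b1, b2])                       # composite C = A + B1 + B2
q_c = quantize(c, step)                    # Q(C)
q_regulars = [quantize(b, step) for b in (b1, b2)]

a_lossless = recover_base(c, [b1, b2])     # lossless case: Q(X) = X
a_lossy = recover_base(q_c, q_regulars)    # lossy case

lossless_error = max_abs_error(a_lossless, a)
lossy_error = max_abs_error(a_lossy, a)
error_bound = (len(q_regulars) + 1) * step / 2
```

In the lossless case the recovered base object matches A up to floating-point rounding, whereas in the lossy case the error in A accumulates contributions from every decoded signal used in the subtraction.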