Conventional channel-based audio encoders typically operate under the assumption that each audio program (that is output by the encoder) will be reproduced by an array of loudspeakers in predetermined positions relative to a listener. Each channel of the program is a speaker channel. This type of audio encoding is commonly referred to as channel-based audio encoding.
Another type of audio encoder (known as an object-based audio encoder) implements an alternative type of audio coding known as audio object coding (or object based coding), and operates under the assumption that each audio program (that is output by the encoder) may be rendered for reproduction by any of a large number of different arrays of loudspeakers. Each audio program output by such an encoder is an object based audio program, and typically, each channel of such object based audio program is an object channel. In audio object coding, audio signals associated with distinct sound sources (audio objects) are input to the encoder as separate audio streams. Examples of audio objects include (but are not limited to) a dialog track, a single musical instrument, and a jet aircraft. Each audio object is associated with spatial parameters, which may include (but are not limited to) source position, source width, and source velocity and/or trajectory. The audio objects and associated parameters are encoded for distribution and storage. Final audio object mixing and rendering is performed at the receiving end of the audio storage and/or distribution chain, as part of audio program playback. The audio object mixing and rendering step is typically based on knowledge of the actual positions of the loudspeakers to be employed to reproduce the program.
Typically, during generation of an object based audio program, the content creator embeds the spatial intent of the mix (e.g., the trajectory of each audio object determined by each object channel of the program) by including metadata in the program. The metadata can be indicative of the position or trajectory of each audio object determined by each object channel of the program, and/or at least one of the size, velocity, type (e.g., dialog or music), and another characteristic of each such object.
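The pairing of an object channel with its spatial metadata can be sketched as a simple data structure. The class and field names below are hypothetical, chosen only for illustration; no particular codec's metadata layout is implied:

```python
from dataclasses import dataclass

@dataclass
class ObjectMetadata:
    # Hypothetical metadata fields mirroring the kinds described above:
    # trajectory as (time, x, y, z) keyframes; one entry = stationary source.
    trajectory: list
    size: float = 0.0          # apparent source width
    velocity: float = 0.0      # speed along the trajectory
    kind: str = "generic"      # e.g. "dialog" or "music"

@dataclass
class AudioObject:
    samples: list              # the object's mono audio stream
    metadata: ObjectMetadata

# Example: a dialog object whose position pans left to right over 2 seconds.
dialog = AudioObject(
    samples=[0.0] * 48000,
    metadata=ObjectMetadata(
        trajectory=[(0.0, -1.0, 1.0, 0.0), (2.0, 1.0, 1.0, 0.0)],
        kind="dialog",
    ),
)
```

In such a scheme the audio samples stay untouched by the creator's spatial intent; the intent travels entirely in the metadata, to be interpreted at playback.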
During rendering of an object based audio program, each object channel can be rendered (“at” a time-varying position having a desired trajectory) by generating speaker feeds indicative of content of the channel and applying the speaker feeds to a set of loudspeakers (where the physical position of each of the loudspeakers may or may not coincide with the desired position at any instant of time). The speaker feeds for a set of loudspeakers may be indicative of content of multiple object channels (or a single object channel). The rendering system typically generates the speaker feeds to match the exact hardware configuration of a specific reproduction system (e.g., the speaker configuration of a home theater system, where the rendering system is also an element of the home theater system).
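The generation of speaker feeds indicative of the content of multiple object channels can be sketched as a gain-weighted mix. The function below is an illustrative sketch only; how the per-object, per-speaker gains are obtained is a separate rendering step:

```python
# Sketch: mix several object channels into speaker feeds, given a
# per-object gain for each loudspeaker (gain computation not shown).
def mix_speaker_feeds(objects, gains, n_speakers):
    """objects: list of sample lists; gains[i][k] is the gain of
    object i into speaker k. Returns one feed (sample list) per speaker."""
    n = len(objects[0])
    feeds = [[0.0] * n for _ in range(n_speakers)]
    for obj, g in zip(objects, gains):
        for k in range(n_speakers):
            if g[k] == 0.0:
                continue  # object contributes nothing to this speaker
            feed = feeds[k]
            for t, x in enumerate(obj):
                feed[t] += g[k] * x
    return feeds
```

Each feed is a sum over object channels, so a single feed may carry content of many objects, or of just one, exactly as described above.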
In the case that an object based audio program indicates a trajectory of an audio object, the rendering system would typically generate speaker feeds for driving a set of loudspeakers to emit sound intended to be perceived (and which typically will be perceived) as emitting from an audio object having said trajectory. For example, the program may indicate that sound from a musical instrument (an object) should pan from left to right, and the rendering system might generate speaker feeds for driving a 5.1 array of loudspeakers to emit sound that will be perceived as panning from the L (left front) speaker of the array to the C (center front) speaker of the array and then to the R (right front) speaker of the array. Herein, the “trajectory” of an audio object (indicated by an object based audio program) is used in a broad sense to denote the position or positions (e.g., position as a function of time) from which sound emitted during rendering of the program, and attributed to the object, is intended to be perceived as emitting. Thus, a trajectory could consist of a single, stationary point (or other position), or it could be a sequence of positions, or it could be a point (or other position) which varies as a function of time.
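The left-to-center-to-right pan in the example above can be sketched as pairwise constant-power amplitude panning across the front speakers of a 5.1 array. The speaker angles (L at -30°, C at 0°, R at +30°) and the sine/cosine pan law are illustrative assumptions, not taken from any particular rendering system:

```python
import math

# Front speakers of a 5.1 array with assumed azimuths in degrees.
SPEAKERS = [("L", -30.0), ("C", 0.0), ("R", 30.0)]

def front_pan_gains(azimuth_deg):
    """Return per-speaker gains for a source at the given azimuth,
    using a constant-power (sin/cos) pan law between adjacent speakers."""
    gains = {name: 0.0 for name, _ in SPEAKERS}
    for (n1, a1), (n2, a2) in zip(SPEAKERS, SPEAKERS[1:]):
        if a1 <= azimuth_deg <= a2:
            # normalized position within the pair: 0 at n1, 1 at n2
            t = (azimuth_deg - a1) / (a2 - a1)
            gains[n1] = math.cos(t * math.pi / 2)
            gains[n2] = math.sin(t * math.pi / 2)
            return gains
    # clamp positions outside the speaker span to the nearest edge speaker
    edge = SPEAKERS[0] if azimuth_deg < SPEAKERS[0][1] else SPEAKERS[-1]
    gains[edge[0]] = 1.0
    return gains
```

Sweeping the azimuth from -30° to +30° over time yields the perceived L-to-C-to-R pan; at every instant the squared gains of the active pair sum to one, keeping the perceived loudness constant.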
However, until the present invention it had not been known how to render an object based audio program (which is indicative of a trajectory of an audio source) by generating speaker feeds for driving a set of loudspeakers to emit sound intended to be perceived as emitting from the source but with said source having a different trajectory than the one indicated by the program. Typical embodiments of the invention are methods and systems for rendering an object based audio program (which is indicative of a trajectory of an audio source), including by efficiently generating speaker feeds for driving a set of loudspeakers to emit sound intended to be perceived as emitting from the source but with said source having a different trajectory than the one indicated by the program (e.g., with said source having a trajectory in a vertical plane, or a three-dimensional trajectory, where the program indicates the source's trajectory is in a horizontal plane).
There are many conventional methods for rendering audio programs in systems that employ channel-based audio encoding. For example, conventional upmixing techniques could be implemented during rendering of the audio programs (comprising speaker channels) which are indicative of sound from sources moving along trajectories within a subspace of a full three-dimensional volume (e.g., trajectories which are along horizontal lines), to generate speaker feeds for driving speakers positioned outside this subspace. Such upmixing techniques are based on phase and amplitude information included in the program to be rendered, whether this information was intentionally coded (in which case the upmixing can be implemented by matrix encoding/decoding with steering) or is naturally contained in the speaker channels of the program (in which case the upmixing is blind upmixing). However, the conventional phase/amplitude-based upmixing techniques which have been applied to audio programs comprising speaker channels are subject to a number of limitations and disadvantages, including the following:
whether the content is matrix encoded or not, they generate a significant amount of crosstalk across speakers;
in the case of blind upmixing, the risk of panning a sound in a way that is not coherent with accompanying video is greatly increased, and the typical way to lower this risk is to upmix only what appear to be non-directional elements of the program (typically decorrelated elements); and
they often create artifacts, either by limiting the steering logic to wideband operation, often making the sound image collapse during reproduction, or by applying multiband steering logic that spatially smears the frequency bands of a single sound (sometimes referred to as “the gargling effect”).
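The crosstalk limitation noted above can be illustrated with a minimal passive 4:2:4 matrix sketch. The √0.5 coefficient is typical of passive matrices, but this is an illustrative toy (real matrix systems also apply phase shifts to the surround channel and add active steering, both omitted here):

```python
import math

K = math.sqrt(0.5)  # ~0.707, a common passive-matrix coefficient

def matrix_encode(l, r, c, s):
    """4:2 encode of L, R, C, S into a stereo pair (real-valued sketch)."""
    lt = l + K * c + K * s
    rt = r + K * c - K * s
    return lt, rt

def matrix_decode(lt, rt):
    """Passive 2:4 decode without steering logic: L', R', C', S'."""
    return lt, rt, K * (lt + rt), K * (lt - rt)

# A center-only source leaks into the decoded left and right channels:
# this is the crosstalk that steering logic attempts to suppress.
lt, rt = matrix_encode(0.0, 0.0, 1.0, 0.0)
l2, r2, c2, s2 = matrix_decode(lt, rt)
```

Here the center signal is recovered at full level in C', but also appears at roughly -3 dB in L' and R'; suppressing that leakage is precisely the job of the steering logic whose wideband and multiband variants cause the artifacts described above.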
Even if conventional phase/amplitude-based techniques for upmixing audio programs comprising speaker channels (to generate upmixed programs having more speaker channels than the input programs) were somehow applied to object based audio programs (to generate speaker feeds for more loudspeakers than could be generated from the input programs without the upmixing), this would result in a loss of perceived discreteness (of the audio objects indicated by the upmixed programs) and/or would generate artifacts of the type described above. Thus, systems and related methods are needed for rectifying the deficiencies noted above.