Embodiments of the invention pertain to one or more aspects of an audio content creation and distribution pipeline (e.g., a pipeline for creating and distributing the audio content of an audiovisual program).
Such a pipeline implements generation of an audio program (typically an encoded audio program indicative of audio content and metadata corresponding to the audio content). Generation of an audio program may include audio production activities (the capture and recording of audio), and optionally also “post production” activities (the manipulation of recorded audio). Live broadcast necessarily requires that all authoring decisions be made during audio production. In generation of cinema and other non-realtime programs, many authoring decisions may be made during post production.
An audio content creation and distribution pipeline optionally implements remixing and/or remastering of a program. In some cases a program may require additional processing after content creation to repurpose the content for an alternative use case. For example, a program originally created for playback in a cinema may be modified (e.g., remixed) to be more suitable for playback in a home environment.
An audio content creation and distribution pipeline typically includes an encoding stage. An audio program may require encoding to enable distribution. For example, a program intended for playback in the home will typically be data compressed to allow more efficient distribution. The encoding process may include steps of reducing the complexity of the spatial audio scene, and/or data rate reduction of individual audio streams of the program, and/or packaging of multiple channels of audio content (e.g., compressed audio content) and corresponding metadata into a bitstream having a desired format.
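By way of illustration only, the final packaging step described above (multiplexing multiple channels of compressed audio content and corresponding metadata into a bitstream of a desired format) can be sketched as follows. The frame layout below is purely hypothetical and is not any standardized bitstream format; the function names are likewise illustrative.

```python
import json
import struct

def pack_frame(channel_payloads, metadata):
    """Pack compressed channel payloads and their metadata into a single
    illustrative bitstream frame:
    [metadata length][metadata JSON][channel count][per-channel length + payload].
    Not any real or standardized format."""
    meta_bytes = json.dumps(metadata).encode("utf-8")
    frame = struct.pack(">I", len(meta_bytes)) + meta_bytes
    frame += struct.pack(">H", len(channel_payloads))
    for payload in channel_payloads:
        frame += struct.pack(">I", len(payload)) + payload
    return frame

def unpack_frame(frame):
    """Invert pack_frame: recover the metadata and channel payloads."""
    meta_len, = struct.unpack_from(">I", frame, 0)
    metadata = json.loads(frame[4:4 + meta_len].decode("utf-8"))
    offset = 4 + meta_len
    count, = struct.unpack_from(">H", frame, offset)
    offset += 2
    payloads = []
    for _ in range(count):
        size, = struct.unpack_from(">I", frame, offset)
        offset += 4
        payloads.append(frame[offset:offset + size])
        offset += size
    return metadata, payloads
```

A decoder at the far end of the distribution chain would perform the inverse operation, recovering each channel's compressed payload together with its metadata before decoding and rendering.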
An audio content creation and distribution pipeline includes a stage of decoding and rendering (typically implemented by a playback system including a decoder). Ultimately the program is presented to the end consumer by rendering the audio description to loudspeaker signals based on the playback equipment and environment.
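The rendering of an audio description to loudspeaker signals can be illustrated, in minimal form, by a constant-power pan of a mono sample between two frontal loudspeakers. This sketch assumes a hypothetical stereo layout with speakers at -30° and +30° of azimuth; a real renderer would account for the full loudspeaker layout and environment.

```python
import math

def pan_stereo(sample, azimuth_deg):
    """Constant-power pan of a mono sample between a left speaker at -30 deg
    and a right speaker at +30 deg. A minimal stand-in for the rendering
    stage; the layout and pan law are illustrative assumptions."""
    # Map azimuth onto [0, 1], then onto [0, pi/2] for the sin/cos pan law.
    t = (azimuth_deg + 30.0) / 60.0          # 0 = full left, 1 = full right
    t = min(max(t, 0.0), 1.0)
    theta = t * math.pi / 2.0
    return sample * math.cos(theta), sample * math.sin(theta)
```

The sine/cosine law keeps the summed power of the two speaker feeds constant as the source moves, which is why the center position feeds both speakers at about -3 dB.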
Typical embodiments of the invention allow audio programs (e.g., soundtracks of movies or other programs having audio and image content) to be reproduced such that the location of auditory images is reliably presented in a way that is consistent with the location of corresponding visual images.
Traditionally, in a cinema mixing room (or other audiovisual program authoring environment) the location and size of a display screen (referred to herein as a “reference” screen, to distinguish it from an audiovisual program playback screen) coincide with the front wall of the mixing environment, and the left and right edges of the reference screen coincide with the positions of the left and right main screen loudspeakers. The loudspeaker for an additional center screen channel is generally located in the middle of the reference screen/wall. Thus the front wall extent, frontal loudspeaker locations, and screen location are consistently co-located. Typically, the reference screen is approximately as wide as the room, and the left, center, and right loudspeakers are near the left edge, center, and right edge of the reference screen. This arrangement is similar to the typical arrangement of the screen and frontal speakers in the expected movie theater playback environment. For example, FIG. 1 is a diagram of the front wall (W) of such a movie theater, with display screen S, left and right front speakers (L and R), and front center speaker (C) mounted to (or near) the front wall. During playback of a movie, a visual image B may be displayed on screen S, while an associated sound “A” is emitted from the speakers of the playback system (including speakers L, R, and C). For example, image B may be the image of a sound source (e.g., a bird or helicopter) and sound “A” may be sound intended to be perceived as emitting from that sound source. We assume that the movie has been authored and rendered so that sound A is perceived as emitting from a sound source location which coincides (or nearly coincides) with the location on screen S at which image B is displayed, when the frontal speakers are positioned coplanar with screen S, with the left front and right front speakers (L and R) at screen S's left and right edges, and a center front speaker near screen S's center. FIG. 1 assumes that screen S is at least substantially acoustically transparent, and that speakers L, C, and R are mounted behind (but at least substantially in the plane of) screen S.
However, during playback in a consumer's home (or by a mobile user's portable playback device), the size and positions of the frontal speakers (or headset speakers) of the playback system relative to each other and relative to the display screen of the playback system need not match those of the frontal speakers and display screen of the program authoring environment (e.g., cinema mixing room). In such playback cases, the width of the playback screen is typically significantly less than the distance separating the left and right main speakers (left and right front speakers, or the speakers of a headset, e.g., a pair of headphones). It is also possible that the screen is not centered relative to the main speakers, or not even at a fixed position relative to them (e.g., in the case of a mobile user wearing headphones and holding a display device). This can create noticeable discrepancies between the perceived audio and visuals.
For example, FIG. 2 is a diagram of the front wall (W′) of a room with the display screen (S′), left and right front speakers (L′ and R′), and front center speaker (C′) of a home theater system mounted to (or near to) the front wall. During playback (by the FIG. 2 system) of the same movie described in the FIG. 1 example, visual image B is displayed on screen S′, while associated sound A is emitted from the speakers of the playback system (including speakers L′, R′, and C′). We have assumed that the movie has been authored for rendering and playback (by a movie theater playback system) with sound A perceived as emitting from a sound source location which coincides (or nearly coincides) with the location on a movie theater screen at which image B is displayed. However, when the movie is played by the home theater system of FIG. 2, sound A will be perceived as emitting from a sound source location, near to the left front speaker L′, which neither coincides nor nearly coincides with the location on home theater screen S′ at which image B is displayed. This is because the frontal speakers L′, C′, and R′ of the home theater system have different sizes and positions relative to screen S′ than the frontal speakers of the program authoring system have relative to the reference screen of the program authoring system.
In the example of FIGS. 1 and 2, the expected cinema playback system is assumed to have a well-defined relationship between its loudspeakers and screen, and thus the content creator's desired relative locations for the displayed images and corresponding audio sources can be reproduced reliably (during playback in a cinema). For playback in other environments (e.g., in a home audio-video room), the assumed relationship between loudspeakers and screen is typically not preserved, and thus the relative locations of the displayed images and corresponding audio sources (which are desired by the content creator) are typically not well reproduced. The relative locations of displayed images and corresponding audio sources actually achieved during playback (other than in a cinema having the assumed relationship between loudspeakers and screen) are based on the actual relative locations and sizes of the playback system's loudspeakers and display screen.
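The mismatch described above can be compensated by remapping a horizontal object position, authored relative to the reference screen, onto the extent of the actual playback screen. The following sketch assumes purely hypothetical normalized coordinates in which the reference screen spans the full front wall (-1 to +1) while the playback screen spans a narrower interval; the function name and default widths are illustrative, not any claimed embodiment.

```python
def remap_to_playback_screen(x_ref, ref_screen=(-1.0, 1.0), play_screen=(-0.4, 0.4)):
    """Linearly remap a horizontal object position authored relative to the
    reference screen onto the (narrower) playback screen, so that a sound
    tied to an on-screen image is reproduced at the image's displayed
    position. Coordinate ranges here are hypothetical assumptions."""
    r0, r1 = ref_screen
    p0, p1 = play_screen
    return p0 + (x_ref - r0) * (p1 - p0) / (r1 - r0)
```

With such a remapping, a sound authored at the left edge of the reference screen (x = -1) is rendered at the left edge of the playback screen rather than at the playback system's left front speaker, addressing the discrepancy illustrated by FIG. 2.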
During playback of an audiovisual program, for sounds that are rendered to be perceived at on-screen locations, the optimal auditory image position is independent of the listener position. For sounds that are rendered to be perceived at off-screen locations (at a non-zero distance in a direction perpendicular to the plane of the screen), there is potential for parallax errors in the aurally perceived location of the sound source, depending on the listener position. Methods have been proposed which attempt to minimize or eliminate such parallax error based on a known or assumed listener position.
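The parallax effect for off-screen sounds can be made concrete by projecting the line from the listener through the object's 3-D position onto the screen plane. In the sketch below (coordinates and function name are illustrative assumptions: x runs across the screen, z is perpendicular to it, the screen lies at z = 0, and the listener sits at z > 0), two listeners at different positions perceive the same off-screen object at different on-screen points, which is exactly the parallax error described above; for an on-screen object (z = 0) the projection is the object position itself, independent of the listener.

```python
def screen_intersection(listener, obj):
    """Return the x-coordinate at which the line from the listener through
    a (possibly off-screen) object position crosses the screen plane z = 0.
    listener and obj are (x, z) pairs; coordinates are illustrative."""
    lx, lz = listener
    ox, oz = obj
    if lz == oz:
        raise ValueError("line is parallel to the screen plane")
    t = lz / (lz - oz)           # parameter at which z reaches 0
    return lx + t * (ox - lx)
```

A method that corrects parallax based on a known or assumed listener position would, in effect, adjust the rendered source direction so that this intersection point coincides with the intended on-screen anchor for that listener.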
It is known to employ high-end playback systems (e.g., in movie theaters) to render object based audio programs (e.g., object based programs indicative of movie soundtracks). For example, object based audio programs which are movie soundtracks may be indicative of many different sound elements (audio objects) corresponding to images on a screen, dialog, noises, and sound effects that emanate from different places on (or relative to) the screen, as well as background music and ambient effects (which may be indicated by speaker channels of the program) to create the intended overall auditory experience. Accurate playback of such programs requires that sounds be reproduced in a way that corresponds as closely as possible to what is intended by the content creator with respect to audio object size, position, intensity, movement, and depth.
Object based audio programs represent a significant improvement over traditional speaker channel-based audio programs, since speaker channel-based audio is more limited with respect to spatial playback of specific audio objects than is object channel-based audio. The audio channels of speaker channel-based audio programs consist of speaker channels only (not object channels), and each speaker channel typically determines a speaker feed for a specific, individual speaker in a listening environment.
Various methods and systems for generating and rendering object based audio programs have been proposed. During generation of an object based audio program, it is typically assumed that an arbitrary number of loudspeakers will be employed for playback of the program, and that the loudspeakers to be employed (typically, in a movie theater) for playback will be located in arbitrary locations in the playback environment; not necessarily in a (nominally) horizontal plane or in any other predetermined arrangement known at the time of program generation. Typically, object-related metadata included in the program indicates rendering parameters for rendering at least one object of the program at an apparent spatial location or along a trajectory (in a three dimensional volume), e.g., using a three-dimensional array of speakers. For example, an object channel of the program may have corresponding metadata indicating a three-dimensional trajectory of apparent spatial positions at which the object (indicated by the object channel) is to be rendered. The trajectory may include a sequence of “floor” locations (in the plane of a subset of speakers which are assumed to be located on the floor, or in another horizontal plane, of the playback environment), and a sequence of “above-floor” locations (each determined by driving a subset of the speakers which are assumed to be located in at least one other horizontal plane of the playback environment). Examples of rendering of object based audio programs are described, for example, in PCT International Application No. PCT/US2011/028783, published under International Publication No. WO 2011/119401 A2 on Sep. 29, 2011, and assigned to the assignee of the present application.
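The trajectory metadata described above can be pictured as a sequence of timed position breakpoints attached to an object channel, with the renderer interpolating between them. The breakpoint representation and function name below are simplified illustrative assumptions, not the metadata syntax of any particular format.

```python
from bisect import bisect_right

def position_at(trajectory, t):
    """Linearly interpolate an object's apparent spatial position at time t
    from trajectory metadata given as (time, (x, y, z)) breakpoints, sorted
    by time. A simplified stand-in for real object-channel metadata."""
    times = [point[0] for point in trajectory]
    if t <= times[0]:
        return trajectory[0][1]
    if t >= times[-1]:
        return trajectory[-1][1]
    i = bisect_right(times, t)
    t0, p0 = trajectory[i - 1]
    t1, p1 = trajectory[i]
    a = (t - t0) / (t1 - t0)
    return tuple(c0 + a * (c1 - c0) for c0, c1 in zip(p0, p1))
```

A trajectory rising from a “floor” location to an “above-floor” location would thus be encoded as breakpoints whose interpolated positions the renderer maps, frame by frame, onto the available loudspeaker subsets.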
The advent of object based audio program rendering has significantly increased the amount of the audio data processed and the complexity of rendering that must be performed by rendering systems, in part because an object based audio program may be indicative of many objects (each with corresponding metadata) and may be rendered for playback by a system including many loudspeakers. It has been proposed to limit the number of object channels included in an object based audio program so that an intended rendering system has capability to render the program. For example, U.S. Provisional Patent Application No. 61/745,401, entitled “Scene Simplification and Object Clustering for Rendering Object based Audio Content,” filed on Dec. 21, 2012, naming Brett Crockett, Alan Seefeldt, Nicolas Tsingos, Rhonda Wilson, and Jeroen Breebaart as inventors, and assigned to the assignee of the present invention, describes methods and apparatus for so limiting the number of object channels of an object based audio program by clustering input object channels to generate clustered object channels which are included in the program and/or by mixing audio content of input object channels with speaker channels to generate mixed speaker channels which are included in the program. It is contemplated that some embodiments of the present invention may be performed in conjunction with such clustering (e.g., in a mixing or remixing facility) to generate an object based program for delivery (with screen-related metadata) to a playback system, or for use in generating a speaker channel-based program for delivery to a playback system.
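The idea of limiting object count by clustering can be sketched with a simple greedy scheme: each object joins an existing cluster whose centroid lies within a distance threshold, or starts a new one. This is only an illustrative toy (2-D positions, unweighted centroid updates, audio mixing omitted), not the method of the cited application.

```python
def cluster_objects(objects, max_dist=0.5):
    """Greedily cluster audio objects by spatial proximity. objects is a
    list of (name, (x, y)) pairs. Each object merges into the first cluster
    whose centroid is within max_dist, otherwise it starts a new cluster.
    Illustrative only; real scene simplification is far more elaborate."""
    clusters = []  # each cluster: {"pos": (x, y), "members": [names]}
    for name, pos in objects:
        for c in clusters:
            cx, cy = c["pos"]
            if ((pos[0] - cx) ** 2 + (pos[1] - cy) ** 2) ** 0.5 <= max_dist:
                c["members"].append(name)
                n = len(c["members"])
                # Incremental (running-mean) centroid update.
                c["pos"] = (cx + (pos[0] - cx) / n, cy + (pos[1] - cy) / n)
                break
        else:
            clusters.append({"pos": pos, "members": [name]})
    return clusters
```

The output clusters play the role of the clustered object channels mentioned above: nearby input objects collapse into one rendered object, bounding the number of object channels a playback system must process.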