A. Field of Invention
This invention pertains to a method and apparatus for enhancing a 3D movie by using 3D space information associated with at least some of the characters/objects that are either part of the scene or off scene to position associated audio objects in 3D space.
B. Description of the Prior Art
In the past, 3D movie or episodic visual content was prepared for analog film distribution or other relatively low fidelity analog or digital transmission, storage, projection and/or display 3D formats, e.g. anaglyph. Advances in 3D encoding formats, presentation technology, and digital signal processing have resulted in 3D movie or episodic visual content produced or post-produced on film or video, converted to digital formats where necessary, and then transmitted, projected and/or displayed digitally in higher quality 3D formats, e.g., stereoscopic HD 1920×1080p 3D Blu-ray Disc. In the present application, the term ‘digital 3D movie’ is used to refer to a 3D movie, episodic, or other 3D audiovisual content recorded, produced and/or converted into a digital format. This also includes content produced in 2D and then post-produced from 2D to 3D, as well as rendered from 3D animation systems.
The formats for the audio component of digital 3D movies can vary in terms of production, encoding, transmission and/or presentation. Typical presentation formats for the audio component may vary from mono to stereo to multi-channel such as 5.1, 6.1 or 7.1. Some of these audio formats include audio cues for depth perception such as amplitude differences, phase differences, arrival time differences, reverberant vs. direct sound source level ratios, tonal balance shifts, masking, and/or surround or multi-channel directionality. These cues can be tailored to enhance the presentation of a digital 3D movie so that audio 3D space perception complements the visual 3D space perception. In this manner, a digital 3D movie looks and ‘feels’ more realistic if the 3D position of a visual object of interest and associated audio are coincident.
When a digital 3D movie is prepared for distribution in some format or distribution channel, there may be relevant 3D visual information determined by analysis software and/or by an operator on a frame by frame, group of frames, or scene by scene basis and recorded in a respective log.
The conventional method of representing 3D depth information is via a z-axis depth map, which consists of a single 2-dimensional image that has the same spatial resolution as the 3D imagery (e.g. 1920×1080 for HD video). Each pixel of the image contains a gray-scale value corresponding to the depth of that particular pixel in the scene. For example, for an 8-bit data representation a gray-scale value of 255 (pure white) could represent the maximum positive 3D parallax (into the screen), while a value of 0 (pure black) could represent the maximum negative parallax (out of the screen). The values can then be normalized based on the depth budget of the scene, e.g. a value of 255 could represent a pixel that is 100 feet away from the viewer whereas a value of 0 could represent a pixel which is 10 feet away from the viewer.
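By way of illustration only, the normalization of depth-map values against a scene's depth budget can be sketched as follows. The sketch assumes a linear mapping between gray-scale value and viewer distance, which the description above does not mandate, and it takes 255 as the largest value an 8-bit sample can hold. The function name `gray_to_distance` and its parameters are hypothetical and are introduced here only for the example.

```python
def gray_to_distance(gray, near_ft=10.0, far_ft=100.0, bits=8):
    """Map a depth-map gray-scale value to a viewer distance in feet.

    Assumption: the mapping is linear, with 0 (pure black) at the near
    limit of the depth budget and the maximum value (pure white) at the
    far limit. Other mappings (e.g. nonlinear in parallax) are possible.
    """
    max_gray = (1 << bits) - 1  # 255 for an 8-bit representation
    if not 0 <= gray <= max_gray:
        raise ValueError(f"gray value {gray} outside 0..{max_gray}")
    # Linear interpolation across the depth budget of the scene.
    return near_ft + (gray / max_gray) * (far_ft - near_ft)


# Example depth budget of 10 ft (nearest) to 100 ft (farthest).
nearest = gray_to_distance(0)
farthest = gray_to_distance(255)
```

With the example depth budget, a pure black pixel maps to the 10 foot limit and a pure white pixel maps to the 100 foot limit; intermediate values fall proportionally between the two.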
Another possible data representation of 3D depth information is a 3-dimensional depth volume, whereby each pixel in the 3D volume of the scene is represented by a particular value. Unlike the z-axis depth map the 3D depth volume is not limited to a single gray-scale value, and instead for each pixel both the color value (i.e. RGB value) of that particular pixel as well as the x-y-z coordinate of that pixel can be represented. Computer generated 3D imagery or other 3D visual effects techniques may more easily lend themselves to creating 3D depth volumes versus utilizing a 2D z-axis depth map. Such 3D representations of the depth information could be used for future display systems including holographic projection. Other data representations can be used to represent the depth information in a given scene including, but not limited to, 2D disparity maps and eigenvectors.
A 3D space map of whole frames' visual content, or of objects of interest within frames, may be determined when preparing to position subtitles or other graphics in 3D space over the background video.
Some objects of audio interest could have on-screen visual counterparts that can be tracked spatially. For example, as an on-screen actor moves and speaks in a scene, his position can be tracked both audibly and visually. There are visual object-tracking software systems and software development kits (such as the SentiSight 3.0 kit of Neurotechnology, Vilnius, Lithuania) that can detect and recognize visual objects within a scene and identify their specific locations. Such systems can tolerate in-plane rotation, some out-of-plane rotation, and a wide range of changes in scale. Such systems can also manage to track visual or audio objects that are occluded (e.g., as much as 50%). If motion vectors were to be used to plot the trajectory of objects that are either occluded to a greater degree, or even fully occluded visually, then object tracking could also identify locations of off-screen objects given sufficient, prior on-screen information. Other objects of audio interest, e.g., an actor speaking while off screen, or an actor speaking while being partially or fully occluded visually, may not be tracked. In this latter case, an on-screen actor might look directly across and past the screen plane boundary at another off-screen actor with whom he converses. Other audio objects of interest may not correspond to on-screen visual objects at all depending upon positioning or editorial intent, e.g., an off-screen narrator's voice may be essential to a presentation, but there may be no on-screen item that corresponds to that voice.
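As a simplified illustration of how motion vectors might be used to estimate the location of an object that has become occluded or has left the frame, the following sketch extrapolates a position under a constant-velocity assumption. This is only one possible approach and is not drawn from any particular tracking product; the function names, the (x, y, z) coordinate convention, and the 1920-pixel frame width used in the example are assumptions made for the sketch.

```python
def estimate_velocity(prev_pos, last_pos):
    """Per-frame motion vector from the last two observed positions."""
    return tuple(b - a for a, b in zip(prev_pos, last_pos))


def extrapolate_position(last_pos, velocity, frames_elapsed):
    """Project a position forward, assuming constant velocity.

    Assumption: the object keeps the motion it had when last seen.
    Real trajectories may require a richer motion model.
    """
    return tuple(p + v * frames_elapsed for p, v in zip(last_pos, velocity))


FRAME_WIDTH = 1920  # assumed horizontal resolution in pixels

# Last two on-screen observations as (x, y, z), with z an assumed depth value.
prev_pos = (1800, 540, 20)
last_pos = (1860, 540, 20)

velocity = estimate_velocity(prev_pos, last_pos)
estimate = extrapolate_position(last_pos, velocity, frames_elapsed=2)
is_off_screen = not (0 <= estimate[0] < FRAME_WIDTH)
```

In this example the object is moving to the right at 60 pixels per frame, so two frames after its last observation the estimated horizontal coordinate lies beyond the right edge of the frame, and an associated audio object could be positioned accordingly.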
However, in some instances during the preparation of a digital 3D movie its audio component may not include clear 3D space perception cues, either because these cues have been stripped away or because they were missing in the first place. This problem is compounded in real-time applications and environments such as video game rendering and live event broadcasting.
Just as there is a need to provide the audio component with cues for 3D space perception to enhance a digital 3D movie presentation, there is also a need to include such cues in the audio components of digital 3D movies in other formats. However, presently the preparation of digital 3D movies for release in one format does not include an efficient conversion of the audio component that ensures the presence or preservation of the 3D space perception audio cues.
Therefore, an efficient scheme to optimize digital 3D movie preparation with audio 3D space perception cues is required. In addition, an efficient scheme to optimize additional digital 3D movie conversion with audio 3D space perception cues for other formats or distribution formats is required. In both cases, information gathered in digital 3D movie analysis is used as input to produce audio 3D space perception cues to enhance the 3D audiovisual experience.
Another problem arises in that, at present, a separate 2D version of the audio component, without 3D space perception cues, must be distributed if the digital 3D movie is to be viewed in 2D, e.g. if there is no 3D display system available. Therefore, the data created in the course of encoding the audio 3D space perception cues can be saved and included with the digital 3D movie release file so that 3D-to-2D down-mixing can be managed downstream.