A. Field of Invention
The present disclosure relates to the production and configuration of Virtual Reality or Augmented Reality presentations. More particularly, this invention pertains to a method and apparatus for enhancing a Virtual Reality and/or Augmented Reality presentation (hereafter referred to as a ‘VR/AR presentation’) by using 3D space information associated with at least some of the visual characters and other objects of interest that are either part of the actively viewed scene, or outside the field of view, to position the associated audio characters/objects of interest (hereafter referred to as ‘audio objects’ or ‘audio objects of interest’) in the 3D space of the VR/AR presentations. Moreover, the apparatus and method further provide for the augmentation of audio objects using characteristics of the visual environment for the VR presentations and the characteristics of actual environments for the AR presentations.
B. Description of the Prior Art
In the past, a 3D movie or other similar episodic audio/visual content was prepared for analog film distribution or other relatively low fidelity analog or digital transmission, storage, projection and/or display 3D formats, e.g. anaglyph 3D. Advances in 3D encoding formats, presentation technology, and digital signal processing have resulted in 3D movie or episodic visual content produced in much higher quality 3D formats, e.g., stereoscopic HD 1920×1080p, 3D Blu-ray Discs, etc.
“Virtual Reality” is a term that has been used for various types of content that simulates immersion in a partially or wholly computer-generated and/or live action three-dimensional world. Such content may include, for example, various video games and animated film content. A variation of these technologies is sometimes called “Augmented Reality.” In an Augmented Reality presentation, an actual 3D presentation of the user's current surroundings is ‘augmented’ by the addition of one or more virtual objects or overlays. Augmented Reality content may be as simple as textual ‘heads up’ information about objects or people visible around the user, or as complex as transforming the entire appearance of the user's surroundings into an imaginary environment that corresponds to the user's real surroundings. Advances in encoding formats, presentation technology, motion tracking, position tracking, eye tracking, portable accelerometer and gyroscopic output/input, and related signal processing have reached a point where both virtual and augmented reality presentations can be displayed to a user in real time.
Virtual Reality (VR) and Augmented Reality (AR) have been implemented in various types of immersive video stereoscopic presentation techniques including, for example, stereoscopic VR headsets. As mentioned above, 3D headsets and other 3D presentation devices immerse the user in a 3D scene. Lenses in the headset enable the user to focus on a lightweight split display screen mounted in the headset positioned inches from the user's eyes. In some headset types, different sides of the split display show right and left stereoscopic views of video content, while the user's peripheral view is blocked or left partially unobstructed below the central field of view. In another type of headset, two separate displays are used to show different images to the user's left eye and right eye respectively. In another type of headset, the field of view of the display encompasses the full field of view of each eye including the peripheral view. In another type of headset, in order to achieve either AR or VR, an image is projected on the user's retina using controllable small lasers, mirrors or lenses. Either way, the headset enables the user to experience the displayed VR or AR content in a manner that makes the user feel as though he were immersed in a real scene. Moreover, in the case of AR content, the user may experience the augmented content as if it were a part of, or placed in, an augmented real scene. VR or AR content can also be presented to a viewer as a 360° picture on a standard screen, with the image moving left or right and/or up and down either automatically or under the control of the viewer.
The immersive AR/VR effects may be provided or enhanced by motion sensors in a headset (or elsewhere) that detect motion of the user's head, and adjust the video display(s) accordingly. By turning his head to the side, the user can see the VR or AR scene off to the side; by turning his head up or down, the user can look up or down in the VR or AR scene. The headset (or other device) may also include tracking sensors that detect position of the user's head and/or body, and adjust the video display(s) accordingly. By leaning or turning, the user can see a VR or AR scene from a different point of view. This responsiveness to head movement, head position and body position greatly enhances the immersive effect achievable by the headset. The user may thus be provided with the impression of being placed inside or ‘immersed’ in the VR scene. As used herein, “immersive” generally encompasses both VR and AR presentations.
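The head-motion adjustment described above can be sketched as a simple coordinate transform. The example below is a minimal illustration, not a description of any particular headset's API; it assumes a right-handed convention with x = right, y = up, z = forward, and a positive yaw when the head turns left. A scene object's world-space position is counter-rotated by the head yaw so the object stays fixed in the world as the head turns.

```python
import math

def world_to_head_relative(obj_xyz, head_yaw_deg):
    """Rotate a world-space position into the listener's head-relative
    frame. Convention (assumed): x = right, y = up, z = forward; positive
    yaw means the head has turned left about the vertical axis."""
    x, y, z = obj_xyz
    yaw = math.radians(head_yaw_deg)
    # Counter-rotate the object by the head yaw about the vertical axis.
    rel_x = x * math.cos(yaw) + z * math.sin(yaw)
    rel_z = -x * math.sin(yaw) + z * math.cos(yaw)
    return (rel_x, y, rel_z)

# An object 2 units straight ahead; the user turns his head 90° to the
# left, so the object should now lie off to the user's right (+x).
print(world_to_head_relative((0.0, 0.0, 2.0), 90.0))
```

A full implementation would extend this to pitch and roll (a 3×3 rotation or quaternion) and add the translation measured by the position-tracking sensors.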
Immersive headsets and other wearable immersive output devices are especially useful for game play of various types, which involve user exploration of a modeled environment generated by a rendering engine as the user controls one or more virtual camera(s) or displays using head movement, the position or orientation of the user's body, head, eye, hands, fingers, feet, or other body parts, and/or other inputs using sensors such as accelerometers, altimeters, GPS receivers, electronic tape measures, laser distance finders, laser or sound digital measuring devices, gyroscopic sensors and so on. To provide an immersive experience, the user needs to perceive a freedom of movement that is in some way analogous to human visual and aural perception when interacting with reality.
Content produced for VR/AR presentations can provide this experience using techniques for real-time rendering that have been developed for various types of video games. The content may be designed as a three-dimensional computer model with defined boundaries and rules for rendering the content as a video signal. This content can be enhanced by stereoscopic techniques to provide stereoscopic video output, sometimes referred to as “3D,” and associated with a VR/AR presentation that manages the rendering process in response to movement of the 3D headset, or head, eye, hand, finger, foot or other body part (or body part appendage such as a ‘magic’ wand or golf club) movement, and/or other inputs such as the sensors mentioned above to produce a resulting digital VR/AR presentation and user experience. The user experience can be very much like being placed or immersed inside a rendered video game environment.
In other types of VR/AR presentations, the simulated 3D environment may be used primarily to tell a story, more like traditional theater or cinema. In these types of presentation, the added visual effects may enhance the depth and richness of the story's narrative elements or special effects, without giving the user full control (or any control) over the narrative itself. However, a rich mixed reality experience is provided that progresses differently during each encounter (or viewing), as opposed to a standard linear book or movie wherein a set narrative or sequence of scenes is presented having a single ending. This experience depends upon direction from the viewer which way to look, for example, though clearly this can be influenced and directed by narrative cues, as well as by some random elements that may be introduced by the software. As a result, the narrative is not linear or predictable at the outset but variable due, for example, to choices made by the viewer and other factors. In other words, as a joint result of viewer choices and other factors in concert with the mixed reality environment, the narrative or story being presented can evolve dramatically on the fly, creating tension and release, surprises, linear or non-linear progress, turning points, or dead ends. These considerations are especially applicable to unscripted presentations which in some sense have variable, dynamically changing sequences similar to games or live reality shows. It is especially important for these kinds of presentations to ensure that both the audio and visual signals are as realistic as possible so that the presentations appear realistic and not fake or artificial.
In the present application, the term ‘digital VR/AR presentation’ is used to refer to videogame, movie, episodic, or other audiovisual content recorded, produced, rendered, and/or otherwise generated in a digital format, or audiovisual content recorded, produced, rendered or otherwise generated in a digital format to be overlaid on reality. The term also covers content produced in 2D, content produced in 2D and then post-produced from 2D to 3D, content produced natively in 3D, as well as content rendered from 3D animation systems.
When a digital VR/AR presentation is prepared for distribution in some format or distribution channel, there may be relevant 3D visual information determined by analysis software and/or by an operator on a frame by frame, group of frames, or scene by scene basis and recorded in a respective log.
The conventional method of representing 3D depth information is via a z-axis depth map, which consists of a single 2-dimensional image that has the same spatial resolution as the 3D imagery (e.g. 1920×1080 for HD video). Each pixel of the image contains a gray-scale value corresponding to the depth of that particular pixel in the scene. For example, for an 8-bit data representation a gray-scale value of 255 (pure white) could represent the maximum positive 3D parallax (into the screen), while a value of 0 (pure black) could represent the maximum negative parallax (out of the screen). The values can then be normalized based on the depth budget of the scene, e.g. a value of 255 could represent a pixel that is 100 feet away from the viewer whereas a value of 0 could represent a pixel which is 10 feet away from the viewer. Another possible data representation of 3D depth information is a 3-dimensional depth volume, whereby each pixel in the 3D volume of the scene is represented by a particular value. Unlike the z-axis depth map, the 3D depth volume is not limited to a single gray-scale value; instead, for each pixel both the color value (i.e. RGB value) of that particular pixel and the x-y-z coordinate of that pixel can be represented. Computer generated 3D imagery or other 3D visual effects techniques may more easily lend themselves to creating 3D depth volumes versus utilizing a 2D z-axis depth map. Such 3D representations of the depth information could be used for future display systems including holographic projection. Other data representations can be used to represent the depth information in a given scene including, but not limited to, 2D disparity maps and eigenvectors.
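The normalization just described can be expressed directly. The sketch below assumes the 8-bit example and depth budget from the text (code 0 maps to 10 feet, the maximum code to 100 feet, linearly); the function names and default parameters are illustrative only.

```python
def gray_to_depth(gray, near_ft=10.0, far_ft=100.0, bits=8):
    """Decode a z-axis depth-map code to a distance from the viewer,
    linearly normalized to the scene's depth budget (assumed convention:
    0 -> near plane, maximum code -> far plane)."""
    max_code = (1 << bits) - 1          # 255 for an 8-bit depth map
    return near_ft + (gray / max_code) * (far_ft - near_ft)

def depth_to_gray(depth_ft, near_ft=10.0, far_ft=100.0, bits=8):
    """Inverse mapping: quantize a distance back to a gray-scale code,
    clamping values outside the depth budget."""
    max_code = (1 << bits) - 1
    t = (depth_ft - near_ft) / (far_ft - near_ft)
    return round(min(max(t, 0.0), 1.0) * max_code)

print(gray_to_depth(255))   # far plane of the budget: 100.0 ft
print(gray_to_depth(0))     # near plane of the budget: 10.0 ft
```

A 3D depth volume would replace the single scalar per pixel with a record holding both the RGB value and the x-y-z coordinate, but the normalization of the z component follows the same pattern.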
As part of generating a VR/AR presentation, a 3D space map of the frames' visual content, or of objects of interest within frames, may be determined when preparing to position subtitles or other graphics in 3D space over the background video.
Some audio objects of interest could have on-screen visual counterparts that can be tracked spatially. For example, as an on-screen actor moves and speaks in a scene, his position can be tracked both audially and visually. There are, for example, visual object-tracking software systems and software development kits (such as the SentiSight 3.0 kit of Neurotechnology, Vilnius, Lithuania) that can detect and recognize visual objects within a scene and identify their specific locations. Such systems can tolerate in-plane rotation, some out-of-plane rotation, and a wide range of changes in scale. Such systems can also manage to track visual or audio objects that are occluded (e.g., as much as 50%). If motion vectors were to be used to plot the trajectory of objects that are either occluded to a greater degree, or even fully occluded visually, then object tracking could also identify locations of off-screen objects given sufficient prior on-screen information, or even post on-screen information for pre-authored sequences. Other audio objects of interest, e.g., an actor speaking while off screen, or an actor speaking while being partially or fully occluded visually, may not be tracked. In this latter case, an on-screen actor might look directly across and past the screen plane boundary at another off-screen actor with whom he converses. Other audio objects of interest may not correspond to on-screen visual objects at all depending upon positioning or editorial intent, e.g., an off-screen narrator's voice may be essential to a presentation, but there may be no on-screen item that corresponds to that voice.
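The motion-vector idea mentioned above can be illustrated with a deliberately simple linear extrapolation. This is a sketch only: a production tracker would smooth or fit the observed trajectory rather than use a single frame-to-frame delta, and all names here are hypothetical.

```python
def extrapolate_position(track, frames_ahead):
    """Linearly extrapolate an occluded or off-screen object's position
    from its last observed motion vector.

    `track` is a list of (x, y, z) positions from frames in which the
    object was still visible; the per-frame motion vector is taken from
    the last two samples."""
    (x0, y0, z0), (x1, y1, z1) = track[-2], track[-1]
    vx, vy, vz = x1 - x0, y1 - y0, z1 - z0   # units per frame
    return (x1 + vx * frames_ahead,
            y1 + vy * frames_ahead,
            z1 + vz * frames_ahead)

# An actor walking off screen along x at 0.5 units per frame; predict
# his position four frames after he leaves the visible frame.
print(extrapolate_position([(0.0, 1.0, 5.0), (0.5, 1.0, 5.0)], 4))
```

The extrapolated position can then be handed to the audio positioning stage so that, for instance, an off-screen actor's voice continues to move along his plotted trajectory.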
However, in some instances during the preparation of a digital VR/AR presentation, its audio component, or parts of the audio component relating to audio objects of interest, may not include clear 3D space perception cues, either because these cues have been stripped away or otherwise lost, or because they were missing in the first place. This problem is compounded in real-time applications and environments such as video game rendering and live event broadcasting.
Just as there is a need to provide the audio component with cues for 3D space perception to enhance a digital VR/AR presentation, there is also a need to include such cues in the audio components of digital VR/AR presentations in other formats. However, presently the preparation of digital VR/AR presentations for release in one format does not include an efficient conversion of the audio component that ensures the presence or preservation of the 3D space perception audio cues in the digital VR/AR presentation released in an additional format.
Therefore, an efficient scheme to optimize digital VR/AR presentation preparation with audio 3D space perception cues is required. In addition, an efficient scheme to optimize additional digital VR/AR presentation conversion with audio 3D space perception cues for other formats or distribution formats is required. In both cases, information gathered in digital 3D video analysis is used as input to produce audio 3D space perception cues to enhance the 3D audiovisual experience.
Another problem arises in that currently a separate 2D version of the audio component, without 3D space perception cues, may be distributed if the otherwise digital VR/AR presentation is to be viewed in 2D, e.g. when no digital VR/AR presentation system is available, i.e. no relevant VR/AR headset and/or no 3D display. Therefore, the data created in the course of encoding the audio 3D space perception cues can be saved and included with the digital VR/AR presentation release file so that 3D-to-2D down-mixing can be managed downstream.
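As a rough illustration of such downstream down-mixing, the sketch below folds positioned audio objects to a plain stereo pair using their saved 3D cue data and constant-power panning. The data layout and names are assumptions for illustration, not a description of any particular release format.

```python
import math

def downmix_to_stereo(audio_objects):
    """Fold positioned audio objects down to a stereo (left, right) pair
    using saved 3D position data, so a 2D release needs no separately
    authored 2D mix.

    Each object is (sample, (x, y, z)) with x in [-1, 1] spanning
    left-to-right. Constant-power panning keeps total loudness steady
    while the y and z cues are discarded in the 2D fold-down."""
    left = right = 0.0
    for sample, (x, _y, _z) in audio_objects:
        pan = (x + 1.0) / 2.0                 # 0 = hard left, 1 = hard right
        left += sample * math.cos(pan * math.pi / 2.0)
        right += sample * math.sin(pan * math.pi / 2.0)
    return left, right

# One centered object: equal power reaches both channels.
print(downmix_to_stereo([(1.0, (0.0, 0.0, 2.0))]))
```

Because the positions travel with the release file, the same fold-down can be recomputed for any target layout (stereo, 5.1, etc.) rather than baked in at mastering time.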
Audio and video together largely create the VR/AR presentations and resulting experiences at issue. (Here we are not concerned with so-called “4D” theatrical presentations wherein aromas, moving seats, and/or water (e.g. ‘rain’) dispensers are used to enhance an otherwise normal theatrical presentation.) A VR/AR presentation will therefore be enhanced, and the user experience made more enveloping and powerful, if audio cues related to the position of objects of interest in the VR/AR presentation complement the video, just as relevant audio cues underscore the visual position of objects of interest in real life, e.g. a fire engine racing by in one's visual field, preceded by its siren first at low amplitude and relatively low pitch when it is far away, then louder and higher pitched as it arrives, then fading away in amplitude and pitch as it passes into the distance, with the apparent sound source rising upward as the fire engine exits the shot driving up a hill.
The formats for the audio component of digital VR/AR presentations can vary in terms of production, encoding, transmission, generation, and/or presentation. Typical presentation formats for the audio component may vary from mono to stereo to multi-channel such as 5.1, 6.1, 7.1 or so-called ‘object oriented’ or ‘immersive’ audio. Some of these audio formats include audio cues for depth perception such as amplitude differences, phase differences, arrival time differences, reverberant vs. direct sound source level ratios, tonal balance shifts, masking, and/or surround or multi-channel directionality. These cues can be tailored in light of video object spatial position data to enhance the presentation of a digital VR/AR presentation so that audio 3D space perception in X, Y and Z axes complements visual 3D space perception. In this manner, a digital VR/AR presentation looks and ‘feels’ more realistic if the 3D position of a visual object of interest and associated audio are coincident.
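Several of the cues listed above follow directly from an object's distance, which can be read from the visual 3D position data. The sketch below is illustrative only: it applies the inverse-distance law for amplitude, the speed of sound for arrival delay, and a simple direct-to-reverberant ratio that falls with distance; real spatial audio renderers use tuned curves and per-room parameters.

```python
def distance_cues(distance_m, ref_distance_m=1.0, speed_of_sound=343.0):
    """Derive simple audio depth cues from a source's distance in meters.

    Returns (gain, arrival_delay_s, direct_to_reverb_ratio). Gain follows
    the inverse-distance law, clamped inside the reference distance; the
    direct-to-reverberant ratio is modeled as falling with distance, so
    farther sources sound proportionally more reverberant."""
    gain = ref_distance_m / max(distance_m, ref_distance_m)
    delay_s = distance_m / speed_of_sound
    direct_to_reverb = gain
    return gain, delay_s, direct_to_reverb

# A siren 34.3 m away: attenuated, and arriving about 100 ms late.
print(distance_cues(34.3))
```

Driving these parameters from the tracked visual position each frame yields the fire-engine behavior described above: rising amplitude and shrinking delay as the source approaches, then the reverse as it recedes.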
It would be desirable, therefore, to develop methods and apparatus that not only provide audio tracks indicative of the position of objects of interest in VR/AR presentations but also adjust the audio tracks to better match the environments in which the objects are placed to enhance the appeal and enjoyment of VR and AR content for more immersive VR/AR presentations.