A wide variety of computing devices, such as gaming consoles, virtual-reality equipment, augmented-reality equipment, mixed-reality equipment, smart televisions, set top boxes, desktop computers, laptops, smartphones, and specialty devices such as iPods, are available to consumers. The computing capabilities of many of these devices can be harnessed by creative content producers to provide very rich, immersive and interactive media content.
For example, filmmakers, digital content creators and technology developers have been developing 360-video capture systems, corresponding authoring tools and compatible media players to create and present interactive and immersive media experiences for a variety of platforms including virtual reality. Such video capture systems include multiple individual but coordinated video cameras positioned in an array in which each camera has a unique position and field of view to collectively capture video that spans 360×180 degrees. Frames of the captured digital video from the video cameras are synchronized and stitched together using image processing algorithms to produce video frames each of which contains 360×180 content. Each of these video frames is typically stored in an equirectangular format, to facilitate straightforward projection onto a geometry such as a spherical mesh for playback.
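As an illustration of why the equirectangular format projects straightforwardly onto a sphere, the following Python sketch maps a pixel position in an equirectangular frame to a direction on a unit sphere. The function name, the axis convention (y up, z forward) and the frame dimensions are assumptions made for this example and are not taken from any particular media player.

```python
import math

def equirect_to_direction(u, v, width, height):
    """Map pixel (u, v) of a width x height equirectangular frame to a unit vector.

    Assumed convention: columns span longitude -180..+180 degrees from left to
    right, and rows span latitude +90 (top) to -90 (bottom) degrees.
    """
    lon = ((u + 0.5) / width) * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - ((v + 0.5) / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)
```

Because longitude and latitude vary linearly with column and row, the mapping needs no per-camera calibration at playback time, which is one reason the format is convenient for texturing a spherical mesh.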
A user can be provided with the impression that he or she is positioned at the centre of the sphere looking outward towards the captured scenes, in a manner analogous to the position of the cameras during video capture. In addition, the user may be provided with the ability to adjust his or her perspective and field of view, such as by using a mouse on a desktop-style system, a touch-display on a typical smartphone, or actual physical movement using virtual reality headgear (Head Mounted Display, or HMD), in order to face any part of the 360×180 video that is being played back. In this way, the user can “look around” and in any direction will see the respective portions of the film unfolding as it is played back just as one can look around in reality.
Processes for producing digital video from raw content such as that captured by a 360-video capture system are well understood. Specialty software tools are used to stitch together the content from the different camera angles to produce the raw video. Then, the raw video can be edited and spliced with other video, graphic overlays and the like, on a computer workstation using software tools. When the author/editor is satisfied with the content, the digital video is considered “locked,” and post-production tools can be used to convert the locked digital video into a form suitable for transmission, playback and storage using various media players, devices and the like. For example, it is typical to compress raw video using a codec such as H.264 or H.265 and to store the result in a container format such as MP4, so that the overall file in which the video is contained is smaller and more manageable for storage and transmission. Encoders are sets of hardware and software that receive the original raw digital video content as input and that output an encoded digital video file. Transcoders are sets of hardware and software that receive an encoded video file and re-encode the file into a different encoded format. Decoders are sets of hardware and software that receive an encoded video file and extract each frame as pixel data, so that the pixel data can be written to a memory buffer and later transferred to a frame buffer for subsequent display by a display device. Together, encoders/transcoders and decoders are typically referred to as codecs.
There are challenges to producing content that can be enjoyed on a wide variety of computing devices, systems and platforms. For example, numerous codecs are available. Because of the nature of the algorithms used in codecs and the way decoded frames are buffered prior to display, codecs do not generally enable a playback device such as a media player to know exactly which frame has been decoded. Instead, some codecs produce and expose an elapsed time from which an approximation as to an actual frame number can be derived. As such, due to this nature of compression algorithms and buffering, the playback time indicated by a media player's decoder for a particular frame may not, for example, coincide with the actual playback time that might be indicated by another media player's decoder for the same frame. For example, there may be a disparity on the order of 5 to 10 frames on a 30 frame per second (fps) playback.
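The approximation described above, and the size of the resulting disparity, can be sketched in Python as follows. The function names are hypothetical and the default of 30 fps simply follows the example in the text.

```python
import math

def estimate_frame(elapsed_seconds, fps=30.0):
    """Approximate the displayed frame index from decoder-reported elapsed time."""
    return math.floor(elapsed_seconds * fps)

def frames_to_ms(frame_count, fps=30.0):
    """Duration, in milliseconds, spanned by frame_count frames."""
    return frame_count * 1000.0 / fps
```

At 30 fps, a disparity of 5 to 10 frames corresponds to roughly 167 to 333 milliseconds of playback.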
When authoring digital videos, an author/editor may wish to add certain events which are to be triggered by a media player during playback. Parameters specifying such events and their exact timing may be stored as metadata in a file associated with the digital video, and be identified according to frame number or playback time. For example, the author/editor may wish to trigger the media player to live-render and display a particular graphical overlay caption or subtitle, to play a sound and/or to launch an interactive event just as a particular frame is displayed. As another example, the author/editor may wish to trigger the media player to play a separate and independent video overlaid atop a 360 video, just as a particular frame of the 360 video is displayed. Such graphical elements, auditory cues, interactive events and videos independent from the 360 video can be very useful to content authors. This is the case because such additional events do not have to be “baked into” the main digital video itself. They can be rendered independently and/or in parallel with the main video, greatly expanding the possible real time interactive nature of the experience during the viewing of such video. It also allows such additional events to be fine-tuned in subsequent productions without requiring re-working of the digital video itself.
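By way of a minimal sketch, event metadata of this kind might be held as a list of records keyed by frame number and consulted as each frame is displayed. The field names ("frame", "action", "params"), the action names and the file names below are all hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict

# Hypothetical metadata: each event names the frame that triggers it.
EVENTS = [
    {"frame": 120, "action": "show_caption", "params": {"text": "Hello"}},
    {"frame": 120, "action": "play_sound", "params": {"file": "chime.wav"}},
    {"frame": 450, "action": "play_overlay_video", "params": {"file": "inset.mp4"}},
]

class EventSchedule:
    """Index events by frame number so lookup at playback time is cheap."""

    def __init__(self, events):
        self._by_frame = defaultdict(list)
        for event in events:
            self._by_frame[event["frame"]].append(event)

    def due(self, frame_number):
        """Return the events to trigger when frame_number is displayed."""
        return list(self._by_frame.get(frame_number, []))
```

Because the schedule lives outside the video file, an author could change the caption text or move an event to a different frame without re-encoding the video.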
Frame-accurate event-triggering is crucial for certain kinds of events. As an example, for digital video that switches between 360-video and traditional non-spherical video segments, a media player must know precisely at which frame to switch from displaying video frames as flat projections to displaying the video frames as spherical projections, and vice versa. When a frame and its projection are not matched, the experience for a user will become jarring and difficult to perceive as realistic. While some media players can extract and provide frame sequence data that may accompany digital video of certain formats, how this may be done, if at all, is not universal across all of the various media players and video formats. As such, content producers are left with a deficiency of control over how the content they have created will ultimately be experienced by a user. Even an approximation of the frame based on playback time as measured by the media player can produce poor results. When a projection switch occurs even a few frames earlier or later than the frame at which it should have happened, the short series of mismatched frames can be jarring enough to disturb the user.
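The effect of an inexact frame estimate on projection switching can be illustrated with the following Python sketch. The switch points and the function names are hypothetical; the sketch simply counts how many frames would be drawn with the wrong projection if the player's idea of the current frame were off by a fixed number of frames.

```python
import bisect

# Hypothetical switch points: (first frame of segment, projection to use).
SWITCHES = [(0, "flat"), (900, "spherical"), (2700, "flat")]

def projection_for(frame, switches=SWITCHES):
    """Return the projection of the segment that contains the given frame."""
    starts = [start for start, _ in switches]
    index = max(bisect.bisect_right(starts, frame) - 1, 0)
    return switches[index][1]

def mismatched_frames(true_frames, offset):
    """Count frames drawn with the wrong projection when the estimate is off by offset."""
    return sum(
        1 for f in true_frames
        if projection_for(f) != projection_for(f + offset)
    )
```

With these assumed switch points, an estimate that runs five frames ahead causes the five frames before frame 900 to be drawn as spherical projections when they should be flat.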
It has been proposed to embed visual timecodes as non-image data into the frames themselves to uniquely identify the frames. With this approach, upon decoding and instead of relying on playback time, the media player can process the frames to read the visual timecodes and thereby be aware of exactly which frame has been decoded. However, since such visual timecodes are actually integrated as part of the digital video itself, absent some additional processing the media player will naturally display them along with the rest of the digital video content such that the user will see them. For traditional flat film, where visual timecodes may be positioned in the uppermost or lowermost region of the frame, such additional processing may include the frame being cropped and/or stretched to remove the visual timecode, prior to being inserted into a frame buffer for display. Such cropping or minor stretching of a frame of flat film does not noticeably distort the frame, and the user is typically unaware that anything has been done.
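One way such a visual timecode could work is sketched below in Python. The text does not specify an encoding, so the layout here is purely an assumption: the frame number is written as 16 black or white blocks along the top edge of a greyscale frame, read back by thresholding the centre of each block, and then cropped away as would be done for flat film. The constants and function names are hypothetical.

```python
BITS = 16   # assumed timecode width in bits
BLOCK = 4   # assumed size, in pixels, of the square block that carries one bit

def embed_timecode(frame, frame_number):
    """Overwrite the top-left strip of a greyscale frame with one block per bit."""
    for bit in range(BITS):
        value = 255 if (frame_number >> (BITS - 1 - bit)) & 1 else 0
        for y in range(BLOCK):
            for x in range(bit * BLOCK, (bit + 1) * BLOCK):
                frame[y][x] = value
    return frame

def read_timecode(frame):
    """Recover the frame number by thresholding the centre pixel of each block."""
    number = 0
    for bit in range(BITS):
        centre_x = bit * BLOCK + BLOCK // 2
        centre_y = BLOCK // 2
        number = (number << 1) | (1 if frame[centre_y][centre_x] >= 128 else 0)
    return number

def crop_timecode(frame):
    """Drop the rows that carry the timecode before the frame is displayed."""
    return frame[BLOCK:]
```

Using multi-pixel blocks and a mid-grey threshold is a design choice intended to tolerate the small pixel-value changes that lossy compression introduces; a frame here is simply a list of rows of greyscale values at least BITS × BLOCK pixels wide.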
However, it is problematic to attempt simply cropping or stretching an equirectangular frame to remove a visual timecode in the same way. Such a modification to an equirectangular frame would manifest itself, upon mapping to a sphere, as significant and noticeable distortion. Because of this, it is difficult to hide a visual timecode in such a frame. The visual timecode may be positioned in the equirectangular frames such that it will be mapped to a position near to the zenith (top) or nadir (bottom) of the sphere. This will cause it to be squeezed and distorted thereby reducing the likelihood that it will be noticed. However, it will still remain as an undesirable and disturbing visual artifact.
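The squeezing near the zenith and nadir follows from the geometry of the format: every row of an equirectangular frame contains the same number of pixels, but a row at latitude φ is mapped onto a circle of the unit sphere whose circumference is 2π·cos φ. A short Python sketch of the resulting horizontal scale factor is given below; the function name and the row-to-latitude convention (top row nearest the zenith) are assumptions for illustration.

```python
import math

def row_scale(v, height):
    """Horizontal scale, relative to the equator, of row v once mapped to a sphere."""
    latitude = math.pi / 2.0 - ((v + 0.5) / height) * math.pi
    return math.cos(latitude)
```

For a frame 2048 rows high, the rows at the middle map at nearly full width, while the top row is compressed to less than a thousandth of its equatorial width, which is why content placed there is squeezed but not eliminated.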
When viewing 360 video on a desktop computer, a mobile device or the like, it may be acceptable to programmatically limit the degrees of freedom available to the user, so that the user is simply unable to adjust his or her field of view to encompass the area where the visual timecode is positioned. However, particularly in the virtual reality context, where the user using an HMD can adjust his or her field of view by physically moving his or her head or body, it is not possible to limit the user's movement physically. Furthermore, in the virtual reality context, masking out the visual timecode would create another visual artifact and would prevent showing the 360-degree sphere in its entirety, greatly detracting from the immersive quality of the medium.
It has been proposed to automatically replace the pixels of a visual timecode, once it has been extracted from the frame, with pixels that better match the surrounding, non-timecode content. However, algorithms for detecting surrounding pixel colours and their intensities for overwriting the timecode pixels generally can produce noticeable visual artifacts. Such algorithms also tend to be processor-intensive, consuming valuable computing resources thereby having a detrimental effect on the playback frame rate. This can be particularly problematic for applications in which a higher frame rate is important for user experience, such as in virtual reality.
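To give a sense of what such pixel replacement involves, the following Python sketch shows a deliberately naive stand-in, not any of the proposed algorithms themselves: it overwrites a rectangular timecode region with copies of the pixel row directly beneath it. The function name and the choice of source row are assumptions made for illustration.

```python
def fill_from_below(frame, top, left, height, width):
    """Overwrite a rectangular region with the pixel row directly beneath it.

    Assumes the frame is a list of rows and that at least one row exists
    below the region (top + height < len(frame)).
    """
    source_row = frame[top + height]
    for y in range(top, top + height):
        for x in range(left, left + width):
            frame[y][x] = source_row[x]
    return frame
```

Even this simple approach writes height × width pixels on every frame and produces visible vertical streaks wherever the content beneath the region is not uniform; approaches that analyse the surrounding colours and intensities more carefully add further per-frame cost.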