The present invention relates to an apparatus and method for synchronizing the decoding and display (e.g., presentation) of a stereoscopic video sequence. In particular, a system for determining a presentation time stamp and decoding time stamp of an enhancement layer is presented, in addition to a corresponding optimal bitstream transmission ordering which minimizes the required decoder input buffer size.
Digital technology has revolutionized the delivery of video and audio services to consumers since it can deliver signals of much higher quality than analog techniques and provide additional features that were previously unavailable. Digital systems are particularly advantageous for signals that are broadcast via a cable television network or by satellite to cable television affiliates and/or directly to home satellite television receivers. In such systems, a subscriber receives the digital data stream via a receiver/descrambler that decompresses and decodes the data in order to reconstruct the original video and audio signals. The digital receiver includes a microcomputer and memory storage elements for use in this process.
The need to provide low cost receivers while still providing high quality video and audio requires that the amount of data which is processed be limited. Moreover, the available bandwidth for the transmission of the digital signal may also be limited by physical constraints, existing communication protocols, and governmental regulations. Accordingly, various intra-frame data compression schemes have been developed that take advantage of the spatial correlation among adjacent pixels in a particular video picture (e.g., frame).
Moreover, inter-frame compression schemes take advantage of temporal correlations between corresponding regions of successive frames by using motion compensation data and block-matching motion estimation algorithms. In this case, a motion vector is determined for each block in a current picture of an image by identifying a block in a previous picture which most closely resembles the particular current block. The entire current picture can then be reconstructed at a decoder by sending data which represents the difference between the corresponding block pairs, together with the motion vectors that are required to identify the corresponding pairs. Block-matching motion estimation algorithms are particularly effective when combined with block-based spatial compression techniques such as the discrete cosine transform (DCT).
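The block-matching step described above can be illustrated with a minimal sketch. It assumes a full search over a small window and a sum-of-absolute-differences (SAD) matching cost; neither the cost measure nor the search strategy is specified in this text, and the function names (sad, best_motion_vector, residual) are hypothetical.

```python
def sad(cur, ref, cy, cx, ry, rx, n):
    """Sum of absolute differences between the n x n block of `cur`
    at (cy, cx) and the n x n block of `ref` at (ry, rx).
    SAD is an assumed matching cost, chosen only for illustration."""
    return sum(
        abs(cur[cy + i][cx + j] - ref[ry + i][rx + j])
        for i in range(n)
        for j in range(n)
    )


def best_motion_vector(cur, ref, cy, cx, n, search):
    """Full search over offsets within +/- `search` pixels.
    Returns ((dy, dx), cost) for the best-matching block in the
    previous picture `ref`, or None if no candidate fits inside it."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ry, rx = cy + dy, cx + dx
            # Skip candidates that fall outside the reference picture.
            if ry < 0 or rx < 0 or ry + n > h or rx + n > w:
                continue
            cost = sad(cur, ref, cy, cx, ry, rx, n)
            if best is None or cost < best[1]:
                best = ((dy, dx), cost)
    return best


def residual(cur, ref, cy, cx, mv, n):
    """Difference block that would be sent along with the motion vector."""
    dy, dx = mv
    return [
        [cur[cy + i][cx + j] - ref[cy + dy + i][cx + dx + j] for j in range(n)]
        for i in range(n)
    ]
```

A decoder holding the previous picture can rebuild the current block by adding the residual to the reference block that the motion vector points to.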
Additionally, there has been increasing interest in proposed stereoscopic video transmission formats such as the Moving Picture Experts Group (MPEG) MPEG-2 Multi-view Profile (MVP) system, described in document ISO/IEC JTC1/SC29/WG11 N1088, entitled "Proposed Draft Amendment No. 3 to 13818-2 (Multi-view Profile)," November, 1995, incorporated herein by reference. Stereoscopic video provides slightly offset views of the same image to produce a combined image with greater depth of field, thereby creating a three-dimensional (3-D) effect. In such a system, dual cameras may be positioned about two inches apart to record an event on two separate video signals. The spacing of the cameras approximates the distance between left and right human eyes. Moreover, with some stereoscopic video camcorders, the two lenses are built into one camcorder head and therefore move in synchronism, for example, when panning across an image. The two video signals can be transmitted and recombined at a receiver to produce an image with a depth of field that corresponds to normal human vision. Other special effects can also be provided.
The MPEG MVP system includes two video layers which are transmitted in a multiplexed signal. First, a base (e.g., lower) layer represents a left view of a three dimensional object. Second, an enhancement (e.g., auxiliary, or upper) layer represents a right view of the object. Since the right and left views are of the same object and are offset only slightly relative to each other, there will usually be a large degree of correlation between the video images of the base and enhancement layers. This correlation can be used to compress the enhancement layer data relative to the base layer, thereby reducing the amount of data that needs to be transmitted in the enhancement layer to maintain a given image quality. The image quality generally corresponds to the quantization level of the video data.
The MPEG MVP system includes three types of video pictures; specifically, the intra-coded picture (I-picture), predictive-coded picture (P-picture), and bi-directionally predictive-coded picture (B-picture). Furthermore, while the base layer accommodates either frame or field structure video sequences, the enhancement layer accommodates only frame structure. An I-picture completely describes a single video picture without reference to any other picture. For improved error concealment, motion vectors can be included with an I-picture. An error in an I-picture has the potential for greater impact on the displayed video since both P-pictures and B-pictures in the base layer are predicted from I-pictures. Moreover, pictures in the enhancement layer can be predicted from pictures in the base layer in a cross-layer prediction process known as disparity prediction. Prediction from one frame to another within a layer is known as temporal prediction.
In the base layer, P-pictures are predicted based on previous I- or P-pictures. The reference is from an earlier I- or P-picture to a future P-picture and is known as forward prediction. B-pictures are predicted from the closest earlier I- or P-picture and the closest later I- or P-picture.
In the enhancement layer, a P-picture can be predicted from (a) the most recently decoded picture in the enhancement layer, (b) the most recent base layer picture, in display order, or (c) the next lower layer picture, in display order. Case (b) is usually used when the most recent base layer picture, in display order, is an I-picture. Moreover, a B-picture in the enhancement layer can be predicted using (d) the most recent decoded enhancement layer picture for forward prediction, and the most recent lower layer picture, in display order, for backward prediction, (e) the most recent decoded enhancement layer picture for forward prediction, and the next lower layer picture, in display order, for backward prediction, or (f) the most recent lower layer picture, in display order, for forward prediction, and the next lower layer picture, in display order, for backward prediction. When the most recent lower layer picture, in display order, is an I-picture, only that I-picture will be used for predictive coding (i.e., there will be no forward prediction).
Note that only prediction modes (a), (b) and (d) are encompassed within the MPEG MVP system. The MVP system is a subset of MPEG temporal scalability coding, which encompasses each of modes (a)-(f).
In one optional configuration, the enhancement layer has only P and B pictures, but no I pictures. The reference to a future picture (i.e., one that has not yet been displayed) is called backward prediction. Note that no backward prediction occurs within the enhancement layer. Accordingly, enhancement layer pictures are transmitted in display order. There are situations where backward prediction is very useful in increasing the compression rate. For example, in a scene in which a door opens, the current picture may predict what is behind the door based upon a future picture in which the door is already open.
B-pictures yield the most compression but also incorporate the most error. To eliminate error propagation, B-pictures may never be predicted from other B-pictures in the base layer. P-pictures yield less error and less compression. I-pictures yield the least compression, but are able to provide random access.
Thus, in the base layer, to decode P-pictures, the previous I-picture or P-picture must be available. Similarly, to decode B-pictures, the previous P- or I-picture and the future P- or I-picture must be available. Consequently, the video pictures are encoded and transmitted in dependency order, such that all pictures used for prediction are coded before the pictures predicted therefrom. When the encoded signal is received at a decoder, the video pictures are decoded and re-ordered for display. Accordingly, temporary storage elements are required to buffer the data before display. However, the need for a relatively large decoder input buffer increases the cost of manufacturing the decoder. This is undesirable since the decoders are mass-marketed items that must be produced at the lowest possible cost.
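The re-ordering from display order to dependency order in the base layer can be sketched as follows. This is an illustrative example only: the function name transmission_order and the picture sequence are hypothetical, and the sketch assumes each B-picture depends on the nearest earlier and nearest later I- or P-picture, as described above.

```python
def transmission_order(display_types):
    """Map a base-layer sequence given in display order (for example
    'IBBPBBP') to dependency (transmission) order.

    Each B-picture is held back until the next I- or P-picture, its
    future reference, has been emitted. Returns a list of
    (display_index, picture_type) tuples.
    """
    out = []
    pending_b = []
    for idx, ptype in enumerate(display_types):
        if ptype == "B":
            # Cannot be sent yet: its future reference is not coded.
            pending_b.append((idx, ptype))
        else:
            # I- or P-picture: send it, then the B-pictures that
            # were waiting on it.
            out.append((idx, ptype))
            out.extend(pending_b)
            pending_b = []
    # Assumption: trailing B-pictures with no later reference are
    # simply appended; a real sequence would end on an I- or P-picture.
    out.extend(pending_b)
    return out


order = transmission_order("IBBPBBP")
print("".join(ptype for _, ptype in order))  # IPBBPBB
print([idx for idx, _ in order])             # [0, 3, 1, 2, 6, 4, 5]
```

In this example the P-picture with display index 3 is transmitted before the B-pictures with display indices 1 and 2, so the decoder must hold decoded pictures until their display time arrives.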
Additionally, there is a need to synchronize the decoding and presentation of the enhancement and base layer video sequences. Synchronization of the decoding and presentation process for stereoscopic video is a particularly important aspect of MVP. Since it is inherent in stereoscopic video that two views are tightly coupled to one another, loss of presentation or display synchronization could cause many problems for the viewer, such as eye strain, headaches, and so forth.
Moreover, the problems in dealing with this issue for digital compressed bitstreams are different from those for uncompressed bitstreams or analog signals such as those conforming to the NTSC or PAL standards. For example, with NTSC or PAL signals, the pictures are transmitted in a synchronous manner, so that a clock signal can be derived directly from the picture synch. In this case, synchronization of two views can be achieved easily by using the picture synch.
However, in a digital compressed stereoscopic bitstream, the amount of data for each picture in each layer is variable, and depends on the bit rate, picture coding types and complexity of the scene. Thus, decoding and presentation timing cannot be derived directly from the start of picture data. That is, unlike analog video transmissions, there is no natural concept of synch pulses in a digital compressed bitstream.
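A short numerical sketch illustrates why the start of picture data is not a usable timing reference. It assumes a constant-rate channel, and the picture sizes and bit rate are hypothetical values chosen only for illustration.

```python
def picture_start_times(sizes_bits, bit_rate):
    """Time (in seconds) at which the first bit of each picture
    arrives, assuming pictures are sent back to back over a channel
    with a constant rate of `bit_rate` bits per second."""
    times = []
    elapsed_bits = 0
    for size in sizes_bits:
        times.append(elapsed_bits / bit_rate)
        elapsed_bits += size
    return times


# Hypothetical coded sizes: a large I-picture followed by smaller pictures.
sizes = [400_000, 100_000, 100_000, 200_000]
starts = picture_start_times(sizes, 4_000_000)

# Gaps between successive picture starts.
intervals = [b - a for a, b in zip(starts, starts[1:])]
print(starts)
print(intervals)
```

Although the pictures would be displayed at a uniform rate, the intervals between their arrival times are unequal, so the timing must be conveyed by other means, such as time stamps.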
Accordingly, it would be advantageous to provide a system for synchronizing the decoding and presentation of a stereoscopic video sequence. The system should also be compatible with decoders that decode pictures either sequentially (e.g., one picture at a time) or in parallel (e.g., two pictures at a time). Moreover, the system should provide an optimal picture transmission order that minimizes the required decoder input buffer size. The present invention provides a system having the above and other advantages.