As processing and storage technologies continue to improve, many personal computing systems (e.g., personal computers, set-top boxes, etc.) now have the capacity to receive, process and render multimedia objects. Such objects have multimedia content that includes a combination of audio, graphical, and/or video content. The multimedia content may be delivered to the computing system in any of a number of ways including, for example, on a compact disk read-only memory (CD-ROM), on a digital versatile disk read-only memory (DVD-ROM), via a communicatively coupled data network (e.g., Internet), and the like.
Due to the amount of data required to accurately represent such multimedia content, it is typically delivered to the computing system in an encoded, compressed form. To reproduce the original content for presentation, the multimedia content must be decompressed and decoded before it is presented. Here, presenting includes communicating the multimedia content to a display and/or audio device.
A number of multimedia standards have been developed that define the format and meaning of encoded multimedia content for purposes of distribution. Organizations such as the Moving Picture Experts Group (MPEG), under the auspices of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC), and the Video Coding Experts Group (VCEG), under the auspices of the International Telecommunication Union (ITU), have developed a number of multimedia coding standards (e.g., MPEG-1, MPEG-2, MPEG-4, H.261, H.263, and the like).
Simplistically speaking, the encoding process removes spatial and temporal redundancies from the media content, thereby reducing the amount of data needed to represent the media content and, as a result, reducing the burden of storing and/or transmitting such media content.
Conversely, the decoding process is, simplistically speaking, typically the inverse of the encoding process. Examples of decoding steps include entropy decoding, motion-compensated prediction, inverse quantization, inverse transformation, and addition of the inverse-transformed results to the prediction.
Rendering typically includes an additional step of digital-to-analog conversion (with filtering), which generates an approximate representation of the original analog media signal.
Herein, a “media stream” is a multimedia object (containing audio and/or visual content) that is compressed and encoded in accordance with generally available mechanisms for doing so. Furthermore, such a media stream is intended to be decoded and rendered in accordance with generally available mechanisms for doing so.
Without loss of generality, the same techniques can be applied to any media stream with a similar structure that reduces temporal or spatial redundancies. For example, many audio compression formats have keyframes followed by modification data used to regenerate an approximation of the original uncompressed stream.
There are many different video-stream data formats, for example H.263, MPEG-1, MPEG-2, MPEG-4 Visual, H.264/AVC, and DV.
There are many different audio-stream data formats, for example DTS audio and MLP audio.
MPEG-2/H.262
The predominant digital video compression and transmission formats are from a family called block-based motion-compensated hybrid video coders, as typified by the ISO/IEC MPEG-X (Moving Picture Experts Group) and ITU-T VCEG H.26X (Video Coding Experts Group) standards. This family of standards is used for coding audio-visual information (e.g., movies, video, music, and such) in a digital compressed format.
For the convenience of explanation, the MPEG-2 video stream (also known as an H.262 video stream) is generally discussed and described herein, as it has a structure that is typical of conventional video coding approaches. However, those who are skilled in the art understand and appreciate that other such digital video compression and transmission formats exist and may be used.
The MPEG-2 format may be referred to as a generally “forward decoding” format. An example representation of an MPEG format is shown in FIG. 1 generally at 10. Each video sequence is composed of a sequence of units typically called Groups of Pictures (or “GOPs”). A GOP is composed of a sequence of pictures or frames. The GOP data is compressed as a sequence of I-, P- and B-frames where:
An I-frame (i.e., intra-frame) is an independent starting image (compressed in a format similar to a JPEG image). An I-frame or “key frame” (such as I-frame 12) is encoded as a single image, with no reference to any past or future frames. An I-frame is considered a “reference frame” in MPEG-2, as its content can be used in the decoding process for one subsequent P-frame or multiple subsequent B-frames in decoding order.
A P-frame (i.e., forward predicted frame) is computed by moving around rectangles (called macroblocks) from the previous I- or P-frame and then (if so indicated by the encoder) applying a “correction” called a residual. Subsequent P-frames (such as P-frame 18) are encoded relative to the past reference frame (such as a previous I- or P-frame). P-frames can also be considered “delta frames” in that they contain changes relative to their reference frame. A P-frame is also considered a “reference frame” in MPEG-2, as its content can be used in the decoding process for one subsequent P-frame or multiple subsequent B-frames in decoding order.
Zero or more B-frames (i.e., bi-directional predicted frames, such as frames 14 and 16) are formed by a combination of rectangles from the adjacent I- or P-frames, followed (if so indicated by the encoder) by a correction factor. Several B-frames may lie between a pair of reference frames (frames that are either I- or P-frames). In MPEG-2, B-frames are not called reference frames, as they are not used as references for the decoding of subsequent frames in decoding order.
The GOP structure is intended to assist random access into the stream. A GOP is typically an independently decodable unit that may be of any size as long as it begins with an I-frame.
One problem associated with the MPEG-2 format pertains to being able to play back the data in the reverse of the ordinary display order. Playing the data forward is typically not a problem because the format itself is forward decoding—meaning that one must typically decode the I-frame first and then move on to the other frames in the GOP. Playing back the data in reverse, however, is more challenging because the GOPs inherently resist a straightforward backward-decoding.
Similar challenges exist for audio data that is compressed as a starting vector of values (i.e., one per audio channel) followed by delta frames.
DVD
Normally, when images are recorded on a disk, such as a DVD, the content is actually broken into small units covering a pre-determined time period (typically approximately ½-second units, or video object basic units (“VOBUs”)). The advantage of this format is that when one plays the video, one can progress through the video units one by one. If one wants to jump to an arbitrary piece of video, one can simply jump to the video unit of interest and the audio and video will be synchronized. The location at which all streams are synchronized is referred to as a “clean point”. Accordingly, when the video and audio units are compressed, they are compressed in a unit that is to be rendered at the exact same time; that is, there is no skew between the audio and video.
All references to I-frames, when discussed within the MPEG-2 context, may be extended to key-frames in other data formats. The term I-frame is synonymous with key-frame when discussed outside of the MPEG-2 context.
Exemplary Media-Stream Rendering System
FIG. 2 illustrates an exemplary system 200 that can render data from a media stream source, such as a DVD. System 200 includes an application 202 that communicates with a source component 204 that reads data off of a DVD 206. The data that is read off of the DVD includes audio and video data that has been encoded and multiplexed together.
As the source reads the data off of the DVD, it retrieves timestamps from the data packets, which are then used to synchronize and schedule the packets for rendering. The packets are then provided to a demultiplexer (or “demux”) 208 which splits the packets into different constituent portions—audio, video and, if present, subpicture packets.
The packets are then provided by the demultiplexer to an associated decoder, such as video decoder 210 (for decoding video packets), audio decoder 212 (for decoding audio packets) and subpicture decoder 214 (for decoding subpicture packets). Each one of the packets has associated timing information, which defines when the contents of the packet are supposed to be rendered. These packets may be a GOP (as described above with regard to MPEG).
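The routing step performed by the demultiplexer can be illustrated with a minimal sketch. The packet layout below (a dictionary with hypothetical `stream`, `timestamp` and `payload` keys) is an assumption made for illustration only and does not reflect the actual DVD packet format.

```python
from collections import defaultdict

def demultiplex(packets):
    """Split a multiplexed packet list into per-stream lists.

    Each packet is assumed (hypothetically) to carry a "stream" label
    such as "video", "audio" or "subpicture". Order within each stream
    is preserved so timestamps stay monotonic per stream.
    """
    streams = defaultdict(list)
    for packet in packets:
        streams[packet["stream"]].append(packet)
    return dict(streams)

# Hypothetical multiplexed input: audio and video packets interleaved.
muxed = [
    {"stream": "video", "timestamp": 0, "payload": b"v0"},
    {"stream": "audio", "timestamp": 0, "payload": b"a0"},
    {"stream": "video", "timestamp": 40, "payload": b"v1"},
    {"stream": "subpicture", "timestamp": 40, "payload": b"s0"},
    {"stream": "audio", "timestamp": 32, "payload": b"a1"},
]

by_stream = demultiplex(muxed)
for name, stream_packets in by_stream.items():
    print(name, [p["timestamp"] for p in stream_packets])
```

Each per-stream list would then be handed to the matching decoder, with the timestamps carried along unchanged.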
The decoders then decompress their associated packets and send the individual data samples or packets (including the packets' timestamps) to the appropriate renderers, such as video renderer 216 and audio renderer 218. Each of these decoders typically has a cache for temporarily storing decoded packets (or portions thereof). Typically, a cache is at least large enough to accommodate the decoding of frames that reference data from other frames.
System 200 also typically includes a global clock 220 that is used by the various renderers to ascertain when to render certain data samples whose timestamps coincide with a time indicated by the global clock.
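As a rough illustration of how a renderer might use the global clock, the sketch below queues decoded samples by timestamp and releases those whose timestamps have been reached. The class name `Renderer` and its methods are hypothetical and are not taken from system 200; the priority queue is just one plausible way to hold pending samples.

```python
import heapq

class Renderer:
    """Holds decoded samples until the global clock reaches their timestamp."""

    def __init__(self):
        self._queue = []   # min-heap ordered by timestamp
        self._seq = 0      # tie-breaker so samples themselves are never compared

    def submit(self, timestamp, sample):
        heapq.heappush(self._queue, (timestamp, self._seq, sample))
        self._seq += 1

    def render_due(self, clock_time):
        """Return, in timestamp order, every sample due at or before clock_time."""
        due = []
        while self._queue and self._queue[0][0] <= clock_time:
            _, _, sample = heapq.heappop(self._queue)
            due.append(sample)
        return due

video = Renderer()
video.submit(80, "frame C")
video.submit(0, "frame A")
video.submit(40, "frame B")

first_batch = video.render_due(clock_time=40)   # frames A and B are due
second_batch = video.render_due(clock_time=100) # frame C is due
print(first_batch, second_batch)
```

Because audio and video renderers consult the same clock, samples with equal timestamps are released together, which is what keeps the streams in sync.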
Reverse Playback
Assume now that a user indicates, via application 202, that she wishes to view the content in reverse order. This may be called “reverse playback,” “backwards play,” “rewind,” “backwards scan,” “reverse trick play,” or “reverse scan.”
The frames of a GOP are designed to be decoded and presented in generally the same direction, which will be called the “forward” direction herein. That is the same direction in which the frames of the GOP are encoded. However, the specific order in which the frames of a GOP are encoded typically differs from the specific order of their presentation.
To decode a B-frame (such as frames 14 and 16 of FIG. 1), the previous I-/P-frame and the next I-/P-frame must already be present. For example, a GOP may be presented in the order I1B2B3P4B5P6, but it would be encoded in the order I1P4B2B3P6B5. Note that P4 must be decoded before B2 and B3 can be generated. Likewise, P4 must be decoded before P6 can be generated.
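The reordering in this example can be reproduced with a short sketch. It assumes a simplified closed GOP in which every B-frame depends only on the nearest reference frame on each side; the function name `decode_order` and the string frame labels are illustrative choices, not part of any standard.

```python
def decode_order(presentation):
    """Reorder frame labels from presentation order to decode order.

    A reference frame (I or P) is emitted as soon as it is seen; any
    B-frames that precede it in presentation order are held back and
    emitted right after it, because they need it as a reference.
    """
    order = []
    pending_b = []  # B-frames waiting for their following reference frame
    for frame in presentation:
        if frame[0] == "B":
            pending_b.append(frame)
        else:
            order.append(frame)
            order.extend(pending_b)
            pending_b = []
    # Trailing B-frames would reference a frame in the next GOP (an open
    # GOP); this sketch simply appends them unchanged.
    order.extend(pending_b)
    return order

gop = ["I1", "B2", "B3", "P4", "B5", "P6"]
print(decode_order(gop))  # ['I1', 'P4', 'B2', 'B3', 'P6', 'B5']
```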
Consequently, simply reversing the decoding order is insufficient to produce reverse playback of the GOP. In the above example, P6 depends on the previous P4 frame, which has not yet been decoded if decoding occurs backwards. Furthermore, P4 depends on the previous I1 frame, which also has not yet been decoded if decoding occurs backwards. Further still, the B-frames depend on one or more frames that have not yet been decoded when decoding in reverse.
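The dependency chains described above can be made explicit with a small sketch that derives, for each frame of the example GOP, the full set of frames that must be decoded first. It uses the same simplified model as the example (a P-frame references the previous reference frame; a B-frame references the reference frames on either side); the helper names `references` and `required` are hypothetical.

```python
def references(presentation):
    """Map each frame label to the frames it directly references."""
    refs = {}
    last_ref = None   # most recent I- or P-frame in presentation order
    pending_b = []    # B-frames still waiting for their following reference
    for frame in presentation:
        kind = frame[0]
        if kind == "B":
            refs[frame] = [last_ref]
            pending_b.append(frame)
            continue
        refs[frame] = [] if kind == "I" else [last_ref]
        for b in pending_b:
            refs[b].append(frame)
        pending_b = []
        last_ref = frame
    return refs

def required(frame, refs):
    """All frames (direct and indirect) that must be decoded before `frame`."""
    needed = []
    stack = list(refs[frame])
    while stack:
        ref = stack.pop()
        if ref not in needed:
            needed.append(ref)
            stack.extend(refs[ref])
    return sorted(needed, key=lambda label: int(label[1:]))

refs = references(["I1", "B2", "B3", "P4", "B5", "P6"])
print(required("P6", refs))  # ['I1', 'P4']
print(required("B5", refs))  # ['I1', 'P4', 'P6']
```

Every frame near the end of the GOP pulls in a chain reaching back to I1, which is why reverse playback cannot start decoding from the last frame.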
In light of this, one conventional approach to the “reverse playback” of a GOP involves a reverse presentation of the frames in the GOP so that, before each frame presentation, all of the frames (or at least all of the reference frames) that precede the current frame are decoded. For example, for frames labeled ABCDE, this conventional approach decodes frames ABCDE, and then displays frame E. Next, it decodes frames ABCD, and then displays D. Next, it decodes ABC, and then displays C, and so forth.
This conventional approach is highly time-consuming and very inefficient, since the system decodes some of the same frames repeatedly. The computational requirements are very high, but memory requirements are relatively low.
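The cost of this repeated decoding can be counted with a minimal simulation. It assumes, as in the ABCDE example, that displaying a frame requires decoding every frame before it; the function name `naive_reverse_playback` is illustrative, and the "decode" is only a counter rather than real work.

```python
def naive_reverse_playback(frames):
    """Simulate re-decoding from the start of the GOP for each displayed frame.

    Returns the display order and the total number of frame decodes.
    """
    displayed = []
    decode_count = 0
    for last in range(len(frames), 0, -1):
        # Decode frames[0:last] again from the beginning of the GOP.
        decode_count += last
        displayed.append(frames[last - 1])
    return displayed, decode_count

shown, decodes = naive_reverse_playback(list("ABCDE"))
print(shown)    # ['E', 'D', 'C', 'B', 'A']
print(decodes)  # 15
```

For a GOP of n frames the count is n(n+1)/2, so five frames cost 15 decodes and a 15-frame GOP would cost 120, while only a handful of frames need to be held in memory at once.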
Another conventional approach to the “reverse playback” of a GOP is to decode forward as normal but temporarily store all of the decoded and now uncompressed frames of the GOP. Once the entire GOP is decoded and stored, the decoded frames are passed on to the renderers in reverse order. Therefore, stored output frames are simply displayed in reverse order.
With this approach, the computational requirements are relatively low, but memory requirements are relatively high. This conventional approach requires:
A large amount of cache memory to cache the uncompressed images of the decoded frames of the GOP. Even for common formats such as DVD or standard-definition TV, 10 MB or more of cache memory may be necessary. Higher HDTV resolutions typically require 50 MB of memory.
The cache memory is typically located in high-speed memory that is accessible to the decoder unit. The random rectangle extractions produce very cache-unfriendly access patterns. Typically, this memory will need to be in the expensive and limited local video memory (VRAM) on a video card.
Since the GOPs typically need to be pipelined using at least double caching to ensure a constant output speed, the memory requirements of the cache memory are typically doubled. While the overall decoding speed for a block in reverse is the same as for one played forwards at the same speed, the decoder will need to decode the block almost instantly, and then play it in reverse at normal speed. By having a fully decoded GOP in memory, the decoder unit has a GOP's worth of presentation time to decode the next GOP.
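The memory figures above can be checked with back-of-the-envelope arithmetic. The sketch below assumes uncompressed 8-bit 4:2:0 frames (12 bits per pixel), a 15-frame GOP, and resolutions of 720x480 and 1920x1080; none of these values are stated in the text, so they are illustrative assumptions only.

```python
def frame_bytes(width, height, bits_per_pixel=12):
    """Size of one uncompressed frame; 12 bits/pixel assumes 8-bit 4:2:0."""
    return width * height * bits_per_pixel // 8

def gop_cache_bytes(width, height, gop_length, buffers=1):
    """Memory needed to hold `buffers` fully decoded GOPs."""
    return frame_bytes(width, height) * gop_length * buffers

GOP_LENGTH = 15  # assumed GOP length, not taken from the text

for name, (w, h) in {"SD 720x480": (720, 480), "HD 1920x1080": (1920, 1080)}.items():
    single = gop_cache_bytes(w, h, GOP_LENGTH)
    double = gop_cache_bytes(w, h, GOP_LENGTH, buffers=2)
    print(f"{name}: {single / 1e6:.1f} MB single, {double / 1e6:.1f} MB double-buffered")
```

Under these assumptions one decoded GOP takes about 7.8 MB at standard definition and about 46.7 MB at 1920x1080, and roughly twice that with double caching, which is the same order of magnitude as the figures quoted above.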
Furthermore, post-processing of the content (such as de-interlacing, scaling, filtering, audio pitch correction, etc.) may require additional computational power and temporary processing caches.
In recognition of the difficulties involved in reverse playback, some conventional approaches simply decode and display only the key-frames (e.g., I-frames) from each GOP. This produces a jerky slide-show-like reverse presentation of still images appearing in the video stream. While straightforward, this simplistic approach is ungraceful and utterly fails at simulating reverse motion of the video content of the video stream.
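The key-frame-only approach is simple enough to sketch in a few lines. The example assumes each GOP is a list of frame labels that begins with its I-frame; the function name `keyframe_reverse` and the sample labels are hypothetical.

```python
def keyframe_reverse(gops):
    """Return only the leading I-frame of each GOP, last GOP first."""
    return [gop[0] for gop in reversed(gops)]

stream = [
    ["I1", "B2", "B3", "P4", "B5", "P6"],
    ["I7", "B8", "B9", "P10", "B11", "P12"],
    ["I13", "B14", "B15", "P16", "B17", "P18"],
]

shown = keyframe_reverse(stream)
total = sum(len(gop) for gop in stream)
print(shown, f"({len(shown)} of {total} frames displayed)")
```

In this example only 3 of 18 frames are ever displayed, which is why the result looks like a slide show rather than motion played backwards.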