The subject matter disclosed in this application relates to a personal video recorder and, in particular, to a method and apparatus for improving trick playback operation of a personal video recorder.
A television programming provider typically produces a continuous set of programming signals (also known as “network feeds”) for distribution by a service provider over a video transmission network to a wide audience of viewers. Conventionally, the programming signal begins as an uncompressed video sequence and at least one corresponding uncompressed audio sequence. The uncompressed video sequence consists of a series of sequential pictures and is assembled at a production facility. After assembly, the uncompressed video sequence is compressed by a video encoder, which encodes each picture and creates a corresponding coded picture (also known as an access unit). Any corresponding audio sequences are compressed by an audio encoder. The coded audio and video sequences are transmitted over the transmission network to customer premises at which the audio and video sequences for a selected program are decoded and presented to the viewer.
ISO/IEC 14496-10 (MPEG-4 part 10) Advanced Video Coding (AVC), commonly referred to as H.264/AVC, prescribes a standard for coding image data for transmission and storage. H.264/AVC defines a frame as containing an array of luma samples and two corresponding arrays of chroma samples and as being composed of two fields, a top field and a bottom field. A 16×16 block of luma samples and two corresponding blocks of chroma samples is referred to as a macroblock. A picture (a generic term for a field or a frame) is partitioned into slice groups and each slice group contains one or more slices, each of which in turn contains an integer number of macroblocks.
H.264/AVC defines an I slice, a P slice and a B slice. Each slice is encoded as blocks of transform coefficients. The definition of an I slice in H.264/AVC is generally accepted as meaning a slice that is decoded using prediction only from decoded samples within the same slice, i.e. an I slice is self-contained. Similarly, under H.264/AVC, a P slice is a slice that may be decoded using prediction from decoded samples within the same slice or from decoded samples of previously decoded reference pictures, using at most one motion vector and reference index to predict the sample values of each block. Thus, each block of transform coefficients in a P slice relies on at most one previously decoded reference picture. And under the generally accepted interpretation of H.264/AVC, a B slice is a slice that may be decoded using prediction from decoded samples within the same slice or from decoded samples of previously decoded reference pictures, using at most two motion vectors and reference indices to predict the sample values of each block. Thus, each block of transform coefficients in a B slice may rely on up to two reference pictures. Although any block of a P slice relies on at most one reference picture, different blocks in a given P slice may rely on different reference pictures. Similarly, although any block of a B slice may rely on at most two reference pictures, different blocks in a given B slice may rely on different reference pictures. Each slice has a slice header containing a slice_type syntax element, indicating whether the slice is an I slice, a P slice or a B slice, and a reference picture list indicating the pictures, if any, on which the slice relies for decoding.
A picture that contains only I slices may be referred to as an I picture. Similarly, a picture that contains only I slices and P slices may be referred to as a P picture and a picture that contains one or more B slices may be referred to as a B picture. H.264/AVC allows I, P and B pictures to be used as reference pictures.
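The naming convention just described can be sketched in a few lines of Python. This is only an illustration: the function name classify_picture and the use of the one-letter strings "I", "P" and "B" to stand for slice types are assumptions made for the example and are not part of H.264/AVC.

```python
def classify_picture(slice_types):
    """Name a picture 'I', 'P' or 'B' from the types of the slices it contains.

    slice_types is a hypothetical, simplified input: an iterable of the
    strings 'I', 'P' and 'B', one entry per slice of the picture.
    """
    types = set(slice_types)
    if not types or not types <= {"I", "P", "B"}:
        raise ValueError("expected a non-empty collection of 'I', 'P' or 'B'")
    if "B" in types:
        # One or more B slices makes the picture a B picture.
        return "B"
    if "P" in types:
        # Only I and P slices (with at least one P slice): a P picture.
        return "P"
    # Only I slices: an I picture.
    return "I"
```

For example, classify_picture(["I", "P", "P"]) yields "P", whereas a single B slice anywhere in the list yields "B".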
The image information in each picture is represented by data contained in one or more Network Abstraction Layer (NAL) units. There are two types of NAL units, namely Video Coding Layer (VCL) NAL units and non-VCL NAL units. The subject matter of this application relates to the VCL NAL units and accordingly subsequent references to NAL units should be interpreted as referring to VCL NAL units. A NAL unit is a packet having an integer number of bytes and contains the image information for one slice. The first byte of a NAL unit is a header that contains a two-bit syntax element nal_ref_idc. H.264/AVC specifies that nal_ref_idc is zero for a slice that is part of a non-reference picture and is not equal to zero for a slice of a reference picture, and that when nal_ref_idc is equal to zero for one slice of a particular picture, it shall be equal to zero for all slices of that picture. Thus, for any given picture, the nal_ref_idc values for all the slices are zero or all are non-zero. Accordingly, it is meaningful to refer to a picture for which nal_ref_idc=0 and to a picture for which nal_ref_idc≠0. Although H.264/AVC does not use the terms “reference slice” and “non-reference slice,” it is convenient to use these terms to refer, respectively, to a slice for which nal_ref_idc≠0 and a slice for which nal_ref_idc=0.
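By way of illustration, the following Python sketch extracts nal_ref_idc from the first byte of a NAL unit. It assumes the one-byte header layout of H.264/AVC, in which a one-bit forbidden_zero_bit is followed by the two-bit nal_ref_idc and a five-bit nal_unit_type. The function names parse_nal_header and is_reference_slice are hypothetical and chosen only for this example.

```python
def parse_nal_header(first_byte):
    """Split the first byte of a NAL unit into its three header fields.

    Returns (forbidden_zero_bit, nal_ref_idc, nal_unit_type).
    """
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("first_byte must be a single byte value (0..255)")
    forbidden_zero_bit = (first_byte >> 7) & 0x01  # most significant bit
    nal_ref_idc = (first_byte >> 5) & 0x03         # next two bits
    nal_unit_type = first_byte & 0x1F              # low five bits
    return forbidden_zero_bit, nal_ref_idc, nal_unit_type


def is_reference_slice(first_byte):
    """True for a 'reference slice' in the sense used above (nal_ref_idc != 0)."""
    _, nal_ref_idc, _ = parse_nal_header(first_byte)
    return nal_ref_idc != 0
```

For instance, a header byte of 0x65 carries nal_ref_idc=3, whereas a header byte of 0x01 carries nal_ref_idc=0 and therefore belongs to a non-reference slice.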
An AVC encoder receives an input frame for encoding and generates a bitstream representing, for each slice, the slice header and a set of transform coefficients. The mode of operation of a suitable AVC encoder is well understood by those skilled in the art. The bitstream generated by the AVC encoder is passed to a network abstraction layer, which forms the NAL units with the required syntax elements (including the nal_ref_idc bits) at the proper location (NAL header) in the NAL units.
Signals encoded using H.264/AVC are widely used for distributing television program material over various types of networks, including cable, IP TV and satellite, using various protocols for encapsulating the NAL units. For example, Internet protocol is used for IP TV whereas the MPEG-2 transport stream (MTS), as defined in ISO/IEC 13818-1, is used in cable and satellite networks as a robust means for delivering a signal encoded in accordance with H.264/AVC. An MTS that delivers just one program (video and associated audio) is referred to as a single program transport stream (SPTS) whereas an MTS that delivers more than one program is referred to as a multi-program transport stream (MPTS).
In the case of an MTS based distribution system, the network abstraction layer places the NAL units in a video packetized elementary stream (video PES) and supplies the video PES to an MPEG-2 transport stream (MTS) layer. The MTS layer includes a multiplexer that selects the video PES and an associated audio PES, and video and audio PESs of other programs, in the sequence that is required in order to form MPTS packets.
The data bits of the MPTS packets are used to encode a signal for transmission over a channel to a receiver, at which the data bits are recovered from the received signal and passed to an MTS layer. The MTS layer parses the bitstream, selects the video PES and audio PES of a desired program, and supplies the video PES packets to an AVC decoder and the audio PES packets to an audio decoder. The AVC decoder includes a network abstraction layer that extracts the NAL units from the video PES packets. The AVC decoder calculates a set of transform coefficients from the NAL unit bitstream and processes the transform coefficients and any motion vectors in inverse fashion to the operations in the AVC encoder to create a decoded frame corresponding to the input frame that was presented for encoding. The decoded frame is loaded into a video display buffer. Decoded frames are read from the display buffer at the proper constant rate and are presented for display at the output of the AVC decoder.
The AVC decoder includes a decoder buffer for temporarily storing reference slices so that they will be available for decoding later dependent slices. The nal_ref_idc value allows the AVC decoder to determine readily whether a particular slice should be stored (nal_ref_idc≠0) or may be discarded (nal_ref_idc=0).
It is conventional to organize a sequence of pictures as a GOP, or group of pictures, having a repeating structure of I, P and B pictures. In implementations of the MPEG-2 standard, the GOP may comprise 12 pictures in the sequence IBBPBBPBBPBB (or 15 pictures in the sequence IBBPBBPBBPBBPBB) whereas implementations of H.264/AVC may employ a hierarchical GOP structure in the form IBBBPBBBP etc. or IBBBBBBBPBBBBBBBP etc., depending on whether the decoder stores one or two B pictures. The picture at the beginning of the GOP is typically an instantaneous decoding refresh (IDR) picture, or an I or P picture. Accordingly, a GOP is usually self-contained: a picture in an earlier GOP usually does not serve as reference for a picture in a later GOP. H.264/AVC does not differentiate among reference slices based on the non-zero value of the nal_ref_idc syntax element. IP-based systems sometimes use the three available non-zero values of the nal_ref_idc syntax element to signal a priority level for the NAL units so that IP packets containing NAL units with nal_ref_idc=3 are handled with a higher priority than those containing NAL units with nal_ref_idc values equal to 2 or 1. It has also been proposed that a scalable video coding (SVC) extension of H.264/AVC should employ the non-zero values of the nal_ref_idc syntax element to distinguish among temporal levels of pictures. MTS-based applications do not currently use the non-zero values of the nal_ref_idc syntax element to differentiate the handling of NAL units for reference slices.
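The GOP patterns mentioned above follow a simple rule that can be sketched in Python: the first picture is an I picture, and each subsequent run of B pictures is followed by a P picture. The function gop_pattern and its parameters length and b_run are hypothetical names used only for this illustration. The sketch reproduces only the sequence of picture types and does not model which B pictures of a hierarchical GOP serve as reference pictures.

```python
def gop_pattern(length, b_run):
    """Return a string of picture types for one GOP.

    length: number of pictures in the GOP (hypothetical parameter).
    b_run:  number of consecutive B pictures between anchor (I or P)
            pictures (hypothetical parameter).
    """
    if length < 1 or b_run < 0:
        raise ValueError("length must be >= 1 and b_run must be >= 0")
    period = b_run + 1  # distance between successive anchor pictures
    types = []
    for i in range(length):
        if i == 0:
            types.append("I")   # GOP starts with an I picture
        elif i % period == 0:
            types.append("P")   # anchor picture after each run of B pictures
        else:
            types.append("B")
    return "".join(types)
```

Under these assumptions, gop_pattern(12, 2) gives IBBPBBPBBPBB and gop_pattern(9, 3) gives IBBBPBBBP.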
Many subscribers to cable and satellite television distribution services use PVRs (personal video recorders) to record television program material for later playback and viewing. In this case, the signals are stored in coded form, played back when desired, and decoded in a manner similar to that of the stand-alone decoder described above.
In normal operation of the PVR, the video and audio PES packets for a selected program are temporarily saved in a suitable memory device, such as a hard disk drive. When a saved program is selected for viewing by the user, the audio and video PES packets are read from the memory device. The video PES packets are supplied to the AVC decoder and the audio PES packets are supplied to an audio decoder, as described above. The AVC decoder supplies the decoded frames to the video display buffer and the frames are read from the display buffer for presentation to the viewer.
A typical PVR supports various trick playback modes, including fast forward (FF) and rapid reverse (RR), which allow a viewer to scan rapidly through material of little interest. The PVR accomplishes FF and RR playback by discarding pictures of the received sequence, i.e. by omitting pictures of the received sequence from the sequence that is decoded and supplied to the video display buffer. The PVR displays pictures at the normal constant rate (i.e. about 30 frames per second in the United States) but since pictures of the received sequence are discarded, the displayed image evolves at a greater speed than that in normal playback. For example, if the PVR discarded every other picture during FF playback, the displayed image would evolve at twice normal playback speed.
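The relationship between the number of discarded pictures and the apparent playback speed can be sketched as follows. Because the display rate is constant, the speed-up is the ratio of pictures received to pictures retained. The function names keep_every_nth and trick_play_speed are hypothetical, and the sketch ignores decoding dependencies between pictures.

```python
from fractions import Fraction


def keep_every_nth(pictures, n):
    """Retain every n-th picture of a sequence, starting with the first."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return pictures[::n]


def trick_play_speed(total_pictures, retained_pictures):
    """Apparent speed relative to normal playback at a constant display rate."""
    if not 0 < retained_pictures <= total_pictures:
        raise ValueError("retained_pictures must be in 1..total_pictures")
    return Fraction(total_pictures, retained_pictures)


# One second of material at about 30 pictures per second.
received = list(range(30))
retained = keep_every_nth(received, 2)           # discard every other picture
speed = trick_play_speed(len(received), len(retained))
```

With every other picture discarded, 15 of the 30 pictures are retained and the computed speed is 2, matching the example given above.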
In FF playback, the pictures that are retained in the sequence are presented in the same order as in normal playback. In the case of reverse playback, further manipulation is necessary so that pictures received later in the sequence will be available for presentation before pictures that were received earlier.
In order to minimize degradation of the displayed image during FF or RR playback, it is desirable that the discarded pictures not be reference pictures, since reference pictures are required to decode the dependent slices. This requirement can be applied readily to a signal encoded using MPEG-2 by discarding B pictures, because under MPEG-2, a B picture is not used as a reference picture. If all B pictures in an MPEG-2 sequence employing the standard GOP structure IBBPBB etc. were discarded, the FF or RR speed would be three times normal playback speed. In principle, this approach could be applied to a signal encoded using H.264/AVC, by discarding non-reference B pictures. In this case, no reference slices would be discarded and all dependent slices could be properly decoded. However, in a practical implementation of H.264/AVC the macroblocks in a B slice may refer to as many as five pictures including reference B pictures and therefore there may be relatively few pictures in a given H.264/AVC sequence for which nal_ref_idc=0. Accordingly, in order to achieve FF and RR playback, particularly at speeds from three to six times normal playback speed, it may be necessary to discard reference pictures and the displayed image may accordingly be degraded to an undesirable extent.
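The arithmetic behind these observations can be sketched in Python. Each picture is modeled as a pair consisting of a picture type and a flag indicating whether it is a reference picture. The MPEG-2 example follows the GOP structure given above. The H.264/AVC example is purely hypothetical: the assignment of reference and non-reference B pictures is an assumption made for illustration and is not taken from any particular encoder. The function names are likewise hypothetical.

```python
from fractions import Fraction


def discard_non_reference(pictures):
    """Keep only reference pictures; pictures are (type, is_reference) pairs."""
    return [p for p in pictures if p[1]]


def speed_without_non_reference(pictures):
    """Speed-up obtained when only non-reference pictures are discarded."""
    kept = discard_non_reference(pictures)
    if not kept:
        raise ValueError("no reference pictures to display")
    return Fraction(len(pictures), len(kept))


# MPEG-2: B pictures are never reference pictures; I and P pictures are.
mpeg2_gop = [(t, t != "B") for t in "IBBPBBPBBPBB"]

# Hypothetical H.264/AVC sequence IBBBPBBB in which the middle B picture of
# each run is a reference B picture (assumed for illustration only).
h264_gop = [
    ("I", True), ("B", False), ("B", True), ("B", False),
    ("P", True), ("B", False), ("B", True), ("B", False),
]

mpeg2_speed = speed_without_non_reference(mpeg2_gop)
h264_speed = speed_without_non_reference(h264_gop)
```

Under these assumptions the MPEG-2 sequence yields a speed of 3, whereas the hypothetical H.264/AVC sequence yields only 2, so that higher speeds would require discarding reference pictures.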