1. Field of the Invention
The present invention relates to video processing systems, and, in particular, to apparatuses and methods for encoding video sequences in a bitstream which is backward compatible for decoding into lower quality video by older decoders and which may be decoded into high-quality progressive video by newer decoders compatible with the high-quality encoding.
2. Description of the Related Art
Data signals are often subjected to computer processing techniques such as data compression or encoding, and data decompression or decoding. The data signals may be, for example, video signals. Video signals are typically representative of video pictures (images) of a motion video sequence. In video signal processing, video signals are digitally compressed by encoding the video signal in accordance with a specified coding standard to form a digital, encoded bitstream. An encoded video signal bitstream may be decoded to provide decoded video signals.
The term “frame” is commonly used for the unit of a video sequence. A frame contains lines of spatial information of a video signal. Depending on the encoding format, a frame may consist of one or more fields of video data. Thus, various segments of an encoded bitstream represent a given frame or field. The encoded bitstream may be stored for later retrieval by a video decoder, and/or transmitted to a remote video signal decoding system, over transmission channels or systems such as Integrated Services Digital Network (ISDN) and Public Switched Telephone Network (PSTN) telephone connections, cable, and direct satellite systems (DSS).
Video signals are often encoded, transmitted, and decoded for use in television (TV) type systems. Many common TV systems, e.g. in North America, operate in accordance with the NTSC (National Television Systems Committee) standard, which operates at (30*1000/1001)≈29.97 frames/second (fps). The spatial resolution of NTSC is sometimes referred to as SDTV (standard definition TV). NTSC originally used 30 fps to be half the frequency of the 60 cycle AC power supply system. It was later changed to 29.97 fps to throw it “out of phase” with power, to reduce harmonic distortions. Other systems, such as PAL (Phase Alternation by Line), are also used, e.g. in Europe.
In the NTSC system, each frame of data is typically composed of an even field interlaced or interleaved with an odd field. Each field consists of the pixels in alternating horizontal lines of the picture or frame. Accordingly, NTSC cameras output 29.97×2=59.94 fields of analog video signals per second, which includes 29.97 even fields interlaced with 29.97 odd fields, to provide video at 29.97 fps. NTSC images typically have a resolution of approximately 720 (h)×480 (v) active pixels. Thus, each field is 720×240, to provide interlaced frames of 720×480. These specifications are provided in CCIR Rec. 601, which specifies the image format, acquisition semantic, and parts of the coding for digital “standard” television signals. (“Standard” television is in the resolution of PAL, NTSC, and SECAM.)
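By way of illustration only, the field structure described above may be sketched as follows. The function names `split_fields` and `interleave_fields` and the synthetic test frame are assumptions introduced for this example; the sketch assumes that the even field holds lines 0, 2, 4, ... and the odd field holds lines 1, 3, 5, ... of a 720×480 frame.

```python
from fractions import Fraction

# Rates taken from the text: 30*1000/1001 frames/s, two fields per frame.
FRAME_RATE = Fraction(30 * 1000, 1001)   # about 29.97 frames/s
FIELD_RATE = 2 * FRAME_RATE              # about 59.94 fields/s

WIDTH, HEIGHT = 720, 480                 # active pixels per frame


def split_fields(frame):
    """Split a frame (a list of rows) into its even and odd fields."""
    even_field = frame[0::2]   # lines 0, 2, 4, ... (assumed numbering)
    odd_field = frame[1::2]    # lines 1, 3, 5, ...
    return even_field, odd_field


def interleave_fields(even_field, odd_field):
    """Rebuild the interlaced frame by alternating lines of the two fields."""
    frame = []
    for even_line, odd_line in zip(even_field, odd_field):
        frame.append(even_line)
        frame.append(odd_line)
    return frame


# Synthetic frame used only to exercise the functions.
frame = [[(x + y) % 256 for x in range(WIDTH)] for y in range(HEIGHT)]
even_field, odd_field = split_fields(frame)

print(len(even_field), len(even_field[0]))   # field height and width
print(round(float(FIELD_RATE), 2))           # fields per second
```

Each field produced in this way has 240 lines of 720 pixels, and interleaving the two fields recovers the original 480-line frame.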
Various video compression standards are used for digital video processing, which specify the coded bitstream for a given video coding standard. These standards include the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) 11172 Moving Picture Experts Group-1 international standard (“Coding of Moving Pictures and Associated Audio for Digital Storage Media”) (MPEG-1), and the ISO/IEC 13818 international standard (“Generic Coding of Moving Pictures and Associated Audio Information”) (MPEG-2). Another video coding standard is H.261 (P×64), developed by the International Telecommunication Union (ITU). In MPEG, the term “picture” refers to a bitstream of data which can represent either a frame of data (i.e., both fields), or a single field of data. Thus, MPEG encoding techniques are used to encode MPEG “pictures” from fields or frames of video data.
MPEG-1 was built around the Source Input Format (SIF) of 352×240 at 30 frames per second (fps). MPEG data rates are variable, although MPEG-1 was designed to provide VHS video quality at a data rate of 1.2 megabits per second, or 150 KB/sec. In the MPEG-1 standard, video is strictly non-interlaced (i.e. progressive). For progressive video, the lines of a frame contain samples starting from one time instant and continuing through successive lines to the bottom of the frame.
MPEG-2, adopted in the Spring of 1994, is a compatible extension to MPEG-1, which builds on MPEG-1 and also supports interlaced video formats and a number of other advanced features, including features to support HDTV (high-definition TV). MPEG-2 was designed, in part, to be used with NTSC-type broadcast TV sample rates using CCIR Rec. 601 (720 samples/line by 480 lines per frame by 29.97 fps). In the interlacing employed by MPEG-2, a frame is split into two fields, a top field and a bottom field. One of these fields commences one field period later than the other. Each video field is a subset of the pixels of a picture transmitted separately. MPEG-2 can be used, for example, to encode video for broadcast. The MPEG standards can support a variety of frame rates and formats.
Motion compensation is commonly utilized in video signal processing. Motion compensation techniques exploit the temporal correlation that often exists between consecutive pictures, in which there is a tendency of some objects or image features to move within restricted boundaries from one location to another from picture to picture. In the MPEG standards, such as the MPEG-2 standard, there may be different picture or frame types in the compressed digital stream, such as I frames, P frames, and B frames. I frames, or intra-frames, are self-contained, that is, they are not based on information from previously transmitted and decoded frames. Video frames which are encoded with motion compensation techniques are referred to as predicted frames, or P frames, since their content is predicted from the content of previous I or P frames. P frames may also be utilized as a base for a subsequent P frame. I and P frames are both “anchor” frames, since they may be used as a basis for other frames, such as B or P frames which are predicted based on anchor frames. A “bidirectional” or B frame is predicted from the two anchor frames transmitted most recently relative to the transmission of the B frame. Other standards, such as H.261, utilize only I and P frames.
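Because a B frame is predicted from the two anchor frames transmitted most recently, the anchor frame that follows a B frame in display order must be transmitted before that B frame. The following is a minimal sketch of such a reordering, not a description of any particular encoder; the function name `transmission_order` and the example frame pattern are assumptions made for illustration.

```python
def transmission_order(display_types):
    """Reorder frames from display order to a possible transmission order.

    display_types is a string such as "IBBPBBP", one letter per frame in
    display order. Returns a list of (display_index, frame_type) tuples in
    which each anchor frame (I or P) precedes the B frames that depend on it.
    """
    ordered = []
    pending_b = []   # B frames waiting for their following anchor frame
    for index, frame_type in enumerate(display_types):
        if frame_type == "B":
            pending_b.append((index, frame_type))
        else:
            # Send the anchor first, then the B frames displayed before it.
            ordered.append((index, frame_type))
            ordered.extend(pending_b)
            pending_b = []
    # Trailing B frames would wait for the next anchor in a real stream;
    # this sketch simply appends them.
    ordered.extend(pending_b)
    return ordered


print(transmission_order("IBBPBBP"))
```

For the display sequence I B B P B B P, the sketch yields the order I(0), P(3), B(1), B(2), P(6), B(4), B(5), so that both anchor frames of every B frame have been sent before the B frame itself.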
Most MPEG encoding schemes use a sequence of twelve to fifteen compressed frames called a group of pictures (GOP). Each GOP typically begins with an I frame, and optionally includes a number of B and P frames. The parameter M is often used to represent the distance between P frames in a GOP, and the parameter N represents the total number of frames in a GOP (i.e., the distance between I frames in consecutive GOPs).
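As an illustration of the parameters M and N, the sketch below generates the frame-type pattern of one GOP in display order. It assumes a regular structure in which the first frame is an I frame, every M-th frame thereafter is a P frame, and all remaining frames are B frames; the function name `gop_pattern` is introduced here for illustration only, and actual encoders may use other arrangements.

```python
def gop_pattern(n, m):
    """Return the display-order frame types of one GOP as a string.

    n: total number of frames in the GOP.
    m: distance between successive P frames (m = 1 means no B frames).
    """
    if n < 1 or m < 1:
        raise ValueError("n and m must be positive integers")
    types = []
    for position in range(n):
        if position == 0:
            types.append("I")   # each GOP begins with an I frame
        elif position % m == 0:
            types.append("P")   # a P frame every m frames
        else:
            types.append("B")   # B frames fill the gaps
    return "".join(types)


print(gop_pattern(12, 3))
print(gop_pattern(15, 3))
```

With N=12 and M=3 this produces the pattern IBBPBBPBBPBB, and with N=15 and M=3 it produces IBBPBBPBBPBBPBB.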
An MPEG bitstream typically contains one or more video streams multiplexed with one or more audio streams and other data, such as timing information. In MPEG-2, encoded data which describes a particular video sequence is represented in several nested layers: the Sequence layer, the GOP layer, the Picture layer, the Slice layer, and the Macroblock layer. To aid in transmitting this information, a digital data stream representing multiple video sequences is divided into several smaller units and each of these units is encapsulated into a respective packetized elementary stream (PES) packet. For transmission, each PES packet is divided, in turn, among a plurality of fixed-length transport packets. Each transport packet contains data relating to only one PES packet. The transport packet also includes a header which holds control information to be used in decoding the transport packet.
Thus, the basic unit of an MPEG stream is the packet, which includes a packet header and packet data. Each packet may represent, for example, a field of data. The packet header includes a stream identification code and may include one or more time-stamps. For example, each data packet may be over 100 bytes long, with the first two 8-bit bytes containing a packet-identifier (PID) field. In a DSS application, for example, the PID may be a SCID (service channel ID) and various flags. The SCID is typically a 12-bit number that uniquely identifies the particular data stream to which a data packet belongs. Thus, each compressed video packet contains a PID such as a SCID.
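The extraction of a SCID from the first two bytes of a packet may be sketched as follows. The bit layout used here (four flag bits in the most significant positions followed by the 12-bit SCID, in big-endian byte order) is an assumption made for this example, as is the function name `parse_prefix`; the actual layout depends on the transport format in use.

```python
SCID_BITS = 12
SCID_MASK = (1 << SCID_BITS) - 1   # 0x0FFF, selects the low 12 bits


def parse_prefix(packet):
    """Return (flags, scid) from the first two bytes of a packet.

    Assumed layout: the upper 4 bits of the 16-bit prefix are flags and
    the lower 12 bits are the SCID.
    """
    if len(packet) < 2:
        raise ValueError("packet must contain at least two bytes")
    prefix = int.from_bytes(packet[:2], "big")
    flags = prefix >> SCID_BITS
    scid = prefix & SCID_MASK
    return flags, scid


# Hypothetical packet: prefix 0xA123 followed by 128 zero bytes of data.
example_packet = bytes([0xA1, 0x23]) + bytes(128)
flags, scid = parse_prefix(example_packet)
print(hex(flags), hex(scid))
```

A decoder could compare the extracted SCID against the SCID of the selected program and discard packets belonging to other data streams.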
When an MPEG-2 encoded image is received by a video decoding system, a transport decoder decodes the transport packets to reassemble the PES packets. The PES packets, in turn, are decoded to reassemble the MPEG-2 bitstream which represents the image. A given transport data stream may simultaneously convey multiple image sequences, for example as interleaved transport packets.
For example, an MPEG-2 encoded video bitstream may be transported by means of DSS packets when DSS transmissions are employed. Most DSS video programs are encoded at 544 pixels/line and 480 lines/frame. All 29.97 frames/sec are coded. The exact number of coded frames/sec depends on the exact sequence. DSS systems allow users to receive directly many TV channels broadcast from satellites, with a DSS receiver. The receiver typically includes a small 18-inch satellite dish connected by a cable to an integrated receiver/decoder unit (IRD). The satellite dish is aimed toward the satellites, and the IRD is connected to the user's television in a similar fashion to a conventional cable-TV decoder. In the IRD, front-end circuitry receives a signal from the satellite and converts it to the original digital data stream, which is fed to video/audio decoder circuits which perform transport extraction and decompression. For MPEG-2 video, the IRD comprises an MPEG-2 decoder used to decompress the received compressed video.
In MPEG-2, four different “profiles” are defined, each corresponding to a different level of complexity of the encoded image, e.g. the image/picture resolution. Each profile defines the colorspace resolution and scalability of the bitstream. For each profile, different levels are defined, each level corresponding to a different image resolution. The various levels for a given profile define the maximum and minimum image resolution, the number of Y (luminance) samples per second, the number of video and audio layers supported for scalable profiles, and the maximum bit rate per profile. The combination of a profile and a level produces an architecture which defines the ability of a decoder to handle a particular bitstream.
The most common profile for broadcast applications is the main profile (MP) format. One of the MPEG-2 “standards,” known as Main Profile, Main (or Medium) Level (MP@ML) is intended for encoding video signals conforming to existing SD television standards (i.e., NTSC and PAL). This standard may be used to encode video images having 480 active lines each with 720 active pixels with a 2:1 interlace scan. When the horizontal and vertical blanking intervals are added to these signals, the result has 525 lines by 858 pixels. When they are decoded, and displayed with a 13.5 MHz display clock signal, these signals produce images that correspond to NTSC-type broadcast images. Another standard, known as Main Profile, High Level (MP@HL), is intended for encoding HDTV images.
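The relationship between the 525 line by 858 pixel raster, the frame rate of 30*1000/1001 fps, and the 13.5 MHz clock can be checked with a short calculation. The sketch below uses only figures stated above; the variable names are chosen for illustration.

```python
from fractions import Fraction

TOTAL_PIXELS_PER_LINE = 858    # active pixels plus horizontal blanking
TOTAL_LINES_PER_FRAME = 525    # active lines plus vertical blanking
ACTIVE_PIXELS_PER_LINE = 720
ACTIVE_LINES_PER_FRAME = 480
FRAME_RATE = Fraction(30 * 1000, 1001)   # about 29.97 frames/s

# Samples per second needed to scan the full raster at the frame rate.
pixel_clock_hz = TOTAL_PIXELS_PER_LINE * TOTAL_LINES_PER_FRAME * FRAME_RATE

# Fraction of the raster that carries active picture samples.
active_fraction = Fraction(
    ACTIVE_PIXELS_PER_LINE * ACTIVE_LINES_PER_FRAME,
    TOTAL_PIXELS_PER_LINE * TOTAL_LINES_PER_FRAME,
)

print(pixel_clock_hz)                     # samples per second
print(round(float(active_fraction), 3))   # share of active samples
```

The product 858 × 525 × 30000/1001 is exactly 13,500,000 samples per second, which corresponds to the 13.5 MHz clock, and roughly 77% of those samples fall within the 720×480 active picture area.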
As the quality of some systems such as TV systems improves, it is desirable to provide HD-compatible encoded video signals for video transmissions. However, there may be both SD and HD receivers, and the SD receivers and systems may not be compatible with the improved transmission/encoding standard, i.e. the improved standards may not be “backward-compatible”. For example, conventional SD DSS IRDs are not able to decode any formats better than MP@ML formats. Thus, some DSS systems are forced to transmit an HD channel of data, as well as an SD version of the HD channel, so that the DSS SD receivers can receive and decode the transmission. This is a very expensive solution since it takes a complete SD channel bandwidth in addition to the HD channel bandwidth. Bandwidth is wasted since redundant information is transmitted. There is a need, therefore, for techniques for encoding and transmitting improved or enhanced signals which are also backward compatible with the prior standard, to avoid having to transmit redundant or extra channels of data.