Published video coding standards include ITU-T H.261, ITU-T H.263, ISO/IEC MPEG-1, ISO/IEC MPEG-2, and ISO/IEC MPEG-4 Part 2. These standards are herein referred to as conventional video coding standards.
A standardization effort is under way in the Joint Video Team (JVT) of ITU-T and ISO/IEC. The work of the JVT is based on an earlier ITU-T standardization project called H.26L. The goal of the JVT standardization is to release the same standard text as both ITU-T Recommendation H.264 and ISO/IEC International Standard 14496-10 (MPEG-4 Part 10). The draft standard is referred to as the JVT coding standard in this application, and the codec according to the draft standard is referred to as the JVT codec.
Video Communication Systems
Video communication systems can be divided into conversational and non-conversational systems. Conversational systems include video conferencing and video telephony. Examples of such systems include ITU-T Recommendations H.320, H.323, and H.324, which specify video conferencing/telephony systems operating in ISDN, IP, and PSTN networks, respectively. Conversational systems are characterized by the intent to minimize the end-to-end delay (from audio-video capture to the far-end audio-video presentation) in order to improve the user experience.
Non-conversational systems include playback of stored content, such as Digital Versatile Disks (DVDs) or video files stored in a mass memory of a playback device, digital TV, and streaming.
In the following, some terms relating to video information are defined for clarity. A frame contains an array of luma samples and two corresponding arrays of chroma samples. A frame consists of two fields, a top field and a bottom field. A field is an assembly of alternate rows of a frame. A picture is either a frame or a field. A coded picture is either a coded field or a coded frame. In the JVT coding standard, a coded picture consists of one or more slices. A slice consists of an integer number of macroblocks, and a decoded macroblock corresponds to a 16×16 block of luma samples and two corresponding blocks of chroma samples. In the JVT coding standard, a slice is coded according to one of the following coding types: I (intra), P (predicted), B (bi-predictive), SI (switching intra), or SP (switching predicted). A coded picture is allowed to contain slices of different types. All types of pictures can be used as reference pictures for P, B, and SP slices. The instantaneous decoder refresh (IDR) picture is a particular type of coded picture that includes only slices of the I or SI slice types. No subsequent picture can refer to pictures that are earlier than the IDR picture in decoding order. In some video coding standards, a coded video sequence is an entity containing all pictures in the bitstream before the end of a sequence mark. In the JVT coding standard, a coded video sequence is an entity containing all coded pictures from an IDR picture (inclusive) to the next IDR picture (exclusive) in decoding order. In other words, a coded video sequence according to the JVT coding standard corresponds to a closed group of pictures (GOP) according to MPEG-2 video.
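The field and coded video sequence definitions above can be illustrated with a minimal Python sketch. The function names and the string labels for picture types are hypothetical and chosen only for illustration. The sketch also assumes that the top field holds the even-numbered rows (starting from the topmost row) and that a picture type label of "IDR" marks an IDR picture.

```python
from typing import List, Sequence, Tuple


def split_into_fields(frame_rows: Sequence) -> Tuple[list, list]:
    """Return (top_field, bottom_field) as alternate rows of a frame.

    Assumption: the top field holds rows 0, 2, 4, ... of the frame.
    """
    top = list(frame_rows[0::2])
    bottom = list(frame_rows[1::2])
    return top, bottom


def split_into_coded_video_sequences(picture_types: Sequence[str]) -> List[List[str]]:
    """Group pictures, given in decoding order, into coded video sequences.

    Each group runs from an IDR picture (inclusive) to the next IDR
    picture (exclusive). Pictures before the first IDR picture, if any,
    are collected into a leading group of their own.
    """
    sequences: List[List[str]] = []
    for ptype in picture_types:
        if ptype == "IDR" or not sequences:
            sequences.append([])
        sequences[-1].append(ptype)
    return sequences


if __name__ == "__main__":
    print(split_into_fields(["row0", "row1", "row2", "row3"]))
    print(split_into_coded_video_sequences(["IDR", "P", "B", "IDR", "P"]))
```

In this sketch every group begins at an IDR picture, so no picture in a group needs a reference picture from an earlier group, which is the property that makes the group comparable to a closed GOP.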
Conventional video coding standards have specified a structure for an elementary bitstream, i.e., a self-contained bitstream that decoders can parse. The bitstream consists of several layers, typically including several of the following: a sequence layer, a group of pictures (GOP) layer, a picture layer, a slice layer, a macroblock layer, and a block layer. The bitstream for each layer typically comprises a header and associated data.
The JVT codec specification itself distinguishes conceptually between a video coding layer (VCL) and a network abstraction layer (NAL). The VCL contains the signal processing functionality of the codec, such as transform, quantization, motion search/compensation, and the loop filter. It follows the general concept of most of today's video codecs: a macroblock-based coder that utilizes inter picture prediction with motion compensation and transform coding of the residual signal. The output of the VCL is a set of slices. Each slice is a bit string that contains the macroblock data of an integer number of macroblocks and the information of the slice header (containing the spatial address of the first macroblock in the slice, the initial quantization parameter, and similar). Macroblocks in slices are ordered in scan order unless a different macroblock allocation is specified using the so-called Flexible Macroblock Ordering syntax. In-picture prediction is used only within a slice.
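As an illustration of slices that each cover an integer number of macroblocks in scan order, the following Python sketch partitions the macroblocks of a picture into slices and records the address of the first macroblock of each slice. The names Slice, macroblock_count, and partition_into_slices are hypothetical, and the fixed number of macroblocks per slice is an assumption made for simplicity; an encoder is free to choose slice sizes by other criteria. Flexible Macroblock Ordering is not modeled.

```python
from dataclasses import dataclass
from typing import List

MB_SIZE = 16  # a macroblock covers 16x16 luma samples


@dataclass
class Slice:
    first_mb_address: int  # scan-order address of the first macroblock
    mb_count: int          # number of macroblocks in the slice


def macroblock_count(width: int, height: int) -> int:
    """Number of macroblocks needed to cover a width x height picture."""
    mbs_wide = (width + MB_SIZE - 1) // MB_SIZE
    mbs_high = (height + MB_SIZE - 1) // MB_SIZE
    return mbs_wide * mbs_high


def partition_into_slices(total_mbs: int, mbs_per_slice: int) -> List[Slice]:
    """Split macroblocks 0..total_mbs-1 into consecutive scan-order slices."""
    if mbs_per_slice <= 0:
        raise ValueError("mbs_per_slice must be positive")
    slices = []
    for first in range(0, total_mbs, mbs_per_slice):
        slices.append(Slice(first, min(mbs_per_slice, total_mbs - first)))
    return slices


if __name__ == "__main__":
    total = macroblock_count(176, 144)  # QCIF picture
    for s in partition_into_slices(total, 40):
        print(s)
```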
The NAL encapsulates the slice output of the VCL into Network Abstraction Layer Units (NALUs), which are suitable for the transmission over packet networks or the use in packet oriented multiplex environments. All NAL units relating to a certain picture form an access unit. JVT's Annex B defines an encapsulation process to transmit such NALUs over byte-stream oriented networks. A stream of NAL units does not form an elementary bitstream as such because there are no start codes in NAL units, but rather NAL units have to be framed with start codes according to Annex B of the JVT coding standard to form an elementary bitstream.
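The start-code framing described above can be sketched in Python as follows. This is a simplified illustration, not the Annex B procedure itself: it assumes the three-byte start code prefix 0x000001, and it assumes that the NAL unit payloads never contain that byte pattern themselves. The function names and the payload bytes in the usage example are hypothetical placeholders.

```python
from typing import Iterable, List

# Assumption: a three-byte start code prefix precedes every NAL unit.
START_CODE = b"\x00\x00\x01"


def frame_nal_units(nal_units: Iterable[bytes]) -> bytes:
    """Concatenate NAL units into one byte stream, each preceded by a start code."""
    return b"".join(START_CODE + nal for nal in nal_units)


def split_byte_stream(stream: bytes) -> List[bytes]:
    """Recover the NAL units from a stream produced by frame_nal_units().

    Assumes payloads never contain the start code pattern.
    """
    units = []
    pos = stream.find(START_CODE)
    while pos != -1:
        begin = pos + len(START_CODE)
        nxt = stream.find(START_CODE, begin)
        end = len(stream) if nxt == -1 else nxt
        units.append(stream[begin:end])
        pos = nxt
    return units


if __name__ == "__main__":
    nal_units = [b"\x67\x42", b"\x68\xce", b"\x65\x88\x84"]  # placeholder payloads
    stream = frame_nal_units(nal_units)
    print(stream.hex())
    print(split_byte_stream(stream) == nal_units)
```

Because a receiver of a byte stream has no packet boundaries to rely on, the start codes are what allow it to find where each NAL unit begins.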
The optional reference picture selection mode of H.263 and the NEWPRED coding tool of MPEG-4 Part 2 enable selection of the reference frame for motion compensation for each picture segment, e.g., for each slice in H.263. Furthermore, the optional Enhanced Reference Picture Selection mode of H.263 and the JVT coding standard enable selection of the reference frame for each macroblock separately.
Parameter Set Concept
The JVT coding standard contains headers at the slice layer and below, but it does not include picture, GOP, or sequence headers. Instead, the concept of a parameter set, introduced in ITU-T document VCEG-N55, replaces such headers. An instance of a parameter set includes all picture, GOP, and sequence level data such as picture size, display window, optional coding modes employed, macroblock allocation map, and others. Each parameter set instance includes a unique identifier. Each slice header includes a reference to a parameter set identifier, and the parameter values of the referenced parameter set are used when decoding the slice. Parameter sets decouple the transmission and decoding order of infrequently changing picture, GOP, and sequence level data from sequence, GOP, and picture boundaries. Parameter sets can be transmitted out-of-band using a reliable transmission protocol as long as they are decoded before they are referred to. If parameter sets are transmitted in-band, they can be repeated multiple times to improve error resilience compared to conventional video coding schemes. Preferably, the parameter sets are transmitted at session set-up time. However, in some systems, mainly broadcast ones, reliable out-of-band transmission of parameter sets is not feasible, and parameter sets are instead conveyed in-band in Parameter Set NAL units.
In order to be able to change picture parameters (such as the picture size) without needing to transmit Parameter Set updates synchronously with the slice packet stream, the encoder and decoder can maintain a list of more than one Parameter Set. Each slice header contains a codeword that indicates the Parameter Set to be used.
This mechanism allows the transmission of the Parameter Sets to be decoupled from the packet stream, so that they can be transmitted by external means, e.g., as a side effect of the capability exchange, or through a (reliable or unreliable) control protocol. It is even possible that they are never transmitted but are instead fixed by an application design specification.
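The parameter set mechanism described above can be sketched in Python as a table keyed by the parameter set identifier. All class and field names (ParameterSet, SliceHeader, ParameterSetStore, ps_id, and so on) are hypothetical, and the parameter set here carries only a picture size, whereas an actual instance carries many more parameters.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ParameterSet:
    ps_id: int   # unique identifier of this parameter set instance
    width: int   # picture width in luma samples
    height: int  # picture height in luma samples


@dataclass(frozen=True)
class SliceHeader:
    ps_id: int             # codeword naming the parameter set to use
    first_mb_address: int  # address of the first macroblock in the slice


class ParameterSetStore:
    """Decoder-side list of parameter sets, however they were delivered."""

    def __init__(self) -> None:
        self._sets: Dict[int, ParameterSet] = {}

    def receive(self, ps: ParameterSet) -> None:
        # Works for out-of-band delivery and for in-band repeats alike:
        # a repeated instance simply replaces the stored one.
        self._sets[ps.ps_id] = ps

    def lookup(self, header: SliceHeader) -> ParameterSet:
        # A parameter set must have arrived before a slice refers to it.
        if header.ps_id not in self._sets:
            raise LookupError(f"parameter set {header.ps_id} not received")
        return self._sets[header.ps_id]


if __name__ == "__main__":
    store = ParameterSetStore()
    store.receive(ParameterSet(ps_id=0, width=176, height=144))
    store.receive(ParameterSet(ps_id=1, width=352, height=288))
    # Switching picture size only requires a different identifier in the slice header.
    print(store.lookup(SliceHeader(ps_id=0, first_mb_address=0)))
    print(store.lookup(SliceHeader(ps_id=1, first_mb_address=0)))
```

Keeping several instances in the table is what lets a slice stream change picture parameters without a synchronized parameter update: the change is signaled by the identifier alone.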
There are some disadvantages with pre-defined parameter sets. First, if many parameter set instances need to be transmitted at the beginning of a session, the out-of-band method may become overburdened or the start-up latency of the session may become too long. Second, in systems lacking feasible mechanisms for reliable out-of-band transmission of parameter sets, in-band transport of Parameter Set NAL units is not reliable. Third, in broadcast applications, the parameter set information should be transmitted frequently to allow new users to join during the broadcast, and redundant transmission of all the active parameter set instances is costly in terms of bit rate.
Transmission of Multimedia Streams
A multimedia streaming system consists of a streaming server and a number of players, which access the server via a network. The network is typically packet-oriented and provides little or no means to guarantee quality of service. The players fetch either pre-stored or live multimedia content from the server and play it back in real-time while the content is being downloaded.
The type of communication can be either point-to-point or multicast. In point-to-point streaming, the server provides a separate connection for each player. In multicast streaming, the server transmits a single data stream to a number of players, and network elements duplicate the stream only if it is necessary.
When a player has established a connection to a server and requested a multimedia stream, the server begins to transmit the desired stream. The player does not start playing the stream back immediately, but rather it typically buffers the incoming data for a few seconds. Herein, this buffering is referred to as initial buffering. Initial buffering helps to maintain pauseless playback because, in case of occasional increased transmission delays or network throughput drops, the player can decode and play buffered data.
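The effect of initial buffering can be illustrated with a small Python sketch that counts how many frames arrive after their scheduled playout time. This is a deliberately simplified model under stated assumptions: the function name and the arrival times are hypothetical, playback runs at a fixed frame interval, and a late frame does not shift the playout schedule (rebuffering is not modeled). Times are in integer milliseconds.

```python
from typing import Sequence


def count_late_frames(arrival_times_ms: Sequence[int],
                      frame_interval_ms: int,
                      initial_buffering_ms: int) -> int:
    """Count frames that arrive after their scheduled playout time.

    Playout of frame i is scheduled at
    (arrival of first frame + initial buffering + i * frame interval).
    """
    if not arrival_times_ms:
        return 0
    playout_start = arrival_times_ms[0] + initial_buffering_ms
    late = 0
    for i, arrival in enumerate(arrival_times_ms):
        deadline = playout_start + i * frame_interval_ms
        if arrival > deadline:
            late += 1
    return late


if __name__ == "__main__":
    # Hypothetical arrivals: the fourth frame is held up by a delay spike.
    arrivals = [0, 100, 200, 800, 850, 900]
    print(count_late_frames(arrivals, 100, 0))     # no initial buffering
    print(count_late_frames(arrivals, 100, 1000))  # one second of initial buffering
```

With these hypothetical arrival times, playback without initial buffering misses the deadlines of the three frames affected by the delay spike, whereas one second of initial buffering absorbs the spike entirely.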