Published video coding standards include ITU-T H.261, ITU-T H.263, ISO/IEC MPEG-1, ISO/IEC MPEG-2, and ISO/IEC MPEG-4 Part 2. These standards are herein referred to as conventional video coding standards.
Video Communication Systems
Video communication systems can be divided into conversational and non-conversational systems. Conversational systems include video conferencing and video telephony. Examples of such systems include ITU-T Recommendations H.320, H.323, and H.324 that specify a video conferencing/telephony system operating in ISDN, IP, and PSTN networks respectively. Conversational systems are characterized by the intent to minimize the end-to-end delay (from audio-video capture to the far-end audio-video presentation) in order to improve the user experience.
Non-conversational systems include playback of stored content, such as Digital Versatile Disks (DVDs) or video files stored in a mass memory of a playback device, digital TV, and streaming. A short review of the most important standards in these technology areas is given below.
A dominant standard in digital video consumer electronics today is MPEG-2, which includes specifications for video compression, audio compression, storage, and transport. The storage and transport of coded video is based on the concept of an elementary stream. An elementary stream consists of coded data from a single source (e.g. video) plus ancillary data needed for synchronization, identification and characterization of the source information. An elementary stream is packetized into either constant-length or variable-length packets to form a Packetized Elementary Stream (PES). Each PES packet consists of a header followed by stream data called the payload. PES packets from various elementary streams are combined to form either a Program Stream (PS) or a Transport Stream (TS). PS is aimed at applications having negligible transmission errors, such as store-and-play applications. TS is aimed at applications that are susceptible to transmission errors. However, TS assumes that the network throughput is guaranteed to be constant.
There is a standardization effort going on in a Joint Video Team (JVT) of ITU-T and ISO/IEC. The work of JVT is based on an earlier standardization project in ITU-T called H.26L. The goal of the JVT standardization is to release the same standard text as ITU-T Recommendation H.264 and ISO/IEC International Standard 14496-10 (MPEG-4 Part 10). The draft standard is referred to as the JVT coding standard in this paper, and the codec according to the draft standard is referred to as the JVT codec.
The codec specification itself distinguishes conceptually between a video coding layer (VCL) and a network abstraction layer (NAL). The VCL contains the signal processing functionality of the codec, such as transform, quantization, motion search/compensation, and the loop filter. It follows the general concept of most of today's video codecs: a macroblock-based coder that utilizes inter picture prediction with motion compensation, and transform coding of the residual signal. The output of the VCL consists of slices. A slice is a bit string that contains the macroblock data of an integer number of macroblocks, and the information of the slice header (containing the spatial address of the first macroblock in the slice, the initial quantization parameter, and similar). Macroblocks in slices are ordered in scan order unless a different macroblock allocation is specified, using the so-called Flexible Macroblock Ordering syntax. In-picture prediction is used only within a slice.
The NAL encapsulates the slice output of the VCL into Network Abstraction Layer Units (NALUs), which are suitable for the transmission over packet networks or the use in packet oriented multiplex environments. JVT's Annex B defines an encapsulation process to transmit such NALUs over byte-stream oriented networks.
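The byte-stream encapsulation can be illustrated with a minimal sketch. It assumes the three-byte start code prefix 0x000001 of the published H.264 Annex B; the draft referred to here may differ in detail. The function names are hypothetical, and the sketch ignores whatever measures are needed to keep the start code from occurring inside a NALU.

```python
START_CODE = b"\x00\x00\x01"  # assumed start code prefix (published H.264 Annex B)


def to_byte_stream(nalus):
    """Concatenate NALUs, preceding each one with a start code."""
    return b"".join(START_CODE + nalu for nalu in nalus)


def split_byte_stream(data):
    """Recover the NALUs from a byte stream by searching for start codes."""
    units = []
    pos = data.find(START_CODE)
    while pos != -1:
        begin = pos + len(START_CODE)
        nxt = data.find(START_CODE, begin)
        end = len(data) if nxt == -1 else nxt
        # A longer prefix (00 00 00 01) leaves zero bytes in front of the next
        # start code; this sketch assumes a NALU never ends in a zero byte.
        units.append(data[begin:end].rstrip(b"\x00"))
        pos = nxt
    return units


nalus = [b"\x67\xaa", b"\x68\xbb", b"\x65\x01\x02"]
stream = to_byte_stream(nalus)
recovered = split_byte_stream(stream)
```

A receiver on a byte-stream oriented channel has no packet boundaries to rely on, which is why a recognizable prefix of this kind is needed to find where each NALU begins.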
The optional reference picture selection mode of H.263 and the NEWPRED coding tool of MPEG-4 Part 2 enable selection of the reference frame for motion compensation per each picture segment, e.g., per each slice in H.263. Furthermore, the optional Enhanced Reference Picture Selection mode of H.263 and the JVT coding standard enable selection of the reference frame for each macroblock separately.
Reference picture selection enables many types of temporal scalability schemes. FIG. 1 shows an example of a temporal scalability scheme, which is herein referred to as recursive temporal scalability. The example scheme can be decoded with three constant frame rates. FIG. 2 depicts a scheme referred to as Video Redundancy Coding, where a sequence of pictures is divided into two or more independently coded threads in an interleaved manner. The arrows in these and all the subsequent figures indicate the direction of motion compensation and the values under the frames correspond to the relative capturing and displaying times of the frames.
Parameter Set Concept
One fundamental design concept of the JVT codec is to generate self-contained packets, which makes mechanisms such as header duplication unnecessary. This was achieved by decoupling information that is relevant to more than one slice from the media stream. This higher layer meta information should be sent reliably, asynchronously and in advance from the RTP packet stream that contains the slice packets. This information can also be sent in-band in applications that do not have an out-of-band transport channel appropriate for the purpose. The combination of the higher level parameters is called a Parameter Set. The Parameter Set contains information such as picture size, display window, optional coding modes employed, macroblock allocation map, and others.
In order to be able to change picture parameters (such as the picture size), without having the need to transmit Parameter Set updates synchronously to the slice packet stream, the encoder and decoder can maintain a list of more than one Parameter Set. Each slice header contains a codeword that indicates the Parameter Set to be used.
This mechanism allows the transmission of the Parameter Sets to be decoupled from the packet stream, so that they can be transmitted by external means, e.g. as a side effect of the capability exchange, or through a (reliable or unreliable) control protocol. It is even possible that they are never transmitted but are fixed by an application design specification.
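The Parameter Set mechanism can be sketched as a small lookup table that is filled out-of-band and consulted for every slice. This is only an illustration: the class names, the fields and the identifier values are hypothetical, and a real Parameter Set carries many more parameters than the picture size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParameterSet:
    # Only the picture size is modelled here (illustrative subset).
    width: int
    height: int


class ParameterSetStore:
    """List of Parameter Sets maintained identically by encoder and decoder."""

    def __init__(self):
        self._sets = {}

    def store(self, ps_id, parameter_set):
        # Called when a Parameter Set arrives, e.g. over a control protocol.
        self._sets[ps_id] = parameter_set

    def lookup(self, ps_id):
        # Called per slice, with the codeword taken from the slice header.
        if ps_id not in self._sets:
            raise LookupError(f"Parameter Set {ps_id} has not been received")
        return self._sets[ps_id]


store = ParameterSetStore()
store.store(0, ParameterSet(176, 144))
store.store(1, ParameterSet(352, 288))

# The picture size changes between slices without any synchronous update.
slice_header_ids = [0, 0, 1]
sizes = [(store.lookup(i).width, store.lookup(i).height) for i in slice_header_ids]
```

Because both Parameter Sets are delivered in advance, the switch from one picture size to the other needs nothing more than a different identifier in the slice header.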
Transmission Order
In conventional video coding standards, the decoding order of pictures is the same as the display order except for B pictures. A block in a conventional B picture can be bi-directionally temporally predicted from two reference pictures, where one reference picture is temporally preceding and the other reference picture is temporally succeeding in display order. Only the latest reference picture in decoding order can succeed the B picture in display order (exception: interlaced coding in H.263 where both field pictures of a temporally subsequent reference frame can precede a B picture in decoding order). A conventional B picture cannot be used as a reference picture for temporal prediction, and therefore a conventional B picture can be disposed without affecting the decoding of any other pictures.
The JVT coding standard includes the following novel technical features compared to earlier standards:
The decoding order of pictures is decoupled from the display order. The picture number indicates decoding order and the picture order count indicates the display order.
Reference pictures for a block in a B picture can either be before or after the B picture in display order. Consequently, a B picture stands for a bi-predictive picture instead of a bi-directional picture.
Pictures that are not used as reference pictures are marked explicitly. A picture of any type (intra, inter, B, etc.) can either be a reference picture or a non-reference picture. (Thus, a B picture can be used as a reference picture for temporal prediction of other pictures.)
A picture can contain slices that are coded with a different coding type. In other words, a coded picture may consist of an intra-coded slice and a B-coded slice, for example.
Decoupling of display order from decoding order can be beneficial from the point of view of compression efficiency and error resiliency.
An example of a prediction structure potentially improving compression efficiency is presented in FIG. 3. Boxes indicate pictures, capital letters within boxes indicate coding types, numbers within boxes are picture numbers according to the JVT coding standard, and arrows indicate prediction dependencies. Note that picture B17 is a reference picture for pictures B18. Compression efficiency is potentially improved compared to conventional coding, because the reference pictures for pictures B18 are temporally closer than in conventional coding with PBBP or PBBBP coded picture patterns. Compression efficiency is also potentially improved compared to the conventional PBP coded picture pattern, because some of the reference pictures are bi-directionally predicted.
FIG. 4 presents an example of the intra picture postponement method that can be used to improve error resiliency. Conventionally, an intra picture is coded immediately after a scene cut or as a response to an expired intra picture refresh period, for example. In the intra picture postponement method, an intra picture is not coded immediately after a need to code an intra picture arises, but rather a temporally subsequent picture is selected as an intra picture. Each picture between the coded intra picture and the conventional location of an intra picture is predicted from the next temporally subsequent picture. As FIG. 4 shows, the intra picture postponement method generates two independent inter picture prediction chains, whereas conventional coding algorithms produce a single inter picture chain. It is intuitively clear that the two-chain approach is more robust against erasure errors than the one-chain conventional approach. If one chain suffers from a packet loss, the other chain may still be correctly received. In conventional coding, a packet loss always causes error propagation to the rest of the inter picture prediction chain.
Two types of ordering and timing information have been conventionally associated with digital video: decoding and presentation order. A closer look at the related technology is taken below.
A decoding timestamp (DTS) indicates the time, relative to a reference clock, at which a coded data unit is supposed to be decoded. If DTS is coded and transmitted, it serves two purposes. First, if the decoding order of pictures differs from their output order, DTS indicates the decoding order explicitly. Second, DTS guarantees a certain pre-decoder buffering behavior provided that the reception rate is close to the transmission rate at any moment. In networks where the end-to-end latency varies, the second use of DTS plays little or no role. Instead, received data is decoded as fast as possible provided that there is room in the post-decoder buffer for uncompressed pictures.
Carriage of DTS depends on the communication system and video coding standard in use. In MPEG-2 Systems, DTS can optionally be transmitted as one item in the header of a PES packet. In the JVT coding standard, DTS can optionally be carried as a part of Supplemental Enhancement Information (SEI), and it is used in the operation of the optional Hypothetical Reference Decoder. In ISO Base Media File Format, DTS has its own dedicated box type, the Decoding Time to Sample Box. In many systems, such as RTP-based streaming systems, DTS is not carried at all, because decoding order is assumed to be the same as transmission order and the exact decoding time does not play an important role.
H.263 optional Annex U and Annex W.6.12 specify a picture number that is incremented by 1 relative to the previous reference picture in decoding order. In the JVT coding standard, the frame number coding element is specified similarly to the picture number of H.263. The JVT coding standard specifies a particular type of an intra picture, called an instantaneous decoder refresh (IDR) picture. No subsequent picture can refer to pictures that are earlier than the IDR picture in decoding order. An IDR picture is often coded as a response to a scene change. In the JVT coding standard, frame number is reset to 0 at an IDR picture in order to improve error resilience in case of a loss of the IDR picture as is presented in FIGS. 5a and 5b. However, it should be noted that the scene information SEI message of the JVT coding standard can also be used for detecting scene changes.
H.263 picture number can be used to recover the decoding order of reference pictures. Similarly, the JVT frame number can be used to recover the decoding order of frames between an IDR picture (inclusive) and the next IDR picture (exclusive) in decoding order. However, because the complementary reference field pairs (consecutive pictures coded as fields that are of different parity) share the same frame number, their decoding order cannot be reconstructed from the frame numbers.
The H.263 picture number or JVT frame number of a non-reference picture is specified to be equal to the picture or frame number of the previous reference picture in decoding order plus 1. If several non-reference pictures are consecutive in decoding order, they share the same picture or frame number. The picture or frame number of a non-reference picture is also the same as the picture or frame number of the following reference picture in decoding order. The decoding order of consecutive non-reference pictures can be recovered using the Temporal Reference (TR) coding element in H.263 or the Picture Order Count (POC) concept of the JVT coding standard.
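The numbering rule for reference and non-reference pictures can be checked with a short sketch. It assumes that the first picture is a reference picture numbered 0 and ignores any wrap-around of the counter; the function name is hypothetical.

```python
def assign_frame_numbers(is_reference):
    """Return the picture/frame number of each picture in decoding order.

    is_reference[i] is True when picture i is a reference picture.
    """
    numbers = []
    prev_ref_number = None
    for ref in is_reference:
        # Every picture is numbered one above the previous reference picture.
        number = 0 if prev_ref_number is None else prev_ref_number + 1
        numbers.append(number)
        if ref:
            # Only reference pictures advance the counter.
            prev_ref_number = number
    return numbers


# Two reference pictures, two non-reference pictures, one reference picture.
numbers = assign_frame_numbers([True, True, False, False, True])
```

In this run the two consecutive non-reference pictures and the reference picture that follows them all receive the same number, which is why the number alone cannot recover their decoding order and TR or POC is needed.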
A presentation timestamp (PTS) indicates the time relative to a reference clock when a picture is supposed to be displayed. A presentation timestamp is also called a display timestamp, output timestamp, and composition timestamp.
Carriage of PTS depends on the communication system and video coding standard in use. In MPEG-2 Systems, PTS can optionally be transmitted as one item in the header of a PES packet. In the JVT coding standard, PTS can optionally be carried as a part of Supplemental Enhancement Information (SEI), and it is used in the operation of the Hypothetical Reference Decoder. In ISO Base Media File Format, PTS has its own dedicated box type, the Composition Time to Sample Box, where the presentation timestamp is coded relative to the corresponding decoding timestamp. In RTP, the RTP timestamp in the RTP packet header corresponds to PTS.
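Coding the presentation timestamp relative to the decoding timestamp can be illustrated as follows. The picture labels, the timestamp values and the one-tick time unit are hypothetical, and the sketch does not model the actual box syntax of the file format.

```python
def presentation_times(decoding_times, composition_offsets):
    """PTS of each sample is its DTS plus a per-sample offset."""
    return [dts + off for dts, off in zip(decoding_times, composition_offsets)]


# Samples listed in decoding order: I0, P3, B1, B2 (hypothetical labels).
labels = ["I0", "P3", "B1", "B2"]
dts = [0, 1, 2, 3]
offsets = [1, 3, 0, 0]

pts = presentation_times(dts, offsets)

# Sorting by PTS yields the output order.
output_order = [label for _, label in sorted(zip(pts, labels))]
```

The offsets are never negative in this example, since a picture cannot be presented before it has been decoded.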
Conventional video coding standards feature the Temporal Reference (TR) coding element that is similar to PTS in many aspects. In some of the conventional coding standards, such as MPEG-2 video, TR is reset to zero at the beginning of a Group of Pictures (GOP). In the JVT coding standard, there is no concept of time in the video coding layer. The Picture Order Count (POC) is specified for each frame and field and it is used similarly to TR in direct temporal prediction of B slices, for example. POC is reset to 0 at an IDR picture.
Transmission of Multimedia Streams
A multimedia streaming system consists of a streaming server and a number of players, which access the server via a network. The network is typically packet-oriented and provides little or no means to guarantee quality of service. The players fetch either pre-stored or live multimedia content from the server and play it back in real-time while the content is being downloaded. The type of communication can be either point-to-point or multicast. In point-to-point streaming, the server provides a separate connection for each player. In multicast streaming, the server transmits a single data stream to a number of players, and network elements duplicate the stream only if it is necessary.
When a player has established a connection to a server and requested a multimedia stream, the server begins to transmit the desired stream. The player does not start playing the stream back immediately, but rather it typically buffers the incoming data for a few seconds. Herein, this buffering is referred to as initial buffering. Initial buffering helps to maintain pauseless playback, because, in case of occasional increased transmission delays or network throughput drops, the player can decode and play buffered data.
In order to avoid unlimited transmission delay, it is uncommon to favor reliable transport protocols in streaming systems. Instead, the systems prefer unreliable transport protocols, such as UDP, which, on one hand, have a more stable transmission delay, but, on the other hand, also suffer from data corruption or loss.
RTP and RTCP protocols can be used on top of UDP to control real-time communications. RTP provides means to detect losses of transmission packets, to reassemble the correct order of packets in the receiving end, and to associate a sampling time-stamp with each packet. RTCP conveys information about how large a portion of packets were correctly received, and, therefore, it can be used for flow control purposes.
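Loss detection and reordering at the receiving end can be sketched with the 16-bit RTP sequence number. The function names are hypothetical, and the sketch assumes that all received packets lie within half of the sequence number space of the first expected packet.

```python
SEQ_MOD = 1 << 16  # RTP sequence numbers are 16 bits wide and wrap around


def forward_distance(start, seq):
    """Number of steps from start to seq, counting across the wrap-around."""
    return (seq - start) % SEQ_MOD


def reorder_and_find_losses(first_seq, received):
    """Return (packets in sending order, missing sequence numbers)."""
    if not received:
        return [], []
    got = set(received)
    ordered = sorted(got, key=lambda s: forward_distance(first_seq, s))
    span = forward_distance(first_seq, ordered[-1])
    expected = [(first_seq + i) % SEQ_MOD for i in range(span + 1)]
    missing = [s for s in expected if s not in got]
    return ordered, missing


# Packets arrive out of order around the wrap-around point; one is lost.
ordered, missing = reorder_and_find_losses(65534, [65535, 65534, 1, 2])
```

The list of missing numbers is the kind of information a receiver can summarize in RTCP reports or use to trigger a retransmission request.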
Transmission Errors
There are two main types of transmission errors, namely bit errors and packet errors. Bit errors are typically associated with a circuit-switched channel, such as a radio access network connection in mobile communications, and they are caused by imperfections of physical channels, such as radio interference. Such imperfections may result in bit inversions, bit insertions and bit deletions in transmitted data. Packet errors are typically caused by elements in packet-switched networks. For example, a packet router may become congested; i.e. it may get too many packets as input and cannot output them at the same rate. In this situation, its buffers overflow, and some packets get lost. Packet duplication and packet delivery in a different order than transmitted are also possible, but they are typically considered to be less common than packet losses. Packet errors may also be caused by the implementation of the transport protocol stack in use. For example, some protocols use checksums that are calculated in the transmitter and encapsulated with source-coded data. If there is a bit inversion error in the data, the receiver cannot arrive at the same checksum, and it may have to discard the received packet.
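How a bit inversion turns into a packet loss can be sketched as follows. CRC-32 from the Python standard library is used purely for illustration; actual transport protocols use their own checksum algorithms, and the function names are hypothetical.

```python
import zlib


def encapsulate(payload):
    """Transmitter: append a 4-byte checksum to the source-coded data."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def receive(packet):
    """Receiver: return the payload, or None if the checksum does not match."""
    payload, checksum = packet[:-4], packet[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != checksum:
        return None  # the packet is discarded, i.e. it becomes a packet loss
    return payload


packet = encapsulate(b"coded slice data")

# Invert a single bit somewhere in the payload.
corrupted = bytearray(packet)
corrupted[3] ^= 0x01

intact_result = receive(packet)
corrupted_result = receive(bytes(corrupted))
```

From the decoder's point of view the corrupted packet simply never arrives, so a bit error on the channel is observed as a packet error at the application.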
Second (2G) and third generation (3G) mobile networks, including GPRS, UMTS, and CDMA-2000, provide two basic types of radio link connections, acknowledged and non-acknowledged. An acknowledged connection is such that the integrity of a radio link frame is checked by the recipient (either the Mobile Station, MS, or the Base Station Subsystem, BSS), and, in case of a transmission error, a retransmission request is given to the other end of the radio link. Due to link layer retransmission, the originator has to buffer a radio link frame until a positive acknowledgement for the frame is received. In harsh radio conditions, this buffer may overflow and cause data loss. Nevertheless, it has been shown that it is beneficial to use the acknowledged radio link protocol mode for streaming services. A non-acknowledged connection is such that erroneous radio link frames are typically discarded.
Packet losses can either be corrected or concealed. Loss correction refers to the capability to restore lost data perfectly as if no losses had ever been introduced. Loss concealment refers to the capability to conceal the effects of transmission losses so that they should not be visible in the reconstructed video sequence.
When a player detects a packet loss, it may request a packet retransmission. Because of the initial buffering, the retransmitted packet may be received before its scheduled playback time. Some commercial Internet streaming systems implement retransmission requests using proprietary protocols. Work is going on in IETF to standardize a selective retransmission request mechanism as a part of RTCP.
A common feature for all of these retransmission request protocols is that they are not suitable for multicasting to a large number of players, as the network traffic may increase drastically. Consequently, multicast streaming applications have to rely on non-interactive packet loss control.
Point-to-point streaming systems may also benefit from non-interactive error control techniques. First, some systems may not contain any interactive error control mechanism or they prefer not to have any feedback from players in order to simplify the system. Second, retransmission of lost packets and other forms of interactive error control typically take a larger portion of the transmitted data rate than non-interactive error control methods. Streaming servers have to ensure that interactive error control methods do not reserve a major portion of the available network throughput. In practice, the servers may have to limit the amount of interactive error control operations. Third, transmission delay may limit the number of interactions between the server and the player, as all interactive error control operations for a specific data sample should preferably be done before the data sample is played back.
Non-interactive packet loss control mechanisms can be categorized into forward error control and loss concealment by post-processing. Forward error control refers to techniques in which a transmitter adds such redundancy to transmitted data that receivers can recover at least part of the transmitted data even if there are transmission losses. Error concealment by post-processing is totally receiver-oriented. These methods try to estimate the correct representation of erroneously received data.
Most video compression algorithms generate temporally predicted INTER or P pictures. As a result, a data loss in one picture causes visible degradation in the subsequent pictures that are temporally predicted from the corrupted one. Video communication systems can either conceal the loss in displayed images or freeze the latest correct picture onto the screen until a frame which is independent from the corrupted frame is received.
In conventional video coding standards, the decoding order is coupled with the output order. In other words, the decoding order of I and P pictures is the same as their output order, and the decoding order of a B picture immediately follows the decoding order of the latter reference picture of the B picture in output order. Consequently, it is possible to recover the decoding order based on known output order. The output order is typically conveyed in the elementary video bitstream in the Temporal Reference (TR) field and also in the system multiplex layer, such as in the RTP header. Thus, in conventional video coding standards, the presented problem did not exist.
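The recovery of decoding order from a known output order in conventional coding can be sketched as follows. The picture labels are hypothetical, the first character of a label is taken as the coding type, and every B picture is assumed to be followed by a reference picture in output order.

```python
def decoding_order(output_order):
    """Derive the decoding order of conventionally coded pictures."""
    decoded = []
    pending_b = []
    for picture in output_order:
        if picture.startswith("B"):
            # A B picture has to wait for its later reference picture.
            pending_b.append(picture)
        else:
            # I and P pictures keep their relative order; the B pictures
            # that precede them in output order are decoded right after.
            decoded.append(picture)
            decoded.extend(pending_b)
            pending_b.clear()
    decoded.extend(pending_b)  # not expected in a valid stream
    return decoded


order = decoding_order(["I0", "B1", "B2", "P3", "B4", "B5", "P6"])
```

This deterministic mapping is what is lost once decoding order and output order are decoupled.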
One solution that is evident to an expert in the field is to use a frame counter similar to the H.263 picture number without a reset to 0 at an IDR picture (as done in the JVT coding standard). However, some problems may occur when such a solution is used. FIG. 5a presents a situation in which a continuous numbering scheme is used. If, for example, the IDR picture I37 is lost (cannot be received/decoded), the decoder continues to decode the succeeding pictures, but it uses a wrong reference picture. This causes error propagation to succeeding frames until the next frame, which is independent from the corrupted frame, is received and decoded correctly. In the example of FIG. 5b the frame number is reset to 0 at an IDR picture. Now, in a situation in which IDR picture I0 is lost, the decoder notices that there is a big gap in picture numbering after the latest correctly decoded picture P36. The decoder can then assume that an error has occurred and can freeze the display to the picture P36 until the next frame which is independent from the corrupted frame is received and decoded.
Sub-Sequences
The JVT coding standard also includes a sub-sequence concept, which can enhance temporal scalability compared to the use of non-reference pictures, because inter-predicted chains of pictures can be disposed of as a whole without affecting the decodability of the rest of the coded stream.
A sub-sequence is a set of coded pictures within a sub-sequence layer. A picture shall reside in one sub-sequence layer and in one sub-sequence only. A sub-sequence shall not depend on any other sub-sequence in the same or in a higher sub-sequence layer. A sub-sequence in layer 0 can be decoded independently of any other sub-sequences and previous long-term reference pictures. FIG. 6a discloses an example of a picture stream containing sub-sequences at layer 1.
A sub-sequence layer contains a subset of the coded pictures in a sequence. Sub-sequence layers are numbered with non-negative integers. A layer having a larger layer number is a higher layer than a layer having a smaller layer number. The layers are ordered hierarchically based on their dependency on each other so that a layer does not depend on any higher layer and may depend on lower layers. In other words, layer 0 is independently decodable, pictures in layer 1 may be predicted from layer 0, pictures in layer 2 may be predicted from layers 0 and 1, etc. The subjective quality is expected to increase along with the number of decoded layers.
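The layer hierarchy can be sketched as a dependency check followed by pruning. The picture labels and the class are hypothetical, and the sketch checks only the layer rule; it does not model the rule that a sub-sequence may not depend on another sub-sequence in the same layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Picture:
    name: str
    layer: int
    refs: tuple = ()  # names of the pictures used for prediction


def layering_is_valid(pictures):
    """True if no picture is predicted from a higher layer."""
    layer_of = {p.name: p.layer for p in pictures}
    return all(layer_of[r] <= p.layer for p in pictures for r in p.refs)


def prune(pictures, max_layer):
    """Dispose of every picture above max_layer."""
    return [p for p in pictures if p.layer <= max_layer]


stream = [
    Picture("I0", 0),
    Picture("P1", 0, ("I0",)),
    Picture("B2", 1, ("I0", "P1")),
    Picture("P3", 0, ("P1",)),
]

base = prune(stream, 0)
base_names = [p.name for p in base]
```

When the layering is valid, the pruned stream remains decodable, because every picture that is kept refers only to pictures that are also kept.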
The sub-sequence concept is included in the JVT coding standard as follows: The required_frame_num_update_behaviour_flag equal to 1 in the sequence parameter set signals that the coded sequence may not contain all sub-sequences. The usage of the required_frame_num_update_behaviour_flag releases the requirement for the frame number increment of 1 for each reference frame. Instead, gaps in frame numbers are marked specifically in the decoded picture buffer. If a “missing” frame number is referred to in inter prediction, a loss of a picture is inferred. Otherwise, frames corresponding to “missing” frame numbers are handled as if they were normal frames inserted to the decoded picture buffer with the sliding window buffering mode. All the pictures in a disposed sub-sequence are consequently assigned a “missing” frame number in the decoded picture buffer, but they are never used in inter prediction for other sub-sequences.
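The handling of gaps in frame numbers can be sketched as follows. The function names are hypothetical, wrap-around of the frame number is ignored, and the decoded picture buffer is reduced to a plain set of frame numbers marked as "missing".

```python
def missing_frame_numbers(received):
    """Frame numbers skipped between consecutive received reference frames."""
    missing = set()
    previous = None
    for number in received:
        if previous is not None:
            # Every number in the gap is marked as "missing".
            missing.update(range(previous + 1, number))
        previous = number
    return missing


def loss_inferred(referenced, missing):
    """A loss is inferred only if a "missing" frame is used for prediction."""
    return any(number in missing for number in referenced)


# Frames 2 and 3 belonged to a disposed sub-sequence or were lost.
missing = missing_frame_numbers([0, 1, 4, 5])

harmless = loss_inferred([1], missing)   # refers to a frame that is present
damaging = loss_inferred([3], missing)   # refers to a "missing" frame
```

The gap itself is therefore not treated as an error; only a prediction reference into the gap distinguishes an unintentional loss from an intentionally disposed sub-sequence.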
The JVT coding standard also includes optional sub-sequence related SEI messages. The sub-sequence information SEI message is associated with the next slice in decoding order. It signals the sub-sequence layer and sub-sequence identifier (sub_seq_id) of the sub-sequence to which the slice belongs.
Each IDR picture contains an identifier (idr_pic_id). If two IDR pictures are consecutive in decoding order, without any intervening picture, the value of idr_pic_id shall change from the first IDR picture to the other one. If the current picture resides in a sub-sequence whose first picture in decoding order is an IDR picture, the value of sub_seq_id shall be the same as the value of idr_pic_id of the IDR picture.
The solution in JVT-D093 works correctly only if no data resides in sub-sequence layers 1 or above. If transmission order differs from decoding order and coded pictures reside in sub-sequence layer 1, their decoding order relative to pictures in sub-sequence layer 0 cannot be concluded based on sub-sequence identifiers and frame numbers. For example, consider the coding scheme presented in FIG. 6b, where output order runs from left to right, boxes indicate pictures, capital letters within boxes indicate coding types, numbers within boxes are frame numbers according to the JVT coding standard, underlined characters indicate non-reference pictures, and arrows indicate prediction dependencies. If pictures are transmitted in order I0, P1, P3, I0, P1, B2, B4, P5, it cannot be concluded to which independent GOP picture B2 belongs.
It could be argued that in the previous example the correct independent GOP for picture B2 could be concluded based on its output timestamp. However, the decoding order of pictures cannot be recovered based on output timestamps and picture numbers, because decoding order and output order are decoupled. Consider the following example (FIG. 6c) where output order runs from left to right, boxes indicate pictures, capital letters within boxes indicate coding types, numbers within boxes are frame numbers according to the JVT coding standard, and arrows indicate prediction dependencies. If pictures are transmitted out of decoding order, it cannot be reliably detected whether picture P4 should be decoded after P3 of the first or second independent GOP in output order.
Buffering
Streaming clients typically have a receiver buffer that is capable of storing a relatively large amount of data. Initially, when a streaming session is established, a client does not start playing the stream back immediately, but rather it typically buffers the incoming data for a few seconds. This buffering helps to maintain continuous playback, because, in case of occasional increased transmission delays or network throughput drops, the client can decode and play buffered data. Otherwise, without initial buffering, the client has to freeze the display, stop decoding, and wait for incoming data. The buffering is also necessary for either automatic or selective retransmission in any protocol level. If any part of a picture is lost, a retransmission mechanism may be used to resend the lost data. If the retransmitted data is received before its scheduled decoding or playback time, the loss is perfectly recovered.
Coded pictures can be ranked according to their importance in the subjective quality of the decoded sequence. For example, non-reference pictures, such as conventional B pictures, are subjectively least important, because their absence does not affect decoding of any other pictures. Subjective ranking can also be made on data partition or slice group basis. Coded slices and data partitions that are subjectively the most important can be sent earlier than their decoding order indicates, whereas coded slices and data partitions that are subjectively the least important can be sent later than their natural coding order indicates. Consequently, any retransmitted parts of the most important slice and data partitions are more likely to be received before their scheduled decoding or playback time compared to the least important slices and data partitions.
Pre-Decoder Buffering
Pre-decoder buffering refers to buffering of coded data before it is decoded. Initial buffering refers to pre-decoder buffering at the beginning of a streaming session. Initial buffering is conventionally done for two reasons explained below.
In conversational packet-switched multimedia systems, e.g., in IP-based video conferencing systems, different types of media are normally carried in separate packets. Moreover, packets are typically carried on top of a best-effort network that cannot guarantee a constant transmission delay, but rather the delay may vary from packet to packet. Consequently, packets having the same presentation (playback) time-stamp may not be received at the same time, and the reception interval of two packets may not be the same as their presentation interval (in terms of time). Thus, in order to maintain playback synchronization between different media types and to maintain the correct playback rate, a multimedia terminal typically buffers received data for a short period (e.g. less than half a second) in order to smooth out delay variation. Herein, this type of buffer component is referred to as a delay jitter buffer. Buffering can take place before and/or after media data decoding.
Delay jitter buffering is also applied in streaming systems. Due to the fact that streaming is a non-conversational application, the delay jitter buffer required may be considerably larger than in conversational applications. When a streaming player has established a connection to a server and requested a multimedia stream to be downloaded, the server begins to transmit the desired stream. The player does not start playing the stream back immediately, but rather it typically buffers the incoming data for a certain period, typically a few seconds. Herein, this buffering is referred to as initial buffering. Initial buffering provides the ability to smooth out transmission delay variations in a manner similar to that provided by delay jitter buffering in conversational applications. In addition, it may enable the use of link, transport, and/or application layer retransmissions of lost protocol data units (PDUs). The player can decode and play buffered data while retransmitted PDUs may be received in time to be decoded and played back at the scheduled moment.
Initial buffering in streaming clients provides yet another advantage that cannot be achieved in conversational systems: it allows the data rate of the media transmitted from the server to vary. In other words, media packets can be temporarily transmitted faster or slower than their playback rate as long as the receiver buffer does not overflow or underflow. The fluctuation in the data rate may originate from two sources.
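The effect of initial buffering on a varying transmission rate can be sketched with a simple tick-based simulation. The function name, the tick granularity, and the abstract data units are hypothetical choices for illustration; a real player would work with bytes and wall-clock time.

```python
def simulate_receiver_buffer(arrivals, drain_per_tick, initial_ticks, capacity):
    """Track receiver buffer occupancy and report overflow or underflow.

    arrivals:       data units received in each tick
    drain_per_tick: data units consumed by playback per tick
    initial_ticks:  ticks of initial buffering before playback starts
    capacity:       receiver buffer size in data units
    """
    level = 0
    trace = []
    for tick, received in enumerate(arrivals):
        level += received
        if level > capacity:
            return "overflow", trace + [level]
        if tick >= initial_ticks:          # playback has started
            if level < drain_per_tick:
                return "underflow", trace + [level]
            level -= drain_per_tick
        trace.append(level)
    return "ok", trace


# The arrival rate dips below the playback rate in ticks 2 and 3.
arrivals = [4, 4, 1, 1, 6, 2]
print(simulate_receiver_buffer(arrivals, 3, initial_ticks=2, capacity=10))
print(simulate_receiver_buffer(arrivals, 3, initial_ticks=0, capacity=10))
```

With two ticks of initial buffering the temporary dip in the arrival rate is absorbed, whereas starting playback immediately leads to an underflow for the same arrival pattern.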
First, the compression efficiency achievable in some media types, such as video, depends on the contents of the source data. Consequently, if a stable quality is desired, the bit-rate of the resulting compressed bit-stream varies. Typically, a stable audio-visual quality is subjectively more pleasing than a varying quality. Thus, initial buffering enables a more pleasing audio-visual quality to be achieved compared with a system without initial buffering, such as a video conferencing system.
Second, it is commonly known that packet losses in fixed IP networks occur in bursts. In order to avoid bursty errors and high peak bit- and packet-rates, well-designed streaming servers schedule the transmission of packets carefully. Packets may not be sent precisely at the rate they are played back at the receiving end, but rather the servers may try to achieve a steady interval between transmitted packets. A server may also adjust the rate of packet transmission in accordance with prevailing network conditions, reducing the packet transmission rate when the network becomes congested and increasing it if network conditions allow, for example.
Hypothetical Reference Decoder (HRD)/Video Buffering Verifier (VBV)
Many video coding standards include an HRD/VBV specification as an integral part of the standard. The HRD/VBV specification is a hypothetical decoder model that contains an input (pre-decoder) buffer. The coded data flows into the input buffer, typically at a constant bit rate. Coded pictures are removed from the input buffer at their decoding timestamps, which may be the same as their output timestamps. The input buffer has a certain size that depends on the profile and level in use. The HRD/VBV model is used to specify interoperability points in terms of processing and memory requirements. Encoders shall guarantee that a generated bitstream conforms to the HRD/VBV specification according to the HRD/VBV parameter values of a certain profile and level. Decoders claiming support for a certain profile and level shall be able to decode any bitstream that conforms to the HRD/VBV model.
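The input buffer model described above can be sketched as a simple constant-bit-rate check. This is a simplified illustration rather than the normative procedure of any standard: the function name and parameters are hypothetical, the buffer is assumed to start empty at time zero, and data is assumed to arrive continuously at the given rate.

```python
def check_vbv(picture_sizes, removal_times_ms, bits_per_ms, buffer_size_bits):
    """Check a coded picture sequence against a constant-bit-rate input buffer.

    picture_sizes:    coded size of each picture in bits, in decoding order
    removal_times_ms: decoding timestamp of each picture in milliseconds
    bits_per_ms:      constant input rate of the buffer
    buffer_size_bits: size of the input (pre-decoder) buffer

    Returns ("ok", None) or (violation, index of the offending picture).
    """
    fullness = 0
    previous_time = 0
    for index, (size, time) in enumerate(zip(picture_sizes, removal_times_ms)):
        # Data that flowed in since the previous removal instant.
        fullness += bits_per_ms * (time - previous_time)
        if fullness > buffer_size_bits:
            return ("overflow", index)
        # The whole picture must be in the buffer at its decoding time.
        if size > fullness:
            return ("underflow", index)
        fullness -= size
        previous_time = time
    return ("ok", None)


times = [80, 120, 160, 200]
print(check_vbv([6000, 3000, 4000, 5000], times, 100, 10000))
print(check_vbv([6000, 7000, 4000, 5000], times, 100, 10000))
```

In this sketch an encoder would react to an underflow by spending fewer bits on the offending picture, and to an overflow by spending more bits or inserting stuffing data.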
The HRD comprises a coded picture buffer for storing the coded data stream and a decoded picture buffer for storing decoded reference pictures and for reordering decoded pictures into display order. The HRD moves data between the buffers in the same way as the decoder of a decoding device does. However, the HRD need not decode the coded pictures entirely nor output the decoded pictures; it only checks that the decoding of the picture stream can be performed under the constraints given in the coding standard. When the HRD is operating, it receives a coded data stream and stores it in the coded picture buffer. In addition, the HRD removes coded pictures from the coded picture buffer and stores at least some of the corresponding hypothetically decoded pictures in the decoded picture buffer. The HRD is aware of the input rate at which the coded data flows into the coded picture buffer, the removal rate of the pictures from the coded picture buffer, and the output rate of the pictures from the decoded picture buffer. The HRD checks for coded or decoded picture buffer overflows, and it indicates if the decoding is not possible with the current settings. The HRD then informs the encoder about the buffering violation, whereupon the encoder can change the encoding parameters, for example by reducing the number of reference frames, to avoid the buffering violation. Alternatively or additionally, the encoder starts to encode the pictures with the new parameters and sends the encoded pictures to the HRD, which again performs the decoding of the pictures and the necessary checks. As yet another alternative, the encoder may discard the latest encoded frame and encode later frames so that no buffering violation happens.
Two types of decoder conformance have been specified in the JVT coding standard: output order conformance (VCL conformance) and output time conformance (VCL-NAL conformance). These types of conformance have been specified using the HRD specification. The output order conformance refers to the ability of the decoder to recover the output order of pictures correctly. The HRD specification includes a “bumping decoder” model that outputs the earliest uncompressed picture in output order when a new storage space for a picture is needed. The output time conformance refers to the ability of the decoder to output pictures at the same pace as the HRD model does. The output timestamp of a picture must always be equal to or smaller than the time when it would be removed from the “bumping decoder”.
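The behaviour of the "bumping decoder" can be illustrated with a minimal sketch. It assumes that each decoded picture carries a numeric output order value and that every picture in the buffer may be output at any time; reference picture marking and timing are ignored, and the function name is hypothetical.

```python
import heapq


def bumping_output(decoded_order_counts, buffer_size):
    """Return the output sequence produced by a simplified bumping process.

    decoded_order_counts: output order value of each picture, in decoding order
    buffer_size:          number of pictures the decoded picture buffer can hold
    """
    buffer = []   # min-heap keyed on output order
    output = []
    for order_count in decoded_order_counts:
        # Storage is needed for a new picture: bump the earliest one out.
        if len(buffer) == buffer_size:
            output.append(heapq.heappop(buffer))
        heapq.heappush(buffer, order_count)
    # Flush the remaining pictures at the end of the stream.
    while buffer:
        output.append(heapq.heappop(buffer))
    return output


decoding_order = [0, 3, 1, 2, 6, 4, 5]
print(bumping_output(decoding_order, 2))  # output order recovered
print(bumping_output(decoding_order, 1))  # buffer too small, order not recovered
```

The example shows that output order is recovered only when the buffer is large enough for the amount of reordering present in the stream.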
Interleaving
Frame interleaving is a commonly used technique in audio streaming. In the frame interleaving technique, one RTP packet contains audio frames that are not consecutive in decoding or output order. If one packet in the audio packet stream is lost, the correctly received packets contain the neighbouring audio frames, which can be used for concealing the lost frames (by some form of interpolation). Many audio coding RTP payload and MIME type specifications allow the maximum amount of interleaving in one packet to be signalled in terms of audio frames.
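A minimal sketch of frame interleaving follows. The stride-based pattern and the function names are assumptions chosen for illustration; actual RTP payload formats define their own interleaving patterns and signalling.

```python
def interleave(frames, num_packets):
    """Spread consecutive frames over packets: packet k gets frames k, k+N, k+2N, ..."""
    return [frames[k::num_packets] for k in range(num_packets)]


def deinterleave(received, num_packets, total_frames):
    """Rebuild the frame sequence from the received packets.

    received: dict mapping packet index to its list of frames (lost packets absent)
    Frames carried in lost packets are returned as None.
    """
    frames = [None] * total_frames
    for k, packet in received.items():
        for j, frame in enumerate(packet):
            frames[k + j * num_packets] = frame
    return frames


packets = interleave(list(range(8)), 4)
# Simulate the loss of packet 1.
received = {k: p for k, p in enumerate(packets) if k != 1}
print(deinterleave(received, 4, 8))
```

After the loss of one packet, each missing frame still has correctly received neighbours on both sides, which is what makes concealment by interpolation possible.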
In some prior art encoding/decoding methods, the required buffer size is signalled as a count of transmission units.