When video is streamed over the Internet and played back through a Web browser or media player, the video is delivered in digital form. Digital video is also used when video is delivered through many broadcast services, satellite services and cable television services. Real-time videoconferencing often uses digital video, and digital video is used during video capture with most smartphones, Web cameras and other video capture devices.
Digital video can consume an extremely large number of bits. Engineers use compression (also called source coding or source encoding) to reduce the bitrate of digital video. Compression decreases the cost of storing and transmitting video information by converting the information into a lower bitrate form. Decompression (also called decoding) reconstructs a version of the original information from the compressed form. A “codec” is an encoder/decoder system.
Over the last two decades, various video codec standards have been adopted, including the H.261, H.262 (MPEG-2 or ISO/IEC 13818-2), H.263 and H.264 (AVC or ISO/IEC 14496-10) standards and the MPEG-1 (ISO/IEC 11172-2), MPEG-4 Visual (ISO/IEC 14496-2) and SMPTE 421M standards. In particular, decoding according to the H.264 standard is widely used in game consoles and media players to play back encoded video. H.264 decoding is also widely used in set-top boxes, personal computers, smartphones and other mobile computing devices for playback of encoded video streamed over the Internet or other networks. A video codec standard typically defines options for the syntax of an encoded video bitstream, detailing parameters in the bitstream when particular features are used in encoding and decoding. In many cases, a video codec standard also provides details about the decoding operations a decoder should perform to achieve correct results in decoding.
Several factors affect quality of video information, including spatial resolution, frame rate and distortion. Spatial resolution generally refers to the number of samples in a video image. Images with higher spatial resolution tend to look crisper than other images and contain more discernable details. Frame rate is a common term for temporal resolution for video. Video with higher frame rate tends to mimic the smooth motion of natural objects better than other video, and can similarly be considered to contain more detail in the temporal dimension. During encoding, an encoder can selectively introduce distortion to reduce bitrate, usually by quantizing video information during encoding. If an encoder introduces little distortion, the encoder maintains quality at the cost of higher bitrate. An encoder can introduce more distortion to reduce bitrate, but quality typically suffers. For these factors, the tradeoff for high quality is the cost of storing and transmitting the information in terms of bitrate.
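The tradeoff between distortion and bitrate can be illustrated with a minimal sketch. The example below is not the quantization scheme of any particular codec standard; it assumes a uniform scalar quantizer applied to a synthetic sample ramp, and it uses empirical entropy per sample as a rough stand-in for bitrate. The function names and the step sizes are hypothetical and chosen only for illustration.

```python
import math

def quantize(samples, step):
    """Uniform scalar quantization: snap each sample to a multiple of step."""
    return [round(s / step) * step for s in samples]

def mse(original, reconstructed):
    """Mean squared error, used here as the distortion measure."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)

def entropy_bits(values):
    """Empirical entropy per sample, a rough proxy for bitrate (an assumption)."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Synthetic data: a repeating ramp of integer sample values 0..63.
samples = [float(i % 64) for i in range(256)]

results = []
for step in (1, 4, 16):
    q = quantize(samples, step)
    results.append((step, mse(samples, q), entropy_bits(q)))
    print(f"step={step:2d}  distortion(MSE)={results[-1][1]:.2f}  "
          f"bits/sample={results[-1][2]:.2f}")
```

In this sketch, a larger quantization step yields fewer bits per sample and higher distortion, which mirrors the tradeoff described above.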
When encoded video is delivered over the Internet to set-top boxes, mobile computing devices or personal computers, one video source can provide encoded video to multiple receiver devices. Or, in a videoconference, one device may deliver encoded video to multiple receiver devices. Different receiver devices may have different screen sizes or computational capabilities, with some devices able to decode and play back high quality video, and other devices only able to play back lower quality video. Also, different receiver devices may use network connections having different bandwidths, with some devices able to receive higher bitrate (higher quality) encoded video, and other devices only able to receive lower bitrate (lower quality) encoded video.
In such scenarios, with simulcast encoding and delivery, video is encoded in multiple different ways to provide versions of the video at different levels of distortion, temporal quality and/or spatial resolution quality. Each version of video is represented in a bitstream that can be decoded to reconstruct that version of the video, independent of decoding other versions of the video. A video source (or given receiver device) can select an appropriate version of video for delivery to the receiver device, considering available network bandwidth, screen size, computational capabilities, or another characteristic of the receiver device.
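The selection of a simulcast version for a receiver device can be sketched as follows. This is one possible selection rule under stated assumptions, not a rule taken from any standard or product: the `Version` and `Receiver` types, their fields and the example bitrates are all hypothetical, and the rule simply picks the highest-bitrate version that fits the receiver's bandwidth, screen height and frame rate limits.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass(frozen=True)
class Version:
    """One independently decodable simulcast version (hypothetical fields)."""
    name: str
    bitrate_kbps: int
    height: int
    fps: int

@dataclass(frozen=True)
class Receiver:
    """Receiver characteristics considered during selection (hypothetical fields)."""
    bandwidth_kbps: int
    screen_height: int
    max_fps: int

def select_version(versions: Sequence[Version], rx: Receiver) -> Optional[Version]:
    """Return the highest-bitrate version the receiver can handle, or None."""
    candidates = [
        v for v in versions
        if v.bitrate_kbps <= rx.bandwidth_kbps
        and v.height <= rx.screen_height
        and v.fps <= rx.max_fps
    ]
    return max(candidates, key=lambda v: v.bitrate_kbps, default=None)

# Illustrative versions; the numbers are made up.
versions = [
    Version("low", 300, 360, 15),
    Version("mid", 1000, 720, 30),
    Version("high", 3000, 1080, 30),
]

phone = Receiver(bandwidth_kbps=1200, screen_height=720, max_fps=30)
chosen = select_version(versions, phone)
print(chosen.name if chosen else "no suitable version")
```

Either the video source or the receiver device could run such a rule, since each version is decodable independent of the others.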
Scalable video coding (SVC) and decoding are another way to provide different versions of video at different levels of distortion, temporal quality and/or spatial resolution quality. With SVC, an encoder splits video into a base layer and one or more enhancement layers. The base layer alone provides a reconstruction of the video at a lower quality level (e.g., lower frame rate, lower spatial resolution and/or higher distortion). One or more enhancement layers can be reconstructed and added to reconstructed base layer video to increase video quality in terms of higher frame rate, higher spatial resolution and/or lower distortion. Scalability in terms of distortion is sometimes called SNR scalability.
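SNR scalability can be illustrated with a toy example. The sketch below is not the layered coding of H.264 SVC or any other standard; it assumes a coarse uniform quantizer for the base layer and a finer quantizer applied to the residual for a single enhancement layer. The function names, sample values and step sizes are hypothetical.

```python
def quantize(values, step):
    """Uniform scalar quantization: snap each value to a multiple of step."""
    return [round(v / step) * step for v in values]

def encode_layers(samples, base_step, enh_step):
    """Split samples into a coarse base layer and a residual enhancement layer."""
    base = quantize(samples, base_step)
    residual = [s - b for s, b in zip(samples, base)]
    enhancement = quantize(residual, enh_step)
    return base, enhancement

def reconstruct(base, enhancement=None):
    """Decode the base layer alone, or add the enhancement layer to refine it."""
    if enhancement is None:
        return list(base)
    return [b + e for b, e in zip(base, enhancement)]

def mse(original, reconstructed):
    """Mean squared error, used here as the distortion measure."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)

samples = [3, 10, 17, 22, 35, 41]  # made-up sample values
base, enhancement = encode_layers(samples, base_step=16, enh_step=4)

base_only_error = mse(samples, reconstruct(base))
refined_error = mse(samples, reconstruct(base, enhancement))
print(f"base layer only:   MSE = {base_only_error:.2f}")
print(f"base + enhancement: MSE = {refined_error:.2f}")
```

A receiver that decodes only the base layer obtains a higher-distortion reconstruction, while a receiver that also decodes the enhancement layer obtains a lower-distortion reconstruction from the same bitstream.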
In some respects, SVC outperforms simulcast transmission because SVC exploits redundancy between different versions of the video. Usually, for a given level of quality, the combined bitrate of the base layer and enhancement layer(s) is slightly higher than the bitrate of an independently decodable simulcast version of the video. The bitrate of an enhancement layer by itself, however, is lower than the bitrate of the independently decodable version of the video. For all of the levels of quality, the collective bitrate of the base layer and enhancement layers is much lower than the collective bitrate of the different simulcast versions of the video. For this reason, SVC reduces uplink bandwidth utilization when video is uploaded from an encoder site to a delivery server on a network.
The performance of SVC can be limited in other respects, however. First, many hardware encoders do not support SVC that is fully scalable across all aspects of quality. For example, many web cameras can encode H.264 video with at most two temporal layers, which limits possible operational points for quality layers. Second, in extreme cases, when quality differs too much between two successive SNR layers, the efficiency of SVC can be worse than simply splitting the video into two simulcast streams for the two levels of SNR quality, respectively. Third, if downstream network bandwidth is a bottleneck between a delivery server and receiver devices, simulcast may be preferable since SVC video uses more bits than simulcast video for a given level of quality. Fourth, some SVC bitstreams require that temporal prediction structure be the same across spatial quality layers and SNR quality layers, which can limit the flexibility of SVC. Finally, providing spatial scalability in an SVC bitstream can increase computational requirements, memory usage and encoding latency. When spatial quality layers for a higher resolution depend on spatial quality layers at a lower resolution, the spatial layers at the lower resolution are typically generated, encoded, reconstructed and buffered for use in predicting the higher resolution layers, which adds delay and frame memory. These costs of spatial scalability have hindered its adoption in hardware encoders and decoders.
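The bitrate relationships described above can be checked with simple arithmetic. The figures in the sketch below are hypothetical and chosen only so that the stated relationships hold; they are not measurements of any encoder.

```python
from itertools import accumulate

# Hypothetical bitrates in kbps for three quality levels (illustration only).
simulcast_kbps = [500, 1500, 4000]  # one independent stream per quality level
svc_layer_kbps = [500, 1100, 2700]  # base layer, then two enhancement layers

# With SVC, a receiver at a given level needs the base layer plus every
# enhancement layer up to that level.
svc_cumulative_kbps = list(accumulate(svc_layer_kbps))  # [500, 1600, 4300]

# Uplink cost from the encoder site to the delivery server: all versions
# for simulcast, all layers for SVC.
uplink_simulcast = sum(simulcast_kbps)
uplink_svc = sum(svc_layer_kbps)

for level, (sim, cum) in enumerate(zip(simulcast_kbps, svc_cumulative_kbps)):
    print(f"level {level}: simulcast {sim} kbps, SVC cumulative {cum} kbps")
print(f"uplink: simulcast {uplink_simulcast} kbps, SVC {uplink_svc} kbps")
```

With these assumed figures, each SVC operating point above the base layer costs slightly more than the matching simulcast version, each enhancement layer by itself costs less than that version, and the total uplink bitrate is lower for SVC.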