When video is streamed over the Internet and played back through a Web browser or media player, the video is delivered in digital form. Digital video is also used when video is delivered through many broadcast services, satellite services and cable television services. Real-time videoconferencing often uses digital video, and digital video is used during video capture with most smartphones, Web cameras and other video capture devices.
Digital video can consume an extremely large number of bits. The number of bits used per second of represented video content is known as the bit rate. Engineers use compression (also called source coding or source encoding) to reduce the bit rate of digital video. Compression decreases the cost of storing and transmitting video information by converting the information into a lower bit rate form. Decompression (also called decoding) reconstructs a version of the original information from the compressed form. A “codec” is an encoder/decoder system.
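As a rough illustration of why compression is needed, the following Python sketch computes the bit rate of uncompressed video and the compression ratio needed to reach a target bit rate. The function names, the 4:2:0 sampling assumption (1.5 samples per pixel), the 8-bit sample depth and the 8 Mbit/s target are illustrative assumptions, not values taken from the text above.

```python
def raw_bit_rate(width, height, fps, bits_per_sample=8, samples_per_pixel=1.5):
    """Return the bit rate (bits per second) of uncompressed video.

    samples_per_pixel=1.5 assumes 4:2:0 chroma sampling: one luma sample
    per pixel plus two chroma planes at quarter resolution each.
    """
    samples_per_frame = width * height * samples_per_pixel
    return int(samples_per_frame * bits_per_sample * fps)


def compression_ratio(raw_bps, compressed_bps):
    """Return how many times smaller the compressed bit rate is."""
    return raw_bps / compressed_bps


# Hypothetical example: 1920x1080 video at 30 frames per second.
raw = raw_bit_rate(1920, 1080, 30)
ratio = compression_ratio(raw, 8_000_000)  # assumed 8 Mbit/s target
print(raw, round(ratio, 1))
```

Under these assumptions the uncompressed stream needs roughly 746 Mbit/s, so an encoder would have to reduce the bit rate by a factor of more than 90 to reach the assumed target.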
Over the last two decades, various video codec standards have been adopted, including the H.261, H.262 (MPEG-2 or ISO/IEC 13818-2), H.263 and H.264 (MPEG-4 AVC or ISO/IEC 14496-10) standards and the MPEG-1 (ISO/IEC 11172-2), MPEG-4 Visual (ISO/IEC 14496-2) and SMPTE 421M standards. In particular, decoding according to the H.264 standard is widely used in game consoles and media players to play back encoded video. H.264 decoding is also widely used in set-top boxes, personal computers, smartphones and other mobile computing devices for playback of encoded video streamed over the Internet or other networks. A video codec standard typically defines options for the syntax of an encoded video bitstream, detailing parameters in the bitstream when particular features are used in encoding and decoding. In many cases, a video codec standard also provides details about the decoding operations a decoder should perform to achieve correct results in decoding.
Several factors affect the quality of video information, including spatial resolution, frame rate and distortion. Spatial resolution generally refers to the number of samples in a video image. Images with higher spatial resolution tend to look crisper than other images and contain more discernible details. Frame rate is a common term for the temporal resolution of video. Video with a higher frame rate tends to mimic the smooth motion of natural objects better than other video, and can similarly be considered to contain more detail in the temporal dimension. During encoding, an encoder can selectively introduce distortion to reduce bit rate, usually by quantizing video information. If an encoder introduces little distortion, the encoder maintains quality at the cost of higher bit rate. An encoder can introduce more distortion to reduce bit rate, but quality typically suffers. For each of these factors, the tradeoff for high quality is the cost of storing and transmitting the information in terms of bit rate.
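The relationship between quantization and distortion can be sketched with a simple uniform scalar quantizer. This is a minimal illustration, not the quantization process of any particular codec standard; the function names, the sample coefficient values and the step sizes are all assumptions chosen for the example.

```python
def quantize(values, step):
    """Map each value to an integer level using a uniform step size."""
    return [round(v / step) for v in values]


def dequantize(levels, step):
    """Reconstruct approximate values from integer levels."""
    return [level * step for level in levels]


def mean_squared_error(original, reconstructed):
    """Average squared difference, used here as the distortion measure."""
    total = sum((a - b) ** 2 for a, b in zip(original, reconstructed))
    return total / len(original)


# Hypothetical transform coefficients for one small block.
coefficients = [52, -13, 7, 3, -2, 1, 0, 0]

for step in (4, 16):
    levels = quantize(coefficients, step)
    distortion = mean_squared_error(coefficients, dequantize(levels, step))
    nonzero = sum(1 for level in levels if level != 0)
    print(step, levels, nonzero, distortion)
```

With the larger step size, fewer levels are nonzero, so the block would typically cost fewer bits to represent, while the mean squared error of the reconstruction is higher.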
When encoded video is delivered over the Internet to set-top boxes, mobile computing devices or personal computers, one video source can provide encoded video to multiple receiver devices. Or, in a videoconference, one device may deliver encoded video to multiple receiver devices. Different receiver devices may have different screen sizes or computational capabilities, with some devices able to decode and play back high quality video, and other devices only able to play back lower quality video. Also, different receiver devices may use network connections having different bandwidths, with some devices able to receive higher bit rate (higher quality) encoded video, and other devices only able to receive lower bit rate (lower quality) encoded video.
Scalable video coding and decoding provide one way to deliver different versions of video at different levels of distortion, temporal quality and/or spatial resolution quality. With scalable video coding, an encoder splits video into a base layer and one or more enhancement layers. The base layer alone provides a reconstruction of the video at a lower quality level (e.g., lower frame rate, lower spatial resolution and/or higher distortion). One or more enhancement layers can be decoded along with the base layer video data to provide a reconstruction with increased video quality in terms of higher frame rate, higher spatial resolution and/or lower distortion. Scalability in terms of distortion is sometimes called signal-to-noise ratio (“SNR”) scalability. A receiver device can receive a scalable video bitstream and decode those parts of it appropriate for the receiver device, which may be the base layer video only, the base layer video plus some of the enhancement layer video, or the base layer video plus all enhancement layer video. Or, a video source, media server or given receiver device can select an appropriate version of video for delivery to the receiver device, considering available network bandwidth, screen size, computational capabilities, or another characteristic of the receiver device, and deliver only layers for that version of the video to the receiver device.
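The layer selection described above can be sketched as follows. This is a simplified illustration under stated assumptions: the `Layer` type, its fields, the `select_layers` function and the example bit rates are hypothetical, and each enhancement layer is assumed to depend on all layers before it.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Layer:
    name: str
    bit_rate_kbps: int  # additional bit rate contributed by this layer
    width: int          # spatial resolution when decoded up to this layer
    height: int
    fps: int            # frame rate when decoded up to this layer


def select_layers(layers: List[Layer], bandwidth_kbps: int,
                  max_width: int, max_height: int) -> List[Layer]:
    """Pick the base layer plus as many enhancement layers as fit.

    Layers are ordered base first. Selection stops at the first layer
    that exceeds the bandwidth budget or the screen size, because later
    layers are assumed to depend on it.
    """
    chosen: List[Layer] = []
    total_kbps = 0
    for layer in layers:
        fits_bandwidth = total_kbps + layer.bit_rate_kbps <= bandwidth_kbps
        fits_screen = layer.width <= max_width and layer.height <= max_height
        if not (fits_bandwidth and fits_screen):
            break
        chosen.append(layer)
        total_kbps += layer.bit_rate_kbps
    return chosen


# Hypothetical three-layer stream.
EXAMPLE_LAYERS = [
    Layer("base", 500, 640, 360, 15),
    Layer("enhancement 1", 1000, 1280, 720, 30),
    Layer("enhancement 2", 2500, 1920, 1080, 30),
]

# A receiver with about 2 Mbit/s of bandwidth and a 1080p screen.
for layer in select_layers(EXAMPLE_LAYERS, 2000, 1920, 1080):
    print(layer.name)
```

The same function could run on the receiver device, which then decodes only the chosen layers, or on a media server, which then delivers only those layers. An empty result means that even the base layer does not fit the constraints.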
While some video decoding operations are relatively simple, others are computationally complex. For example, inverse frequency transforms, fractional sample interpolation operations for motion compensation, in-loop deblock filtering, post-processing filtering, color conversion, and video re-sizing can require extensive computation. This computational complexity can be problematic in various scenarios, such as decoding of high-quality, high-bit rate video (e.g., compressed high-definition video).
Thus, some decoders use hardware acceleration to offload certain computationally intensive operations to a graphics processor or other special-purpose hardware. For example, in some configurations, a computer system includes a primary central processing unit (“CPU”) as well as a graphics processing unit (“GPU”) or other hardware specially adapted for graphics processing or video decoding. A decoder uses the primary CPU as a host decoder to control overall decoding and uses the GPU to perform operations that collectively require extensive computation, thereby accomplishing video acceleration. In a typical software architecture for hardware-accelerated video decoding, a host decoder controls overall decoding and may perform some operations, such as bitstream parsing, using the CPU. The decoder signals control information (e.g., picture parameters, slice parameters) and encoded data to a device driver for an accelerator (e.g., with a GPU) across an acceleration interface. Some existing hardware acceleration architectures are adapted for decoding non-scalable bitstreams, but they do not sufficiently address the requirements of hardware-accelerated decoding of video encoded using scalable video coding.
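The division of work between a host decoder and an accelerator can be sketched in Python as shown below. This is a conceptual model only and does not represent any real acceleration interface or driver API: the class names, the dictionary keys and the `execute` method are hypothetical, and the accelerator is a stub that merely records what it receives.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PictureSubmission:
    """Data the host passes across the (hypothetical) acceleration interface."""
    picture_params: Dict[str, int]
    slice_params: List[Dict[str, int]]
    encoded_data: bytes


class StubAccelerator:
    """Stands in for a device driver; it records submissions, nothing more."""

    def __init__(self) -> None:
        self.received: List[PictureSubmission] = []

    def execute(self, submission: PictureSubmission) -> str:
        self.received.append(submission)
        return "picture %d decoded" % submission.picture_params["picture_index"]


class HostDecoder:
    """Controls overall decoding on the CPU and offloads the heavy work."""

    def __init__(self, accelerator: StubAccelerator) -> None:
        self.accelerator = accelerator

    def decode_picture(self, picture_index: int, slices: List[bytes]) -> str:
        # Stand-in for bitstream parsing: locate each slice in the data buffer.
        slice_params = []
        offset = 0
        for data in slices:
            slice_params.append({"offset": offset, "size": len(data)})
            offset += len(data)
        submission = PictureSubmission(
            picture_params={"picture_index": picture_index,
                            "slice_count": len(slices)},
            slice_params=slice_params,
            encoded_data=b"".join(slices),
        )
        # Control information and encoded data cross the interface together.
        return self.accelerator.execute(submission)


accelerator = StubAccelerator()
host = HostDecoder(accelerator)
print(host.decode_picture(0, [b"\x01\x02", b"\x03\x04\x05"]))
```

In this sketch the host performs only lightweight bookkeeping, while everything computationally intensive would happen behind `execute`. A submission structure of this kind describes a single picture and has no notion of base or enhancement layers.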