1. Background and Relevant Art
Computer systems and related technology affect many aspects of society. Indeed, the computer system's ability to process information has transformed the way we live and work. Computer systems now commonly perform a host of tasks (e.g., word processing, scheduling, accounting, image processing, etc.) that prior to the advent of the computer system were performed manually. More recently, computer systems have been coupled to one another and to other electronic devices to form both wired and wireless computer networks over which the computer systems and other electronic devices can transfer electronic data. Accordingly, the performance of many computing tasks is distributed across a number of different computer systems and/or a number of different computing environments. For example, distributed applications can have components at a number of different computer systems.
In some environments, video data is streamed from one computer system to another computer system over a computer network, such as, for example, the Internet. At many resolutions, transferring raw video is not practical due to the sheer volume of data. As such, compression algorithms are used to reduce the volume of data transferred over a network. A sending computer system sends compressed (encoded) video data to a receiving computer system over a network. The receiving computer system receives the compressed video data over the network. The receiving computer system then decompresses (decodes) the compressed video data for presentation at a video output device, such as a television or computer monitor. Video data can be compressed in accordance with various different encoding formats, including H.264 (Advanced Video Coding (AVC)), H.265 (High Efficiency Video Coding (HEVC)), VP8, VP9, etc.
When decoding encoded video data, some decoding operations can be handled by a software decoder (which may also be referred to as a host decoder). The software decoder can offload other decoding operations to hardware, such as, for example, to a Graphics Processing Unit (GPU) (which may also be referred to as an accelerator), for hardware decoding. Hardware decoding can be used to speed up CPU-intensive operations (e.g., inverse discrete cosine transforms (iDCTs)). Hardware decoding is often necessary for decoding HEVC content, which is commonly encoded at a high bit rate, high frame rate, and high resolution.
To offload a decoding operation for hardware decoding, the software decoder conveys one or more input buffers containing information needed to perform the operation. The software decoder can also form one or more output buffers for storing results from hardware decoding. Decode hardware (e.g., a GPU) accesses information from input buffers, performs the decoding operation, and outputs results to output buffers.
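The buffer handoff described above can be pictured with a minimal Python sketch. Everything in it is hypothetical and for illustration only: the `OffloadRequest` type, the `fake_decode_hardware` function, and the placeholder arithmetic are assumptions, not an actual decoder or accelerator interface.

```python
from dataclasses import dataclass


@dataclass
class OffloadRequest:
    """Hypothetical container for one offloaded decoding operation."""
    input_buffers: list   # information the hardware needs for the operation
    output_buffers: list  # storage formed by the software decoder for results


def fake_decode_hardware(request):
    """Stand-in for decode hardware (assumption, not a real device API).

    Reads each input buffer, applies a placeholder operation, and writes
    the result into the matching output buffer.
    """
    for src, dst in zip(request.input_buffers, request.output_buffers):
        for i, value in enumerate(src):
            dst[i] = 255 - value  # placeholder only, not a real decode step


# The software (host) decoder conveys one 8x8 block of input data and
# forms a same-sized output buffer for the results.
block_data = bytes(range(64))
request = OffloadRequest(input_buffers=[block_data],
                         output_buffers=[bytearray(64)])
fake_decode_hardware(request)
```

The point of the sketch is only the division of labor: the software decoder owns and sizes the buffers, while the hardware reads from one set and writes to the other.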
A video bit stream can include parameters defining a code block size, such as, for example, 8×8 pixels. The software decoder can access the parameters from the video bit stream and process data in accordance with the parameters. When decoding operations are to be offloaded to the decode hardware, the software decoder can also indicate the code block size to the hardware. The software decoder can also allocate input and output buffers of the code block size for use by the decode hardware.
However, decode hardware can have surface alignment requirements differing from a defined code block size. Surface alignment requirements essentially dictate the buffer size that the decode hardware uses when processing data. Due to differences between surface alignment requirements and a defined code block size, buffers allocated by a software decoder can differ in size from buffers used within decode hardware.
For example, code blocks in a video bit stream can be 8×8 pixels and surface alignment requirements of decode hardware can be 32×32 pixels. Thus, when offloading an operation to hardware, a software decoder indicates the 8×8 code block size to the decode hardware and allocates 8×8 pixel input and output buffers for use by the decode hardware. The decode hardware can read an 8×8 input buffer into a 32×32 internal buffer. The hardware can perform the offloaded operation on data in the 32×32 internal buffer to generate output. The output is also stored in a 32×32 internal buffer.
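The size mismatch in this example can be made concrete with a short Python sketch. It assumes that an alignment requirement means rounding each dimension up to the next multiple of the alignment; the helper names `align_up` and `internal_buffer_shape` are hypothetical.

```python
def align_up(value, alignment):
    """Round value up to the nearest multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def internal_buffer_shape(block_w, block_h, align_w, align_h):
    """Dimensions of the hardware-internal buffer for one code block,
    assuming each dimension is rounded up to the surface alignment."""
    return align_up(block_w, align_w), align_up(block_h, align_h)


code_block = (8, 8)          # code block size from the bit stream
surface_alignment = (32, 32)  # alignment requirement of the decode hardware

internal = internal_buffer_shape(*code_block, *surface_alignment)
block_pixels = code_block[0] * code_block[1]
internal_pixels = internal[0] * internal[1]

print(internal)                         # (32, 32)
print(internal_pixels // block_pixels)  # 16
```

Under these assumptions the internal buffer holds 1,024 pixels while the software-allocated buffer holds 64, a sixteenfold difference per code block.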
However, since the code block size is 8×8, the hardware has to perform additional memory operations to prepare output for direct storage in an 8×8 output buffer. These additional memory operations can be performed on a per code block basis. When processing video bit streams (which can contain large numbers of code blocks per frame), these additional memory operations decrease performance and consume significant processing and power resources at the decode hardware.
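As a rough illustration of this cost, the Python sketch below models one plausible form of the additional memory operation (copying the 8×8 region out of a row-major 32×32 internal buffer) and counts how many code blocks a frame contains. The source does not specify what the hardware's memory operations are, so the copy routine, the function names, and the 1920×1080 frame size are all assumptions.

```python
import math


def crop_to_code_block(internal, internal_stride, block_w, block_h):
    """Copy the top-left block_w x block_h region of a row-major internal
    buffer into a tightly packed output buffer (assumed form of the extra
    memory operation)."""
    out = bytearray(block_w * block_h)
    for row in range(block_h):
        src = row * internal_stride
        out[row * block_w:(row + 1) * block_w] = internal[src:src + block_w]
    return out


def code_blocks_per_frame(frame_w, frame_h, block_w, block_h):
    """Number of code blocks needed to cover a frame."""
    return math.ceil(frame_w / block_w) * math.ceil(frame_h / block_h)


# One 32x32 internal buffer (1,024 bytes) cropped to an 8x8 output buffer.
internal = bytearray(i % 256 for i in range(32 * 32))
output = crop_to_code_block(internal, internal_stride=32, block_w=8, block_h=8)

# Assumed 1920x1080 frame with 8x8 code blocks.
blocks = code_blocks_per_frame(1920, 1080, 8, 8)
print(blocks)  # 32400
```

With one such copy per code block, an assumed 1920×1080 frame would require 32,400 additional copy operations, which is why per-block overhead adds up quickly.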