Today, at least two different general approaches are employed for implementing processing units: the general-purpose central processing unit (CPU) and the special-purpose graphics processing unit (GPU). GPUs are specialized for calculating 3-dimensional (3D) scenes to be mapped to 2-dimensional (2D) scenes for display, and have architectures that enable highly parallel processing. GPUs therefore have high processing power. However, most common applications are optimized for sequential processing on CPUs.
Utilizing GPUs to accelerate video encoding and decoding is therefore desirable. Traditionally, in order to benefit from the powerful GPU, computation tasks (such as image or video processing) had to be reformulated as 3D rendering tasks, so that their data were organized as graphics data, and a graphics API (Application Programming Interface) had to be used. This makes GPGPU (General-Purpose computation on GPU) difficult and the resulting programs complicated.
In order to ease and improve the realization of GPGPU, NVIDIA Corp. released the “Compute Unified Device Architecture” (CUDA) for the GeForce 8800 Series GPU and beyond. CUDA is a hardware and software architecture for issuing and managing computations on the GPU as a data-parallel computing device without mapping them to a graphics API. CUDA also improves memory access efficiency.
Generally, each sequentially executed program, and each sequentially executed branch of a parallel program, is a so-called thread. Threads operate largely autonomously on their individual input data and provide output data. Input data are read from a buffer, and output data are written to a buffer. GPUs have two basic types of memories or buffers, since texture storage on GPUs is usually kept separate from other memory types in order to enable more efficient access. In the terminology of CUDA, which is used herein, these are the so-called global memory and texture memory. Global memory provides read and write access to all threads but is rather slow, while texture memory provides read-only access to threads but is fast. Data from the global memory can be copied into the texture memory. This structure is optimized for typical GPU tasks, such as texture mapping. A texture is a 2D pattern that is mapped to the surface of 3D objects.
CUDA provides multiple multiprocessors that perform the same computation task on different data units simultaneously. It also provides general DRAM memory addressing methods, giving programmers the flexibility to read and write data at any location in DRAM. Furthermore, it features a parallel data cache (on-chip shared memory) with very fast general read and write access, to support efficient data sharing. However, the DRAM and the cache are very limited in size and not sufficient for many tasks. Moreover, shared memory cannot be accessed by host functions, i.e. functions running on a CPU when a GPU works as a co-processor of the CPU. In this case, program and data have to be managed by the CPU first, before control passes to the GPU.
GPUs may operate on multiple data layers in parallel. Usually, the GPU has four data layers, which are normally used for YRGB data per pixel. For example, the four 8-bit elements of an input pixel can be stored as a 4D input vector and then processed independently and simultaneously.
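The four-layer idea can be illustrated with a minimal sketch. It assumes, purely for illustration, that one pixel is held as a 32-bit word with four 8-bit components, most significant first; the helper names `unpack_pixel`, `pack_pixel` and `saturating_add` are hypothetical and not part of any GPU API. On a GPU the four components would be processed simultaneously, whereas this CPU-side sketch only shows that each component is handled independently of the others.

```python
def unpack_pixel(pixel):
    """Split a 32-bit word into four 8-bit components (most significant first)."""
    return tuple((pixel >> shift) & 0xFF for shift in (24, 16, 8, 0))


def pack_pixel(components):
    """Inverse of unpack_pixel: combine four 8-bit components into one word."""
    word = 0
    for c in components:
        word = (word << 8) | (c & 0xFF)
    return word


def saturating_add(components, offset):
    """Apply the same operation to every component, clamping to 0..255.

    Each component is computed without reference to the others, which is
    what allows a GPU to process the four layers in parallel.
    """
    return tuple(min(255, max(0, c + offset)) for c in components)
```

For instance, `saturating_add(unpack_pixel(0x10203040), 10)` operates on the four components 16, 32, 48 and 64 separately.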
Videos are often encoded according to the MPEG-2 standard, which comprises segmenting a picture into macroblocks (MB) and sequentially processing lines of MBs. The respective decoding process is depicted in FIG. 1 and mainly comprises variable-length decoding 101, inverse scan 102, inverse quantization 103, inverse discrete cosine transform (iDCT) 104 and motion compensation (MC) 105. Motion compensation uses previously decoded pictures as reference. These are therefore stored in a frame memory 106. Finally, the decoded samples of the picture are output to a display.
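The order of the stages after variable-length decoding can be sketched for a single 8x8 block as follows. This is a simplified illustration under stated assumptions, not the MPEG-2 reference decoder: variable-length decoding is omitted, inverse quantization is reduced to multiplication by one uniform `scale` (the standard uses quantizer matrices and additional rules), the iDCT is a direct, unoptimized evaluation of the textbook formula, and motion compensation simply adds an already fetched prediction block. All function names are hypothetical.

```python
import math

N = 8  # block size


def zigzag_order(n=N):
    """(row, col) positions of an n x n block in zigzag scan order."""
    order = []
    for s in range(2 * n - 1):  # s = row + col, one anti-diagonal at a time
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        if s % 2 == 0:
            rows = reversed(rows)  # even diagonals run bottom-left to top-right
        order.extend((r, s - r) for r in rows)
    return order


def inverse_scan(coeffs):
    """Place 64 serialized coefficients back into an 8x8 block."""
    block = [[0] * N for _ in range(N)]
    for value, (r, c) in zip(coeffs, zigzag_order()):
        block[r][c] = value
    return block


def inverse_quantize(block, scale):
    """Simplified uniform inverse quantization (assumption, see lead-in)."""
    return [[v * scale for v in row] for row in block]


def idct_2d(block):
    """Direct 8x8 inverse DCT; slow but easy to follow."""
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0

    out = [[0.0] * N for _ in range(N)]
    for x in range(N):
        for y in range(N):
            total = 0.0
            for u in range(N):
                for v in range(N):
                    total += (c(u) * c(v) * block[u][v]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[x][y] = total / 4
    return out


def motion_compensate(residual, prediction):
    """Add the prediction to the residual and clip to the 8-bit range."""
    return [[min(255, max(0, round(p + r))) for p, r in zip(prow, rrow)]
            for prow, rrow in zip(prediction, residual)]


def decode_block(coeffs, scale, prediction):
    """Run the stages in the order inverse scan, IQ, iDCT, MC."""
    block = inverse_scan(coeffs)
    block = inverse_quantize(block, scale)
    residual = idct_2d(block)
    return motion_compensate(residual, prediction)
```

Each stage consumes only the output of the previous one, which is why the process is sequential per block, while different blocks can in principle pass through the same stage independently of each other.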
One problem is how to map a complex, sequential task such as video decoding to a combined CPU-GPU hardware platform, and particularly a CUDA-enabled platform with the above-described memory structure. While e.g. WO2004/095708 provides a general approach, it is still difficult to assign the different modules of such a complex process to different hardware processing units (CPU and GPU) such that an optimized balance of the CPU and GPU workloads is achieved. Ideally, time costs should be almost equal between CPU and GPU, i.e. neither the CPU nor the GPU should have to wait for results from the other.
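The balancing goal can be made concrete with a small sketch. It assumes a simple model that is not taken from the text above: the decoding stages run in a fixed order, the first `k` stages are assigned to the CPU and the remaining ones to the GPU, both units work in a pipelined fashion, and the time per picture is therefore governed by the slower of the two. The function name `best_split` and any per-stage costs passed to it are hypothetical.

```python
def best_split(cpu_cost, gpu_cost):
    """Find the stage boundary that best balances CPU and GPU time.

    cpu_cost[i] and gpu_cost[i] are the (assumed, measured elsewhere) times
    of stage i on the CPU and on the GPU. Returns (k, cpu_time, gpu_time),
    where stages 0..k-1 run on the CPU and stages k.. run on the GPU.
    """
    best = None
    for k in range(len(cpu_cost) + 1):
        cpu_time = sum(cpu_cost[:k])
        gpu_time = sum(gpu_cost[k:])
        # The slower unit determines the pipelined throughput.
        candidate = (max(cpu_time, gpu_time), k, cpu_time, gpu_time)
        if best is None or candidate < best:
            best = candidate
    _, k, cpu_time, gpu_time = best
    return k, cpu_time, gpu_time
```

With invented costs such as `cpu_cost = [4.0, 1.0, 2.0, 6.0, 5.0]` and `gpu_cost = [9.0, 0.5, 0.5, 1.5, 2.0]` for the five stages of FIG. 1, the function selects the boundary at which the two totals are closest to equal, so that neither unit idles for long while waiting for the other.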