1. Field of the Invention
The invention relates generally to the field of processor chips and specifically to the field of single-instruction multiple-data (SIMD) processors. More particularly, the present invention relates to loading vector registers in a SIMD processor.
2. Description of the Background Art
Video compression algorithms use a scanning method to convert the two-dimensional output of the DCT into a one-dimensional sequence of values. JPEG, MPEG-1, MPEG-2, MPEG-4, H.263, and H.264 all use multiple types of scan operation for compression and the corresponding inverse scans for decompression. For example, for video decompression MPEG-4 has three types of inverse scan: alternate-horizontal scan, alternate-vertical scan, and zigzag scan. Each of these requires a different mapping of an 8-by-8 two-dimensional (2-D) array of element values. The newer H.264 standard, which is also known as MPEG-4 Part 10, additionally requires scan mappings for block sizes of 4-by-4 and 2-by-2.
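As an illustration of the scan mapping described above, the following minimal sketch generates the conventional zigzag visiting order for an n-by-n block and applies it in both directions. The function names (zigzag_order, scan, inverse_scan) are hypothetical and chosen only for this example; the alternate-horizontal and alternate-vertical scans are defined by fixed tables in the respective standards and are not reproduced here.

```python
def zigzag_order(n=8):
    """Return the (row, col) positions visited by a zigzag scan of an n-by-n block."""
    order = []
    for s in range(2 * n - 1):  # s = row + col identifies one anti-diagonal
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        # Even-numbered diagonals are walked with the row index decreasing,
        # odd-numbered diagonals with the row index increasing.
        if s % 2 == 0:
            rows = reversed(rows)
        order.extend((r, s - r) for r in rows)
    return order


def scan(block):
    """Convert a 2-D block into a 1-D list in zigzag order (compression side)."""
    return [block[r][c] for r, c in zigzag_order(len(block))]


def inverse_scan(values, n=8):
    """Place a 1-D list of n*n values back into a 2-D block (decompression side)."""
    block = [[0] * n for _ in range(n)]
    for value, (r, c) in zip(values, zigzag_order(n)):
        block[r][c] = value
    return block
```

Because the mapping is a fixed permutation of element positions, the same routine covers the 8-by-8, 4-by-4, and 2-by-2 block sizes by changing n; a different scan type would only substitute a different position table.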
Another common operation in video compression and decompression standards is the matrix transpose. Most video compression and decompression standards require an 8-by-8 Discrete Cosine Transform (DCT) and its inverse (IDCT), each of which is typically implemented as eight one-dimensional (1-D) DCT/IDCT operations, followed by a matrix transpose, followed by another eight 1-D DCT/IDCT operations. The H.264 standard also requires DCT of 4×4 and 2×2 block sizes.
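The row-transform, transpose, row-transform structure described above can be sketched as follows. This is a floating-point illustration only, assuming an orthonormal DCT-II as the 1-D transform; the names dct_1d, transpose, and dct_2d are hypothetical, and the final transpose is an assumption added here so that the output has the same orientation as the input (an implementation may fold it into a later step).

```python
import math


def dct_1d(x):
    """Orthonormal 1-D DCT-II of a sequence (assumed transform for this sketch)."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                    for i in range(n))
        out.append(scale * total)
    return out


def transpose(m):
    """Swap rows and columns of a square matrix given as a list of lists."""
    return [list(col) for col in zip(*m)]


def dct_2d(block):
    """2-D DCT built from 1-D passes with a transpose between them."""
    first_pass = [dct_1d(row) for row in block]                   # 1-D DCT on each row
    second_pass = [dct_1d(row) for row in transpose(first_pass)]  # transpose, then 1-D DCT again
    return transpose(second_pass)                                 # restore row/column orientation
```

For an 8-by-8 block each pass performs eight 1-D transforms, so the transpose between the passes is executed once per block and is part of the cost of every transform.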
SIMD processors provide a speed advantage by processing N pixel values at the same time. While it is well understood how to use SIMD instructions for data processing tasks such as FIR filter implementation, it is not obvious whether, or how, a SIMD architecture could also be applied to speed up matrix transpose and scan conversion operations.
Current implementations either use a software method with little or no parallelism to implement scan conversion and matrix transpose, or they use a dedicated hardware block to accelerate these operations. Implementing these operations in software would require about 128 instructions on a load-store RISC architecture such as the MIPS-4000. The software approach lacks the performance needed to implement the compression and decompression algorithms at the resolutions and frame rates required by digital TV and set-top boxes.
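One plausible reading of the figure of about 128 instructions, offered here as an assumption rather than a derivation given in the text, is that each of the 64 elements of an 8-by-8 block needs one load and one store when it is moved individually. The sketch below models that element-by-element transpose and counts the memory operations; scalar_transpose is a hypothetical name, and loop and address-calculation overhead is ignored.

```python
def scalar_transpose(src, n=8):
    """Transpose an n-by-n block stored row-major in a flat list, one element at a time.

    Returns the transposed block and the number of load/store operations
    that a load-store machine would issue (assumed: one load and one store
    per element, with loop and addressing overhead not counted).
    """
    dst = [0] * (n * n)
    loads = 0
    stores = 0
    for r in range(n):
        for c in range(n):
            value = src[r * n + c]   # one load per element
            loads += 1
            dst[c * n + r] = value   # one store per element
            stores += 1
    return dst, loads + stores
```

With n = 8 the count is 64 loads plus 64 stores, that is, 128 memory operations with no two elements handled in parallel.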
Dedicated hardware transpose units have also been used alongside a SIMD processor to implement this operation. A dedicated hardware implementation requires separate blocks for scan conversion and matrix transpose, and it lacks the flexibility to support new compression algorithms such as H.264 that require different block sizes.