1. Field of the Invention
This invention relates to Single Instruction Multiple Data (SIMD) technology, and particularly to a method for handling SIMD architecture restrictions through data reshaping, padding, and alignment.
2. Description of Background
Single Instruction Multiple Data (SIMD) is a set of special operations supported by various processors to perform a single operation on several numbers simultaneously. SIMD support enables compilers to exploit fine-grained parallelism by vectorizing loops that perform a single operation on multiple elements in a data set. Although vectorization has been studied for traditional vector processors, effective SIMDization poses several challenges because SIMD architectures are more restrictive. SIMD vectorization (also known as SIMDization) is implemented in IBM's XL product compiler, and thus supports multiple programming languages (e.g., C, C++, and Fortran) and multiple target machines (e.g., SPU, VMX, and BG/L).
Many SIMD units support loads and stores only from vector-length-aligned memory, and each memory access covers a contiguous chunk of vector-length bytes.
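To make this restriction concrete, the following minimal sketch computes which aligned chunk a given address falls into. It assumes a 16-byte vector length; the names VLEN, chunk_base, and chunk_offset are hypothetical and are introduced here for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed vector length in bytes; 16 is used here for illustration. */
#define VLEN 16u

/* The masking below is only valid when VLEN is a power of two. */
static_assert((VLEN & (VLEN - 1u)) == 0, "VLEN must be a power of two");

/* Base address of the aligned VLEN-byte chunk that contains addr. */
static uintptr_t chunk_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(VLEN - 1u);
}

/* Byte offset of addr inside its chunk; 0 means addr is vector-aligned. */
static unsigned chunk_offset(uintptr_t addr)
{
    return (unsigned)(addr & (uintptr_t)(VLEN - 1u));
}
```

Under these assumptions, an access requested at address 0x1004 would touch the chunk starting at 0x1000, with the requested data beginning 4 bytes into that chunk.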
There are several issues related to automatic generation of SIMD code (referred to as SIMDization). These issues are listed below.
A first issue concerns alignment. Accessing a block of memory at a location that is not aligned on a natural vector-size boundary is often prohibited or incurs a heavy performance penalty. To handle the alignment problem, techniques such as loop peeling, loop versioning, and static and dynamic alignment detection are typically used.
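As an illustration of dynamic alignment detection combined with loop peeling, the following sketch computes how many leading iterations would have to run as scalar code before the remaining accesses start on a vector boundary. It is not taken from any particular compiler; the 16-byte vector length and the names VLEN and peel_count are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed vector length in bytes. */
#define VLEN 16u

/* Number of leading elements to peel off into scalar code so that
   &a[peel] lies on a VLEN-byte boundary. The result is capped at n,
   so a short array may be processed entirely by the peeled iterations. */
static size_t peel_count(const float *a, size_t n)
{
    uintptr_t mis = (uintptr_t)a & (uintptr_t)(VLEN - 1u);

    /* The array itself must be aligned to its element size; otherwise
       no element of it ever lands on a vector boundary. */
    assert(mis % sizeof(float) == 0);

    if (mis == 0) {
        return 0; /* already aligned: nothing to peel */
    }
    size_t peel = (size_t)(VLEN - mis) / sizeof(float);
    return peel < n ? peel : n;
}
```

A loop-versioning scheme could use the same test differently: when peel_count returns 0, control would branch to a fully vectorized version of the loop, and otherwise to a version that handles the misalignment.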
A second issue concerns out-of-boundary memory access and false sharing. For instance, vector loads and stores in the first and last few iterations of a loop could access memory outside the loop's own data boundary, even though the loop operates only on memory locations within that boundary. This can cause a memory violation, since memory accesses beyond a memory segment are required to generate a memory violation. It can also cause non-deterministic behavior in multithreaded environments. One typical way of handling this issue is to add a prologue loop and an epilogue loop that check the boundary and process the first and last few iterations.
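The prologue and epilogue approach can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the vector length is taken to be 16 bytes, the function name scale is hypothetical, and the inner loop over one chunk merely stands in for a single aligned vector load, multiply, and store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VLEN 16u                         /* assumed vector length in bytes */
#define PER_VEC (VLEN / sizeof(float))   /* elements per vector */

/* Multiply a[0..n-1] by s without touching memory outside that range. */
static void scale(float *a, size_t n, float s)
{
    size_t i = 0;

    assert((uintptr_t)a % sizeof(float) == 0);

    /* Prologue: scalar iterations until &a[i] is VLEN-aligned. */
    while (i < n && ((uintptr_t)&a[i] & (uintptr_t)(VLEN - 1u)) != 0) {
        a[i] *= s;
        i++;
    }

    /* Main body: only whole aligned chunks that lie entirely inside
       the array. Each inner loop stands in for one vector operation. */
    for (; n - i >= PER_VEC; i += PER_VEC) {
        for (size_t k = 0; k < PER_VEC; k++) {
            a[i + k] *= s;
        }
    }

    /* Epilogue: scalar iterations for the leftover tail elements. */
    for (; i < n; i++) {
        a[i] *= s;
    }
}
```

Because the main body never starts a chunk before the first element or ends one after the last, neighboring data, which in a multithreaded program might belong to another thread, is neither read nor written by this routine.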
A third issue concerns contiguous memory accesses with vector-length bytes. For example, with 16-byte vectors, a load or store instruction accesses a 16-byte-aligned block of memory, ignoring the last 4 bits of the memory address given in the instruction.
A fourth issue involves isomorphic statements. For instance, for an array of structures, a loop can be SIMDized when the statements in the loop are isomorphic and operate on all fields of a structure. However, when the statements perform a mixture of operations, the loop cannot be SIMDized.
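The difference between isomorphic and mixed statements can be illustrated with the following sketch. The structure Vec4 and the functions add_all and mixed are hypothetical names chosen for this example, and the structure is assumed to hold exactly four floats with no padding, so that one element fills one 16-byte vector.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    float x, y, z, w;
} Vec4;

/* Assumption: one Vec4 occupies exactly four float lanes. */
static_assert(sizeof(Vec4) == 4 * sizeof(float), "Vec4 assumed unpadded");

/* Isomorphic statements: every field receives the same operation, so
   the four statements map naturally onto one 4-wide vector add. */
static void add_all(Vec4 *p, size_t n, float c)
{
    for (size_t i = 0; i < n; i++) {
        p[i].x += c;
        p[i].y += c;
        p[i].z += c;
        p[i].w += c;
    }
}

/* Mixed operations: each field receives a different operation, so no
   single vector instruction covers all four lanes of an element. */
static void mixed(Vec4 *p, size_t n, float c)
{
    for (size_t i = 0; i < n; i++) {
        p[i].x += c;
        p[i].y *= c;
        p[i].z -= c;
        p[i].w /= c;
    }
}
```

Both functions are written as plain scalar C; the sketch only shows the statement shapes involved and makes no claim about how a particular compiler treats either loop.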
Considering the limitations of the aforementioned methods, it is clear that there is a need for an efficient method for handling SIMD architecture restrictions through data reshaping, padding, and alignment.