The present application relates generally to an improved data processing apparatus and method and more specifically to a parallel execution unit that extracts data parallelism at runtime.
Multimedia extensions (MMEs) have become one of the most popular additions to general-purpose microprocessors. Existing multimedia extensions can be characterized as Single Instruction Multiple Datapath (SIMD) units that support packed fixed-length vectors. The traditional programming model for multimedia extensions has been explicit vector programming using either (in-line) assembly or intrinsic functions embedded in a high-level programming language. Explicit vector programming is time-consuming and error-prone. A promising alternative is to exploit vectorization technology to automatically generate SIMD codes from programs written in standard high-level languages.
Although vectorization has been studied extensively for traditional vector processors decades ago, vectorization for SIMD architectures has raised new issues due to several fundamental differences between the two architectures. To distinguish between the two types of vectorization, the latter is referred to as SIMD vectorization, or SIMDization. One such fundamental difference comes from the memory unit. The memory unit of a typical SIMD processor bears more resemblance to that of a wide scalar processor than to that of a traditional vector processor. In the VMX instruction set found on certain PowerPC microprocessors (produced by International Business Machines Corporation of Armonk, N.Y.), for example, a load instruction loads 16-byte contiguous memory from 16-byte aligned memory, ignoring the last 4 bits of the memory address in the instruction. The same applies to store instructions.
There has been a recent spike of interest in compiler techniques to automatically extract SIMD or data parallelism from programs. This upsurge has been driven by the increasing prevalence of SIMD architectures in multimedia processors and high-performance computing. These processors have multiple function units, e.g., floating point units, fixed point units, integer units, etc., which can execute more than one instruction in the same machine cycle to enhance the uni-processor performance. The function units in these processors are typically pipelined.
Extracting data parallelism from an application is a difficult task for a compiler. In most cases, except for the most trivial loops in the application code, the extraction of parallelism is a task the application developer must perform. This typically requires a restructuring of the application to allow the compiler to extract the parallelism or explicitly coding the parallelism using multiple threads, a SIMD intrinsic, or vector data types available in new programming models, such as OpenCL.
Before a compiler can determine if a program loop can be parallelized, the compiler must prove that each pass through the programming loop is independent and no data dependencies between successive loops exist, i.e. one iteration of the loop does not depend on the value generated in a previous iteration of a loop or a current iteration of the loop does not generate a value that will cause a subsequent iteration of the loop to access incorrect data by writing or storing to a same memory location that a subsequent iteration accesses. Take the following loop as an example:
for (i=0; i<N; i++) {A[i] = foo(i, h, d, p, x);}
This loop sets A[i] to the return value from function “foo.” If the compiler cannot see the function “foo” at compile time, e.g., the function “foo” is in a different code that is called by the present program being compiled, the compiler has no choice but to assume the loop cannot be performed in parallel fashion and thus, generates scalar code for the loop, i.e. non-parallel code (sequential code). By in-lining the function “foo,” the compiler can examine the code and possibly discover the parallelism, but the codes size of the application may grow substantially with such in-lining. Even if the compiler can examine all the code within the loop, there are cases where it is impossible to determine if parallelizing the loop is safe, i.e. there are no dependencies between iterations of the loop. For example, consider the following code example for the function “foo”:
tmp1 = h[i] + d[i];if (tmp1 < x[tmp1])h[tmp1] = tmp1;return p[tmp1];
In this code segment, the contents of the array “h” are conditionally updated based on the data within the arrays “h”, “d”, and “x”. For this code, it is impossible for a compiler, or even the application developer, to guarantee that all iterations of the loop can be performed in parallel. The compiler and/or application developer therefore, can only perform the loop as a scalar operation, even though for certain data values (or perhaps all data values), the update of array “h” in an iteration of the loop does not affect the results of subsequent loop iterations.