1. Field
The disclosure relates generally to an improved data processing apparatus and method, and more specifically, to mechanisms for determining the most efficient loops to parallel in a group of instructions.
2. Description of the Related Art
Multimedia extensions (MMEs) have become one of the most popular additions to general-purpose microprocessors. Existing multimedia extensions can be characterized as single instruction multiple datapath (SIMD) units that support packed fixed-length vectors. The traditional programming model for multimedia extensions has been explicit vector programming using either (in-line) assembly or intrinsic functions embedded in a high-level programming language. Explicit vector programming is time-consuming and error-prone. A promising alternative to exploit vectorization technology is to automatically generate SIMD codes from programs written in standard high-level languages.
Although vectorization has been studied extensively for traditional vector processors decades ago, vectorization for SIMD architectures has raised new issues due to several fundamental differences between the two architectures. To distinguish between the two types of vectorization, the latter is referred to as SIMD vectorization, or SIMDization. One such fundamental difference comes from the memory unit. The memory unit of a typical SIMD processor bears more resemblance to that of a wide scalar processor than to that of a traditional vector processor. In the VMX instruction set found on PowerPC® microprocessors produced by International Business Machines Corporation of Armonk, N.Y., for example, a load instruction loads 16-byte contiguous memory from 16-byte aligned memory, ignoring the last 4 bits of the memory address in the instruction. The same applies to store instructions.
There has been a recent spike of interest in compiler techniques to automatically extract SIMD or data parallelism from programs. This upsurge has been driven by the increasing prevalence of SIMD architectures in multimedia processors and high-performance computing. These processors have multiple function units, for example, floating point units, fixed point units, integer units, etc., which can execute more than one instruction in the same machine cycle to enhance the uni-processor performance. The function units in these processors are typically pipelined.
Extracting data parallelism from an application is a difficult task for a compiler. In most cases, except for the most trivial loops in the application code, the extraction of parallelism is a task the application developer must perform. This typically requires a restructuring of the application to allow the compiler to extract the parallelism or explicitly code the parallelism using multiple threads, a SIMD intrinsic, or vector data types available in new programming models, such as OpenCL.
Before a compiler can determine if a portion of code can be parallelized and thereby perform data parallel compilation of the code, the compiler must prove that the portion of code is independent and no data dependencies between the portion of code and other code called by that code exist. Procedure calls are an inhibiting factor to data parallel compilation. That is, data parallel compilation is only possible when the compiler can prove that the code will correctly execute when data parallel optimizations are performed. When the code calls a procedure, subroutine, or the like, from different portions of code, object modules, or the like that are not visible to the compiler at the time of compilation, such data parallel compilation is not possible since the compiler cannot verify that the code will correctly execute when the data parallel optimizations are performed.