The present application relates generally to an improved data processing apparatus and method and more specifically to mechanisms for insertion of operation-and-indicate instructions for optimized Single Instruction Multiple Data (SIMD) code.
Multimedia extensions (MMEs) have become one of the most popular additions to general-purpose microprocessors. Existing multimedia extensions can be characterized as Single Instruction Multiple Data (SIMD) path units that support packed fixed-length vectors. The traditional programming model for multimedia extensions has been explicit vector programming using either (in-line) assembly or intrinsic functions embedded in a high-level programming language. Explicit vector programming is time-consuming and error-prone. A promising alternative is to exploit vectorization technology to automatically generate SIMD codes from programs written in standard high-level languages.
Although vectorization has been studied extensively for traditional vector processors decades ago, vectorization for SIMD architectures has raised new issues due to several fundamental differences between the two architectures. To distinguish between the two types of vectorization, the latter is referred to as SIMD vectorization, or SIMDization. One such fundamental difference comes from the memory unit. The memory unit of a typical SIMD processor bears more resemblance to that of a wide scalar processor than to that of a traditional vector processor. In the VMX instruction set found on certain PowerPC microprocessors (produced by International Business Machines Corporation of Armonk, N.Y.), for example, a load instruction loads 16-byte contiguous memory from 16-byte aligned memory, ignoring the last 4 bits of the memory address in the instruction. The same applies to store instructions.
There has been a recent spike of interest in compiler techniques to automatically extract SIMD parallelism from programs. This upsurge has been driven by the increasing prevalence of SIMD architectures in multimedia processors and high-performance computing. These processors have multiple function units, e.g., floating point units, fixed point units, integer units, etc., which can execute more than one instruction in the same machine cycle to enhance the uni-processor performance. The function units in these processors are typically pipelined.
In performing compiler based transformations of loops to extract SIMD parallelism, it is important to ensure array reference safety. That is, during compilation of source code for execution by a SIMD architecture, the compiler may perform various optimizations including determining portions of code that may be parallelized for execution by the SIMD architecture. This parallelization typically involves vectorizing, or SIMD vectorizing, or SIMDizing, the portion of code. One such optimization involves the conversion of branches in code to predicated operations in order to avoid the branch misprediction penalties encountered by pipelined function units. This optimization involves converting conditional branches in source code to predicated code with predicate operations using comparison instructions to set up Boolean predicates corresponding to the branch conditions. Thus, the predicates, which now guard the instructions, either execute or nullify the instruction according to the predicate's value, a process called commonly referred to as “if-conversion.”
In short, predicated code generated by traditional if-conversion generates straight-line code by executing instructions from two mutually exclusive execution paths, suppressing instructions corresponding to one of the two mutually exclusive paths. It is quite common for one of these mutually exclusive execution paths to generate a variety of undesirable erroneous execution effects and, in particular, illegal memory references, when this path does not correspond to the chosen path. Accordingly, “if-conversion” might result in erroneous executions if it were not for the nullification of non-selected predicated instructions in accordance with “if-conversion”, and in particular for memory reference instructions in if-converted code.
Gschwind et al., “Synergistic Processing in Cell's Multicore Architecture”, IEEE Micro, March 2006 introduces the concept of data-parallel if-conversion which is being increasingly widely adopted for compilation for data-parallel SIMD architectures. Unlike traditional scalar if-conversion, data-parallel if-conversion typically targets code generation with data-parallel select as supported by many SIMD architectures, as described in co-pending and commonly assigned U.S. Patent Application Publication No. US20080034357A1, filed Aug. 4, 2006, entitled “Method and Apparatus for Generating Data Parallel Select Operations in a Pervasively Data Parallel System” to Gschwind et al., because data-parallel SIMD architectures typically do not offer predicated execution.
Thus, traditional if-conversion guards each instruction with a predicate indicating the execution or non-execution of each instruction corresponding to one or another of mutually exclusive paths. The data-parallel if-conversion with data-parallel select described in the Gschwind et al. patent application publication executes instructions from both paths without a predicate and uses data-parallel select instructions to select a result corresponding to an unconditionally executed path in the compiled code exactly when it corresponds to a taken path in the original source code. Thus, while data-parallel select can be used to implement result selection based on taken-path information, data-parallel if-conversion with data-parallel select is not adapted to nullify instructions. This is because a vector instruction may have one part of its result vector selected when another part of its result vector is not selected, making traditional instruction predication impractical.
The differences between traditional if-conversion and data parallel if-conversion using data-parallel select operations may be more easily understood with regard to the following example code, provided in QPX Assembly language:
a[i]=b[i]/=0? 1/b[i]: DEFAULT;
Traditional if conversion would implement this code in a form as follows:
; init FRZEROS = register initialized with 0.0; init FRDEFAULT = register preloaded with the fault of DEFAULTLFDFBI = b[i]FCMPEQpredicate, FRZERO, FBIFRE<NOT predicate>FAI, FBI<===== conditionallyexecutedif predicate indicates that b[i] /= 0, and suppress result andexceptions if b[i]==0FMR<predicate>FAI, FRDEFAULT<===== conditionallyexecutedif predicate indicates that b[i]==0, and suppress move if b[i] /=0QVSTFDa[i] = FAIAs can be seen, if the predicate condition indicates that the instructions should not be executed, then the result and all associated side effects, such as exceptions, are suppressed. FRE will either generate a single result, in which case it is written and an exception is raised if appropriate, or does not write a single result, in which case the result is not written and no exception is raised.
Consider now the code generated by SIMD vectorization and data-parallel if conversion by exploiting data parallel select, e.g., on an exemplary 4 element vector:
; init QRZEROS = vector with 0.0 elements; init QRDEFAULTS = vector with DEFAULT elementsQVLFDQBI = b[i:i+3]QVFREQTRE, QBI     <===== may raise spurious divide byzero if vector instructions are allowed to raise exceptionsQVFCMPQTC, QRZEROS, QBIQVFSELAQI, QTRE, QRDEFAULTS, QTCQVSTFDa[i:i+3] = QAIIn accordance with this example, the QVFRE instruction is not predicated and always writes a result. As noted above, the FRE instruction will either write its result, because it generates a single result in which case it is written and an exception is raised if appropriate, or it does not write a single result, in which case the result is not written and no exception is raised. Unlike the FRE instruction, the QVFRE instruction may generate 0, 1, 2, 3, or 4 results to be written back to the vector a[i:i+3]. However, the knowledge on whether a result will be used is not available to the QVFRE instruction and so, it cannot generate the right set of exceptions.
Thus, with data parallel if-conversion being used by compilers to generate SIMDized code for execution in a SIMD processor architecture, exceptions are suppressed to avoid spurious errors. However, it is important to be able to preserve application behavior, even exception generation.