1. Field of the Invention
The present application relates generally to data processing and, in particular, to compilation of source code to generate executable code. Still more particularly, the present application relates to a compiler method for employing multiple autonomous synergistic processors to operate simultaneously on long vectors.
2. Description of the Related Art
A single instruction multiple data (SIMD) data processing system is a computer that can perform a single operation on multiple sets of data. For example, a SIMD data processing system may add or multiply several sets of numbers at the same time. Performing a single operation on multiple sets of data in parallel is referred to as “SIMDization” or “vectorization.” The term SIMDization is used when referring to “short” vectors, such as those that fit into a 128-bit wide register in a processor. Vectorization is a broader term that is typically used to refer to longer vectors and may include the shorter vectors. Vectorization is typically used to operate on two or more groups of data or array elements at the same time, for example in multimedia encoding and rendering as well as in scientific applications. Hardware registers are loaded with numerical data, and the computation is performed on all data in a register, or even a set of registers, simultaneously.
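By way of illustration only, the register-wide operation described above may be modeled in portable C. The following is a minimal sketch and not an implementation for any particular hardware: the names vec128_t, vec_add and add_arrays are hypothetical, the lane count of four 32-bit integers per 128-bit register is an assumption, and the lanes are processed by an ordinary scalar loop that merely stands in for a single SIMD instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LANES 4 /* assumed: four 32-bit integers fill one 128-bit register */

/* Hypothetical stand-in for a 128-bit hardware vector register. */
typedef struct {
    int32_t lane[LANES];
} vec128_t;

/* Models one SIMD add: the same operation is applied to every lane. */
static vec128_t vec_add(vec128_t a, vec128_t b)
{
    vec128_t r;
    for (int i = 0; i < LANES; i++)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

/* Adds two arrays one register-width at a time.
 * n must be a multiple of LANES in this simplified sketch. */
static void add_arrays(const int32_t *x, const int32_t *y,
                       int32_t *out, size_t n)
{
    assert(n % LANES == 0);
    for (size_t i = 0; i < n; i += LANES) {
        vec128_t a, b, r;
        memcpy(a.lane, x + i, sizeof a.lane);   /* load one "register" */
        memcpy(b.lane, y + i, sizeof b.lane);
        r = vec_add(a, b);                      /* one operation, four results */
        memcpy(out + i, r.lane, sizeof r.lane); /* store the result */
    }
}
```

In this sketch, an array of eight integers is processed in two steps of four elements each, rather than in eight scalar additions.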
In a computer processor that has a principal processor and multiple ancillary processors capable of executing SIMD instructions, a developer may write code for execution on the ancillary processors to take advantage of their SIMD execution characteristics, and may write code for the principal processor to manage data transfer to and from, and synchronization with, the ancillary processors. The code that executes on the principal processor runs sequentially. In other words, none of its computations are performed in parallel. This type of code is referred to as sequential code.
Programmers write code for the ancillary processor to execute SIMD instructions by using language-provided intrinsics or built-in functions, or by employing the automatic vectorization features of a compiler. A SIMD instruction is an instruction that operates on multiple data elements in parallel. Examples of such instructions include those which operate on 2 double precision data elements, 4 integer data elements or 8 byte data elements. Exploiting SIMD parallelism requires the ability to detect at compile time that subsets of data may be operated on in parallel, to determine when these types of analyses are performed, and to generate code that uses SIMD instructions.
Exploiting this parallelism in processing data on a single SIMD accelerator is a complex task for a programmer and requires a high degree of manual intervention. An example of a SIMD accelerator is a synergistic processor element, which is found in a multi-core processor such as the Cell Broadband Engine™ processor, which is available from International Business Machines Corporation. “Cell Broadband Engine” is a trademark of Sony Computer Entertainment, Inc. in the United States, other countries, or both, and is used under license therefrom. Furthermore, exploiting this parallelism across multiple SIMD accelerators is even more difficult and requires the programmer to be aware of more than just the available SIMD parallelism. The programmer must also be concerned with the placement of the code on each of the SIMD accelerators, synchronization of the execution of this code, and the placement and fetching of the data to the appropriate accelerators to ensure the highest performance execution.
Current approaches to exploiting SIMD parallelism require the programmer to use intrinsics. Intrinsics are built-in functions provided by the language that allow the user to invoke vector instructions directly, but they limit the exploitation to a single ancillary accelerator or processor. To harness the parallelism across multiple ancillary SIMD accelerators, one could insert directives called pragmas, which are source-level instructions to the compiler. Alternatively, coarse-grained auto-parallelization techniques, such as those at the loop level, can be used. Both of these approaches, however, have the potential to introduce the overhead of scalar execution on the ancillary accelerators or processors, since the typical parallelizable loop may contain more than just strictly vectorizable computation. A loop is a repetition of instructions in a program. The manner in which vectorization is performed in the above-described approaches requires that a parallel loop first be outlined. After outlining, vectorization opportunities are detected within that particular loop. The resulting loop is then prepared for execution across the principal processor and all the ancillary processors. Vectorization opportunities within these outlined loops are constrained by any limitations imposed by known automatic SIMDization techniques. Specifically, this type of approach confines the generation of SIMD code to stride-1 array accesses, that is to say, accesses in which successive array elements are contiguous, as in a(i), a(i+1), a(i+2), as opposed to non-contiguous, as in a(i), a(i+3), a(i+6).
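The distinction between stride-1 and non-stride-1 accesses may be illustrated with two simple C loops. This is a sketch under stated assumptions only: the function names sum_stride1 and sum_strided are hypothetical, and zero-based C indexing is used in place of the a(i) notation above.

```c
#include <assert.h>
#include <stddef.h>

/* Stride-1 access: reads a[0], a[1], a[2], ... from adjacent memory,
 * so several consecutive elements can fill one vector register. */
static float sum_stride1(const float *a, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Strided access: with stride 3 this reads a[0], a[3], a[6], ...
 * The elements are not adjacent in memory, which is the case the
 * approaches described above do not handle. */
static float sum_strided(const float *a, size_t n, size_t stride)
{
    assert(stride > 0); /* a zero stride would never terminate */
    float total = 0.0f;
    for (size_t i = 0; i < n; i += stride)
        total += a[i];
    return total;
}
```

For a nine-element array, the first loop touches all nine contiguous elements, whereas the second loop with a stride of 3 touches only the elements at indices 0, 3 and 6.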
In most multi-core processors, a programmer currently has to create a new application or modify an existing one to use the different execution units efficiently. The programmer manually creates or transforms the application such that the principal processor element (PPE) provides the control functions and the multiple synergistic processor elements (SPEs) operate in parallel on all the numeric or compute-intensive sections of the application.
The developer typically writes code to use each SPE for frequently repeated tasks, taking advantage of the SIMD instruction set either through the use of SIMD intrinsics or by relying on the automatic support in a SIMDizing compiler. Programmers typically write code in which the PPE controls and sets up the global synchronization. The operating system runs on the PPE and allocates resources, controls devices, and provides system services. Programmers write code to use the PPE for less frequently repeated tasks with more random data access.
However, currently the programmer writes the SPE code and separate PPE code manually. The programmer specifically develops SPE code so that the SPE executable code is correctly synchronized with the PPE by linkage or runtime library code.
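The manual division of labor described above may be sketched in C as follows. This is an illustrative model only and makes several assumptions: the names chunk_t, chunk_for_worker, scale_chunk and scale_vector are hypothetical, the "workers" stand in for ancillary processors, and the control loop runs the workers one after another instead of launching them in parallel. Data transfer and synchronization, which real hand-written PPE and SPE code would have to perform, are omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open index range [begin, end) assigned to one worker. */
typedef struct {
    size_t begin;
    size_t end;
} chunk_t;

/* Control-side logic: split n elements as evenly as possible among
 * the workers; the first (n % workers) workers get one extra element. */
static chunk_t chunk_for_worker(size_t n, size_t workers, size_t id)
{
    assert(workers > 0 && id < workers);
    size_t base = n / workers;
    size_t extra = n % workers;
    chunk_t c;
    c.begin = id * base + (id < extra ? id : extra);
    c.end = c.begin + base + (id < extra ? 1 : 0);
    return c;
}

/* Compute-side kernel: the work one accelerator would do on its chunk. */
static void scale_chunk(float *data, chunk_t c, float factor)
{
    for (size_t i = c.begin; i < c.end; i++)
        data[i] *= factor;
}

/* Control-side driver. A real program would dispatch each chunk to a
 * separate accelerator and wait for completion; here the chunks are
 * simply processed in sequence. */
static void scale_vector(float *data, size_t n, size_t workers, float factor)
{
    for (size_t id = 0; id < workers; id++)
        scale_chunk(data, chunk_for_worker(n, workers, id), factor);
}
```

Even in this reduced form, the partitioning arithmetic and the kernel must be written and kept consistent by hand, which is the burden on the programmer noted above.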