1. Technical Field
The present application generally relates to an improved system and method for compiling source programs to a machine language representation. More particularly, the present application is directed to a system and method for compiling scalar code for a data-parallel execution engine, such as a SIMD execution engine.
2. Description of Related Art
Contemporary high-performance processor designs provide data-parallel execution engines that increase the performance available to application programs through single-instruction multiple-data (SIMD) parallelism. SIMD instructions are provided by a variety of instruction set extensions, such as the IBM Power Architecture™ Vector Media extensions (VMX).
In high-level languages, data-parallel execution can be achieved by programming with intrinsics, a form of inline assembly in which assembly instructions are expressed as pseudo-function calls. Because intrinsics are processed by the compiler, the compiler can still provide important functions such as register allocation using advanced register allocation techniques. In addition, data-parallel execution may be achieved by vectorizing compilers that detect code blocks which can be transformed to exploit the data-parallel SIMD instructions.
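As a minimal sketch of the intrinsics style described above, the following C fragment adds two arrays four elements at a time. The function name add_arrays and its parameters are illustrative assumptions, not taken from the present application. The AltiVec/VMX path assumes a compiler that defines __ALTIVEC__ and provides altivec.h; on other targets a scalar loop is used so that the fragment still compiles and runs.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ALTIVEC__)
#include <altivec.h>
#endif

/* Hypothetical helper: element-wise out[i] = a[i] + b[i].
 * n must be a multiple of 4. On the AltiVec path, a, b and out
 * must also be 16-byte aligned, as vec_ld/vec_st require. */
void add_arrays(const float *a, const float *b, float *out, size_t n)
{
    assert(n % 4 == 0);
#if defined(__ALTIVEC__)
    for (size_t i = 0; i < n; i += 4) {
        /* Each intrinsic looks like a function call but stands for
         * a vector instruction; the compiler picks the registers. */
        vector float va = vec_ld(0, a + i);
        vector float vb = vec_ld(0, b + i);
        vec_st(vec_add(va, vb), 0, out + i);
    }
#else
    /* Scalar fallback for targets without AltiVec/VMX. */
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
#endif
}
```

Note that the programmer names no registers in the intrinsic path; register allocation is left to the compiler.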
Using data-parallel SIMD instructions to increase compute performance of a microprocessor has yielded significant increases in speed for many applications. However, extensive use of data-parallel execution engines has also revealed a number of shortcomings of the implementation and use of microprocessors using such extensions. Specifically, these shortcomings relate to the cost of combining scalar and vector data in computations, such as when a vector stored in a vector register needs to be scaled by a scalar value stored in a scalar register, and to the cost of implementing separate scalar and data-parallel SIMD execution units.
In accordance with prevailing implementations, to computationally combine vector and scalar data, scalar data stored in scalar registers is transferred from the scalar registers to vector registers before computations are performed. Typically, the transfer of data from one class of register file to another is performed by storing the data to memory using a store instruction from the first class of register file, and reloading it into the second class of register file using a load instruction to the second class of register file. Such indirect transfers of data are performed because, due to synchronization requirements, direct transfer between different execution units is complex and expensive.
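The store-and-reload transfer described above can be modeled in portable C. This is only an illustrative sketch: the vec4f type, the LANES constant, and the functions splat_via_memory and scale_array are hypothetical names that stand in for a four-lane vector register and the instructions operating on it, and do not correspond to any particular instruction set.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LANES 4  /* assumed number of elements per vector register */

/* Hypothetical software model of one vector register. */
typedef struct { float lane[LANES]; } vec4f;

/* Moves a scalar into every lane of a vector "register" by way
 * of memory, mirroring the indirect transfer path. */
static vec4f splat_via_memory(float s)
{
    float staging[LANES];  /* memory used as the transfer buffer */
    vec4f v;

    /* Step 1: store from the scalar register file to memory. */
    for (size_t i = 0; i < LANES; i++)
        staging[i] = s;

    /* Step 2: load from memory into the vector register file. */
    memcpy(v.lane, staging, sizeof staging);
    return v;
}

/* Scales n floats in place by s; n must be a multiple of LANES. */
void scale_array(float *data, size_t n, float s)
{
    assert(n % LANES == 0);
    vec4f vs = splat_via_memory(s);  /* one transfer, reused below */
    for (size_t i = 0; i < n; i += LANES)
        for (size_t j = 0; j < LANES; j++)
            data[i + j] *= vs.lane[j];  /* lane-wise multiply */
}
```

In this model the memory round trip occurs once per scalar value, before any vector arithmetic can begin, which is the overhead the passage above attributes to combining scalar and vector data.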
Moreover, when implementing separate scalar and data-parallel SIMD execution units, functionality is duplicated between the scalar execution units and the data-parallel SIMD execution units. As a result, a microprocessor may contain a first set of integer execution units to perform integer scalar operations and a second set of integer execution units to perform integer data-parallel operations. Similarly, the microprocessor may have a first set of floating-point execution units to perform scalar floating-point operations and a second set of floating-point execution units to perform floating-point data-parallel operations. When the parallelism between scalar and data-parallel execution units cannot be exploited by applications, this duplication of execution units disadvantageously and needlessly leads to increased chip area, power dissipation, design complexity and verification cost.