The invention relates to parallel compiler technology, and more specifically to compiler methods for reducing control cost in parallel processes.
Parallel compilers are used to transform a computer program into parallel code that runs on multi-processor systems. Traditionally, software developers design the compiler to optimize code for a fixed type of hardware. A principal objective of the compiler is to organize the computations in the program so that sets of computational tasks in the program may be executed concurrently across multiple processors in the specified hardware architecture.
Parallel compiler technology extends across a broad range of parallel computer architectures. For example, the multi-processor architecture may employ shared memory in which each processor element shares the same memory space, or distributed memory in which each processor has a local memory.
One area of compiler and computer architecture research focuses on optimizing the processing of computer programs with loop nests. Many computational tasks in software applications are expressed in the form of a multi-nested loop, with two or more loops around a block of code called the loop body. The loop body contains a series of program statements, typically including operations on arrays whose elements are indexed by loop indices. Such loop nests are often written in a high-level programming language in which the iterations are ordered sequentially. The processing of the loop nest may be optimized by converting the loop nest code to parallel processes that can be executed concurrently.
One way to optimize loop nest code is to transform the code into a parallel form for execution on an array of processor elements. The objective of this process is to assign iterations in the loop nest to processor elements and schedule a start time for each iteration. The process of assigning iterations to processors and scheduling iterations is a challenging task. Preferably, each iteration in the loop nest should be assigned a processor and a start time so that each processor is kept busy without being overloaded.
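As an illustration of what assigning a processor and a start time to each iteration can look like, the following minimal sketch maps a two-deep loop nest over (i, j) onto a one-dimensional array of processor elements. The names map_iteration, schedule, and cluster_size, and the particular mapping chosen, are assumptions made for this example only; they are not taken from the invention.

```python
def map_iteration(i, j, cluster_size):
    """Return (processor, start_time) for iteration (i, j).

    Hypothetical mapping: consecutive values of i are grouped into
    clusters of `cluster_size`, one cluster per processor element.
    """
    processor = i // cluster_size           # processor element that owns row i
    local = i % cluster_size                # position of row i inside its cluster
    start_time = j * cluster_size + local   # one iteration per time step
    return processor, start_time


def schedule(n1, n2, cluster_size):
    """Map every iteration of an n1 x n2 loop nest to (processor, start_time)."""
    return {(i, j): map_iteration(i, j, cluster_size)
            for i in range(n1) for j in range(n2)}
```

Under this assumed mapping no two iterations share a processor and a start time, and each processor element starts exactly one iteration at every time step, which is the sense in which it is kept busy without being overloaded.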
Another challenging task is reducing the cost of controlling each processor element in a parallel array. In a naive approach, the processor may have to compute, on the basis of the current time, the vector of loop indices that describes the iteration that it is about to compute, together with many other quantities, such as memory addresses and tests of loop bounds. Due to the complexity of these computations, it is inefficient to re-compute them for each iteration.
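The cost of the naive approach can be seen in a small sketch. It assumes a hypothetical schedule in which a processor element p visits the cluster_size rows of its cluster in turn, and a row-major array with n2 elements per row starting at address base. All of these names and the schedule itself are illustrative assumptions.

```python
def naive_control(t, p, cluster_size, n2, base):
    """Recompute all control quantities from the current time t.

    Every call performs a modulo, a division, and two multiplications,
    which is the per-iteration overhead the text describes.
    """
    local = t % cluster_size        # position inside the cluster
    j = t // cluster_size           # inner loop index
    i = p * cluster_size + local    # outer loop index
    addr = base + i * n2 + j        # address of array element [i][j]
    return i, j, addr
```

For example, naive_control(5, 1, 4, 10, 100) yields the iteration (5, 1) and the address 151, at the price of repeating the same divisions and multiplications at every time step.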
The invention provides a method for exploiting temporal recursion to reduce the cost of control code generated in transforming a sequential nested loop program into a set of parallel processes mapped to an array of processors. The method is implemented in a parallel compiler process for transforming a nested loop program into a set of single loops, where each single loop is assigned to execute on a processor element in a parallel processor array.
The method obtains a mapping of iterations of a nested loop to processor elements in the array and a schedule of start times for initiating execution of the iterations on corresponding processor elements in the array. Based on this mapping and iteration schedule, the method generates code to compute iteration coordinates on a processor element for an iteration of the single loop based on values of the iteration coordinates for a previous iteration of the single loop.
In this context, the term "iteration coordinates" broadly encompasses different types of coordinates used to reference an iteration or set of iterations of the nested loop. In the implementation, a parallel compiler maps a high level nested loop in sequential form (e.g., C, Java, or Pascal code) into a set of single time loops, each mapped to a physical processor element. The parallel compiler maps the iterations to virtual processors, where each virtual processor is assigned a set of iterations, and maps clusters of virtual processors to physical processor elements. The iteration coordinates encompass local coordinates of a virtual processor in a cluster as well as quantities that are linearly related to these coordinates. Examples of the coordinates include the global virtual processor coordinates, and global iteration space coordinates (e.g., the iteration vector expressed in terms of the loop indices of the original loop nest). Linearly related quantities include memory addresses of array elements read or written in the loop body.
The parallel compiler generates code to compute loop indices and quantities linearly related to these indices based on previous values of these quantities on the same processor element. For loop indices and linearly dependent quantities (such as memory addresses), the parallel compiler selects an arbitrarily small time lag so as to minimize the storage cost. In this approach, the parallel compiler generates a decision tree that implements the computation of iteration coordinates from a value of the coordinates at a previous time.
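The idea of computing coordinates from their previous values can be sketched as follows. This is a minimal illustration under assumed conditions, not the generated code itself: a processor element p owns a cluster of cluster_size rows, visits them in turn with a time lag of one step, and addresses a row-major array with n2 elements per row starting at base. The function names and the one-level decision tree are hypothetical.

```python
def initial_state(p, cluster_size, n2, base):
    """State (local, j, addr) at time 0 on processor element p."""
    return 0, 0, base + p * cluster_size * n2


def next_state(local, j, addr, cluster_size, n2):
    """Derive the next state from the previous one with a compare and adds."""
    if local < cluster_size - 1:
        # Stay on the same j, move to the next row of the cluster.
        return local + 1, j, addr + n2
    # Wrap to the first row of the cluster and advance j.
    wrap_delta = 1 - (cluster_size - 1) * n2   # constant known at compile time
    return 0, j + 1, addr + wrap_delta


def run(p, cluster_size, n2, base, steps):
    """Return the sequence of states visited over `steps` time steps."""
    state = initial_state(p, cluster_size, n2, base)
    trace = []
    for _ in range(steps):
        trace.append(state)
        state = next_state(*state, cluster_size, n2)
    return trace
```

Each step costs one comparison and at most three additions, and only the previous state needs to be stored. In this sketch the address never has to be recomputed from the loop indices, because it is linearly related to them and its increment on each branch is a constant.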
The parallel compiler also generates code to test certain loop boundary conditions. These tests include tests to determine whether an iteration is at a cluster or tile edge. They also include a test to determine whether an iteration is within the bounds of the iteration space. The values of these tests are boolean values that are temporally periodic. A buffer may be used to propagate these periodic boolean values to subsequent iterations, thereby avoiding the need to repeat the test at every iteration.
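One way to picture such a buffer is a circular shift register holding one period of test results. The sketch below assumes a cluster-edge test whose value repeats every cluster_size time steps; the function names and the use of a software deque in place of a hardware buffer are assumptions for illustration.

```python
from collections import deque


def make_edge_buffer(cluster_size):
    """Precompute one period of the cluster-edge test."""
    return deque(local == cluster_size - 1 for local in range(cluster_size))


def next_edge_flag(buf):
    """Read the flag for the current time step and advance the buffer."""
    flag = buf[0]
    buf.rotate(-1)   # the value just read moves to the back of the buffer
    return flag
```

After the buffer is filled, each time step reads a stored boolean instead of evaluating a comparison on the iteration coordinates.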
The approach outlined above significantly reduces the cost of the control needed to compute loop indices, loop tests, and memory addresses. The parallel compiler generates control code built from inexpensive operations (e.g., a look up or an add operation) rather than more time-consuming arithmetic operations. This efficient form of code is advantageous both for applications in which the loop nest code is compiled to an existing processor array architecture and for applications in which the loop nest is transformed into optimized parallel code to be synthesized into a new processor array.
The parallel compiler may generate code to implement the loop tests with predicates, where operations in the loop body are guarded by the predicates. In this case, the values of the predicates are periodic boolean values propagated from a prior iteration, and the loop body may be synthesized into functional units that support predicated execution of the operations in the loop body. This use of predicates makes the mapping of the loop nest to a processor array more flexible because it can be done without the concern that the mapping will result in grossly inefficient control code. The test whether an iteration is scheduled to execute at a given time on a processor element is implemented efficiently with predicated execution of the loop body.
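A rough software analogue of predicated execution is sketched below. It assumes a loop body that accumulates one input value per time step and a predicate with a known period; run_time_loop, values, and predicate_period are hypothetical names, and the conditional expression stands in for a functional unit that suppresses the operation when its predicate is false.

```python
from collections import deque


def run_time_loop(values, predicate_period, steps):
    """Run a single time loop whose body is guarded by a periodic predicate."""
    predicates = deque(predicate_period)   # one period of predicate values
    total = 0
    for t in range(steps):
        pred = predicates[0]
        predicates.rotate(-1)              # propagate the predicate to later steps
        # Guarded operation: contributes nothing when the predicate is false.
        total += values[t] if pred else 0
    return total
```

The time loop itself contains no branch on whether an iteration is scheduled at time t. The predicate alone decides whether the body takes effect at each step.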
Further advantages and features of the invention will become apparent from the following detailed description and accompanying drawings.