The invention relates to methods and environments for mapping source code on a target architecture, preferably a parallel architecture.
Before the actual mapping of source code on a target architecture, one often transforms the original first code into a second code, because one expects the execution of said second code on said architecture to perform better. Said source code often comprises loops and loop nests. Part of said code transformations are loop transformations, which are the topic of research in many papers and publications.
Said source codes typically contain many loops and loop nests. For data transfer and storage optimization, it is essential that global loop transformations are performed, over artificial boundaries between subsystems (procedures). It is exactly at these boundaries that the largest memory buffers are usually needed [K.Danckaert, K.Masselos, F.Catthoor, H.De Man, C.Goutis, Strategy for power efficient design of parallel systems, IEEE Trans. on VLSI Systems, Vol.7, No.2, pp.258-265, June 1999.]. To optimize these buffers, global (inter-procedural) transformations are necessary. Most existing transformation techniques can only be applied locally, to one procedure or even one loop nest [N.Passos, E.Sha, Achieving full parallelism using multidimensional retiming, IEEE Trans. on Parallel and Distributed Systems, Vol.7, No.11, pp.1150-1163, November 1996.].
Most existing transformation methods assume that there is a dependency between two statements which access the same memory location when at least one of these accesses is a write [W.Shang, E.Hodzic, Z.Chen, On uniformization of affine dependence algorithms, IEEE Trans. on Computers, Vol.45, No.7, pp.827-839, July 1996.]. In this way, output-, anti- and flow-dependencies are all considered as real dependencies.
As for parallel target architectures, the previously applied compiler techniques treat parallelization and load balancing as the only key issues [S.Amarasinghe, J.Anderson, M.Lam, and C.Tseng, The SUIF compiler for scalable parallel machines, Proc. of the 7th SIAM Conf. on Parallel Proc. for Scientific Computing, 1995.]. They ignore the global data transfer and storage related cost when applied to data dominated applications like multi-media systems. Only speed is optimized, and not the power or memory size. The data communication between processors is usually taken into account in most recent methods [C.Diderich, M.Gengler, Solving the constant-degree parallelism alignment problem, Proc. EuroPar Conference, Lyon, France, August 1996. Lecture Notes in Computer Science series, Springer Verlag, pp.451-454, 1996.], but they use an abstract model (i.e. a virtual processor grid, which has no relation with the final number of processors and memories). In this abstract model, the real (physical) data transfer costs cannot be taken into account.
To adequately express and optimize the global data transfers in an algorithm, an exact and concise modeling of all dependencies is necessary. The techniques which are currently used in compilers do not use an exact modeling of the dependencies, but an approximation in the form of direction vectors or an extension of this (see [M.Wolf, M.Lam, A loop transformation theory and an algorithm to maximize parallelism, IEEE Trans. on Parallel and Distributed Systems, Vol.2, No.4, pp.452-471, October 1991.] for a detailed description). An example is [K.McKinley, A compiler optimization algorithm for shared-memory multiprocessors, IEEE Trans. on Parallel and Distributed Systems, Vol.9, No.8, pp.769-787, August 1998.], which combines data locality optimization and advanced interprocedural parallelization. However, it does not use an exact modeling, and as a result it cannot analyze the global data-flow.
Many techniques have been developed using an exact modeling, but these have not led to real compilers. The first method which used an exact modeling was the hyperplane method [L.Lamport, The parallel execution of do loops, Communications of the ACM, Vol.17, No.2, pp.83-93, February 1974.], where a linear ordering vector is proposed to achieve optimal parallelism. It works only for uniform loop nests, and all statements in the loop nest are considered as a whole: they all get the same ordering vector. Some particular cases of the hyperplane method have been proposed too. For example, selective shrinking and true dependence shrinking are in fact special cases of the hyperplane method, in which a particular scheduling vector is proposed.
Often a linear programming method is proposed to determine the optimal scheduling vector for the hyperplane method. Extensions where all statements of a loop nest are scheduled separately are called affine-by-statement scheduling. In [M.Dion, Y.Robert, Mapping affine loop nests: new results, Lecture Notes in Computer Science, Vol.919, High-Performance Computing and Networking, pp.184-189, 1995.], a further extension is considered, namely for affine dependences. These methods are mainly theoretical: the ILPs to solve become too complicated for real-world programs. Moreover, they are designed only to optimize parallelism, and they neglect the data reuse and locality optimization problem.
Some papers have addressed the optimization of algorithms using the polytope model; however, these methods are quite complex and not suitable for models on the order of hundreds of polytopes.
It is an aim of the invention to provide methods for transforming source code with loop transformations which focus on global data storage and transfer issues in parallel processors and which are feasible for realistic real-world data dominant codes.
In a first aspect of the invention a method (3) for transforming a first code (1) to a second code (2) is disclosed. Said first code and said second code describe at least the same functionality. Said first code and said second code are executable on a predetermined target architecture (4). The invented method transforms said first code into said second code, such that said target architecture can deliver said functionality while executing said code in a more cost optimal way. Said cost can relate to energy consumption of said target architecture but is not limited thereto. Said target architecture comprises at least two processors (5) and at least three levels of storage (6). Said target architecture can be denoted a parallel architecture. Each level comprises storage units, wherein variables or signals, defined in said codes, can be stored. Said levels define a hierarchy between storage units. Storage units in a first level are denoted local storage units. Storage units in the second level are less local than the ones in said first level. Storage units in a level N are less local than the ones in levels N−1, N−2, . . . and said first level. Storage units in the highest level are denoted global storage units. Said invented method transforms said codes in order to optimize transfers between said storage levels, meaning that data transfers between lower level storage units are preferred for optimality. The method thus improves at least the data locality. In each of said processors at least two storage levels can be found. The characteristics of said storage levels within a processor can be considered to be part of internal processor characteristics information. The invented method is characterized in that it exploits such internal processor characteristics information explicitly while transforming said first code into said second code.
Said code considered in the invented method is characterized in that it contains at least a plurality of loop nests. While transforming said first code into said second code, said loop nests in said code are considered simultaneously. While transforming said first code into said second code, said loop nests are taken into account globally, meaning said transformation does not work in a step-by-step fashion, wherein in each step a single loop nest is taken into account. Instead the invention considers a substantial subset of said loop nests, meaning at least two loop nests.
The invention can thus be formalized as a method for transforming a first code, being executable on a predetermined parallel architecture, to a second code, being executable on said predetermined parallel architecture, said first code and said second code describing the same functionality. Said predetermined architecture comprises at least two processors and at least three storage levels, at least two of said storage levels being within each of said processors. Said first code comprises at least two loop nests. Said method comprises the steps of: loading said first code; loading internal processor characteristics information of said processors; transforming said first code into said second code, in order to optimize an optimization function, related to at least data locality; said transformation taking into account at least two of said loop nests; said transformation exploiting said internal processor characteristics information of said processors. Said transformation takes into account a substantial subset of said loop nests.
Loop nests in code can be described by their iterators and their iterator space, being the values that said iterators can take while said code is executed. Within each loop nest, operations are performed. Said operations use, consume or read variables or signals. For each execution of an operation, the variables exploited depend on the values of the iterators of the related loop nest while executing said operation. Said operations also produce new values, to be stored in variables. The variable in which said values are stored also depends on the values of the iterators of the related loop nest while executing said operation. When variables produced in a loop nest are consumed in another loop nest, a dependency between loop nests is found. More particularly, dependencies between points in the iterator space of a first loop and points in the iterator space of a second loop can be defined. In a geometric representation, with each combination of iterators of a loop a point in said loop's iteration space is associated. In such a geometric representation, a dependency is associated with an arrow from a point of a first loop, performing production of values, to a point of a second loop, performing consumption of values. The set of arrows, associated with the loop nest dependencies, defines a pattern. The transforming step of the invented method comprises two steps: a so-called placement step and a so-called loop ordering step. In said placement step, the iteration spaces of the individual loop nests found in said first code are grouped in a global iteration space. Said placement is performed such that the pattern defined by the loop nest dependency arrows in said global iteration space shows at least improved regularity. It is clear that due to the nature of said placement step, information of said loop nests is exploited simultaneously. Said placement is also performed such that dependencies are made shorter.
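The geometric view above can be sketched in a few lines of code. The sketch below is a minimal, hypothetical illustration (the data structures `place` and `dependency_vectors` are assumptions for exposition, not the patented implementation): iteration points of two one-dimensional loop nests are placed into one global iterator space by adding an offset, and each dependency becomes an arrow (vector) from a producing point to a consuming point.

```python
# Hypothetical sketch: placement of loop nests into a global iterator space,
# with dependencies represented as arrows (vectors) between iteration points.

def place(points, offset):
    """Shift a loop nest's iteration points into the global iterator space."""
    return [tuple(p + o for p, o in zip(pt, offset)) for pt in points]

def dependency_vectors(producers, consumers, deps):
    """deps pairs a producer point index with a consumer point index;
    each pair yields an arrow vector in the global space."""
    return [tuple(c - p for p, c in zip(producers[i], consumers[j]))
            for i, j in deps]

# Nest A produces a value at each iteration i; nest B consumes it at its
# own iteration i. Placing B at offset 3 (after A) makes every arrow (3,):
# identical arrows, i.e. a maximally regular dependency pattern.
nest_a = [(i,) for i in range(3)]
nest_b = [(i,) for i in range(3)]
global_a = place(nest_a, (0,))
global_b = place(nest_b, (3,))
arrows = dependency_vectors(global_a, global_b, [(i, i) for i in range(3)])
```

With this placement all arrows coincide, which is exactly the kind of regular pattern the placement step aims for; a different offset would stretch or skew the arrows.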
Measures for regularity and distance measures for data locality are presented in the invention.
The method can thus further be described as the method described above, wherein said transformation step comprises: associating with said loop nests an iterator space with operations; determining loop nest dependencies; grouping said loop nest iterator spaces in one global iterator space; said grouping optimizing the regularity of the pattern defined by said dependencies in said global iterator space. Alternatively the invented method is described as the method described above, wherein said transformation step comprises: associating with said loop nests an iterator space with operations; determining loop nest dependencies; grouping said loop nest iterator spaces in one global iterator space; said grouping shortening the dependencies. Naturally a combination of improving regularity and shortening dependencies can be used as well.
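The two cost criteria named above, regularity and dependency length, can be given simple illustrative measures. The functions below are hypothetical sketches (the exact measures used by the invention are not specified here): dependency length is taken as the average Euclidean norm of the arrows, and regularity as the spread of the arrow vectors around their mean, which is zero when all arrows are identical.

```python
# Hypothetical cost measures over dependency arrows in the global iterator
# space: shorter arrows indicate better locality, lower spread indicates a
# more regular dependency pattern.
import math

def avg_length(arrows):
    """Average Euclidean length of the dependency arrows (locality proxy)."""
    return sum(math.sqrt(sum(c * c for c in a)) for a in arrows) / len(arrows)

def regularity_cost(arrows):
    """Mean squared deviation of the arrows from their mean vector;
    0.0 means all dependencies are identical (maximally regular)."""
    dim = len(arrows[0])
    mean = [sum(a[d] for a in arrows) / len(arrows) for d in range(dim)]
    return sum(sum((a[d] - mean[d]) ** 2 for d in range(dim))
               for a in arrows) / len(arrows)

regular = [(1, 0), (1, 0), (1, 0)]     # identical arrows
irregular = [(1, 0), (0, 3), (2, 2)]   # scattered arrows
```

A placement optimizer would then compare candidate groupings by such costs, preferring the one with lower spread and shorter arrows.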
In the invented method, after said placement step, a so-called loop ordering step is performed. Said loop ordering step exploits said global iterator space, and thus information of the loop nests in said code is taken into account simultaneously. The loop ordering step can be performed by using e.g. conventional space-time mapping techniques on said global iterator space, although the invention is not limited thereto. In the invented method, space-time mapping techniques which are adapted for taking into account data reuse possibilities and/or which take into account said internal processor characteristics information are presented. Each operation in said iteration space is assigned to one of said processors of said architecture, and for each of said operations in said iteration space the time at which said operation will be executed is determined. Once said assignment or space-time mapping is performed, code can be produced which is associated with the found space-time or processor-time mapping, meaning that when said code is executed on said architecture, said operations are executed on the assigned processor and at the assigned time instance. Said produced code is said second code.
The method described above, can thus further comprise the steps of: assigning a processor and an execution time to each operation in said global iterator space, thereby defining a processor-time mapping, said assigning resulting in optimizing an optimization function, related to data locality; determining said second code, being associated with said processor-time mapping.
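The processor-time mapping described above can be illustrated with a small sketch. The mapping below is an assumption chosen for exposition (a cyclic space distribution over the first dimension and a linear ordering vector `pi` for time), not the particular mapping of the invention: every point of the global iterator space receives a (processor, time) pair, and enumerating points per processor in increasing time order corresponds to emitting the second code.

```python
# Hypothetical affine processor-time mapping over a 2-D global iterator space.

def space_time_map(point, num_procs, pi):
    """Assign an iteration point to (processor, time).
    Space: cyclic distribution of the outer dimension (an assumption).
    Time: inner product with an ordering vector pi."""
    proc = point[0] % num_procs
    time = sum(p * c for p, c in zip(point, pi))
    return proc, time

# Schedule a 2x2 global iterator space on 2 processors with pi = (1, 1);
# sorting by (processor, time) gives the execution order per processor,
# from which the transformed (second) code would be generated.
schedule = sorted(
    (space_time_map((i, j), num_procs=2, pi=(1, 1)), (i, j))
    for i in range(2) for j in range(2)
)
```

In the invention the choice of mapping would additionally be steered by data reuse estimates and by the internal processor characteristics information, rather than by parallelism alone.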
In the invented method, data reuse estimates can be exploited in either one or both of said placement and loop ordering steps. It can be stated that the method described above exploits data reuse estimates while grouping, or alternatively that the method described above exploits data reuse estimates while assigning.
After applying the invented code transformation method, the transformed or second code can then be mapped, exploiting a parallel compiler, on custom-designed or available programmable software processors which show characteristics substantially similar to those of the target architecture used in the code transformation. The more closely the processor characteristics resemble the target architecture characteristics, the better the execution of said transformed code will be in terms of data transfer and storage costs. Naturally the invented code transformation method can be automated. In the invented method the target architecture model considers the real processors on which the algorithm has to be mapped, and even the real memory hierarchies available in these processors. Especially in the ordering step this information is exploited.
It must be emphasized that the invented transformation strategy is part of a data transfer and storage methodology, which means that the main goal is not optimizing speed, but optimizing data storage and transfers. Because the performance of current microprocessors is largely determined by the ability to optimally exploit the memory hierarchy, represented here as storage levels, and the limited system bus bandwidth, the invented methodology also leads to improved speed.
The invented method exploits exact modeling of the algorithms. The placement and ordering steps are clearly uncoupled in the invented method and even exploit different cost functions. An important feature of the invented method is that global loop transformations are performed on code or code parts with at least two procedures or with at least two loop nests. In the invention a PDG model of the source code is used. Such a PDG model provides an exact and concise modeling of all dependencies in the algorithm, and thus enables expressing and optimizing global data transfers.
The fact that the loop reordering step is split in at least two totally separate phases (placement, ordering) is characteristic for the invention. During the first phase (placement), the polytopes, being a geometric representation of loop nests, are mapped to an abstract space, with no fixed ordering of the dimensions. During the second phase (ordering), an ordering vector in that abstract space is defined. The advantage is that each of these phases is less complex than the original problem, and that separate cost functions can be used in each of the phases. This is crucial for the target domain, because global optimizations on e.g. MPEG-4 code involve on the order of hundreds of polytopes, even after pruning. Note that in an embodiment of the invention, a partitioning step is performed in between said placement and ordering steps.
The invented method works on code in single assignment style. By converting the code to single assignment, only the flow dependencies are kept (i.e. when the first access is a write and the second one is a read). Indeed, only these dependencies correspond to real data flow dependencies, and only these should constrain the transformations to be executed on the code. Converting the code to single assignment, will of course increase the needed storage size at first, but during the in-place mapping stage, the data will be compacted again in memory, and usually in a more efficient way than was the case in the initial algorithm.
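The effect of single-assignment conversion can be illustrated with a small example. The fragment below is a hypothetical illustration (an accumulation loop chosen for exposition, not code from the invention): in the original form one scalar is overwritten at every iteration, creating output- and anti-dependencies; in the single-assignment form every produced value gets its own storage location, so only the flow dependency (write followed by read) remains, at the price of temporarily increased storage.

```python
# Hypothetical illustration of single-assignment conversion.

def original(xs):
    acc = 0
    for x in xs:
        acc = acc + x          # acc is rewritten each iteration:
    return acc                 # output- and anti-dependencies on acc

def single_assignment(xs):
    acc = [0] * (len(xs) + 1)  # one location per produced value
    for i, x in enumerate(xs):
        acc[i + 1] = acc[i] + x  # each element is written exactly once:
    return acc[len(xs)]          # only flow dependencies remain
```

Both versions compute the same result; the single-assignment version uses more storage, which a later in-place mapping stage would compact again, as described above.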