(A) Field of the Invention
The invention relates to the field of optimizing compilers for computer systems, and more particularly, to the field of optimizing compilers for processors with irregular register files.
(B) Description of the Related Art
It is desirable that computer programs be as efficient as possible in execution time and memory usage. This need has driven the development of computer architectures capable of executing target program instructions in parallel. A recent trend in processor design is to build processors with greater instruction issue capability and many functional units. For example, FIG. 1 shows the architecture of a Parallel Architecture Core (PAC) processor 10, which is a five-way issue DSP. The PAC processor 10 comprises a first cluster 12A and a second cluster 12B, wherein each cluster 12A or 12B comprises a first functional unit 20, a second functional unit 30, a first local register file 14 connected to the first functional unit 20, a second local register file 16 connected to the second functional unit 30, and a global register file 22 having a ping-pong structure formed by a first register bank B1 and a second register bank B2. Each register file includes a plurality of registers. The PAC processor 10 further comprises a third functional unit 40, which is placed independently outside the first cluster 12A and the second cluster 12B, and a third local register file 18 connected to the third functional unit 40. The first functional unit 20 is a load/store unit (M-Unit), the second functional unit 30 is an arithmetic unit (I-Unit), and the third functional unit 40 is a scalar unit (B-Unit). The third functional unit 40 is in charge of branch operations and is also capable of performing simple load/store and address arithmetic operations. The global register files 22 are used to communicate data across the clusters 12A and 12B. Because only the third functional unit 40 is able to access all global register files 22, only it is capable of executing copy operations across the clusters 12A and 12B. The first local register file 14, the second local register file 16, and the third local register file 18 are accessible only by the M-Unit 20, the I-Unit 30, and the B-Unit 40, respectively.
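The accessibility rules described above may be illustrated by the following minimal Python sketch. It is not part of the PAC design itself. The function name, the string labels for units and register files, and the use of None as the cluster of the B-Unit 40 are all hypothetical. The sketch also assumes, as the description implies but does not state outright, that the M-Unit 20 and the I-Unit 30 reach only the global register file 22 of their own cluster.

```python
# Illustrative model of register-file accessibility; all names are hypothetical.

def can_access(unit, unit_cluster, regfile, regfile_cluster):
    """Return True if `unit` may access `regfile` under the rules described.

    unit:            "M", "I", or "B"
    unit_cluster:    "12A" or "12B"; None for the B-Unit (outside both clusters)
    regfile:         "local_M", "local_I", "local_B", or "global"
    regfile_cluster: cluster containing the register file; None for "local_B"
    """
    if regfile == "global":
        # The B-Unit can access all global register files.
        # Assumption: M- and I-Units reach only their own cluster's global file.
        return unit == "B" or unit_cluster == regfile_cluster
    # Each local register file is accessible only by the unit it is connected to.
    owner = {"local_M": "M", "local_I": "I", "local_B": "B"}[regfile]
    return unit == owner and unit_cluster == regfile_cluster


# The B-Unit can read the global register files of both clusters,
# which is why it alone can copy values across clusters.
print(can_access("B", None, "global", "12A"), can_access("B", None, "global", "12B"))
# An M-Unit in cluster 12A cannot reach the global register file of cluster 12B.
print(can_access("M", "12A", "global", "12B"))
```

Under this model, an inter-cluster copy must be routed through the B-Unit 40, since no other unit can access a global register file 22 in both clusters.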
Each global register file 22 has only a single set of access ports, which is shared by the M-Unit 20 and the I-Unit 30. In any operation cycle, each register bank B1 or B2 of the global register file 22 can be accessed by only one of the first functional unit 20 and the second functional unit 30, so the two functional units 20, 30 must access different banks B1, B2 in each operation cycle. This is the access constraint of the ping-pong structure.
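As an illustration only, the ping-pong access constraint may be expressed as a per-cycle check, sketched below in Python. The function name and the representation of accesses as (unit, bank) pairs are assumptions made for this example and do not come from the PAC architecture.

```python
# Illustrative check of the ping-pong constraint for one cluster in one cycle.
# The data representation is hypothetical.

def violates_ping_pong(cycle_accesses):
    """cycle_accesses: iterable of (unit, bank) pairs for a single operation
    cycle, with unit in {"M", "I"} and bank in {"B1", "B2"}.

    Returns True if the M-Unit and the I-Unit access the same bank of the
    global register file in this cycle.
    """
    banks_by_unit = {"M": set(), "I": set()}
    for unit, bank in cycle_accesses:
        banks_by_unit[unit].add(bank)
    # A conflict exists when some bank is touched by both units.
    return bool(banks_by_unit["M"] & banks_by_unit["I"])


print(violates_ping_pong([("M", "B1"), ("I", "B2")]))  # different banks
print(violates_ping_pong([("M", "B1"), ("I", "B1")]))  # same bank
```

A scheduler or register allocator could apply such a check to each cycle of a candidate schedule and reject or repair any cycle for which it returns True.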
The process of optimizing a target program's execution speed centers on scheduling the execution of the target program instructions to take advantage of the multiple computing resource units. One optimization strategy is to focus on loops, where many applications spend the majority of their execution time. Software pipelining (SWP) is a loop optimization technique applicable to PAC architectures. By overlapping the execution of successive iterations of the loop body, SWP increases the instruction-level parallelism (ILP) and thereby improves the performance of PAC architectures. FIG. 2 illustrates a software pipelining scheme called modulo scheduling. Parallel instruction processing is obtained by starting a new iteration at fixed time intervals, called the initiation interval (II). The scheduled length of a single iteration is TL 138, and the iteration is divided into stages of length II 126. Loop execution begins with the first stage 140 (Stage 0) of the first iteration 128. During the first II cycles, no other iteration executes concurrently. After the first II cycles, the first iteration 128 enters Stage 1, and the second iteration 136 enters Stage 0. New iterations begin every II cycles until a state is reached in which all stages of different iterations are executing. Toward the end of loop execution, no new iterations are initiated, and those that are in various stages of progress gradually complete. These three phases of loop execution are termed the prologue 130, the kernel 132, and the epilogue 134. Since a smaller II value implies higher throughput, almost all scheduling techniques attempt to derive a schedule that minimizes the II value. After a valid schedule is reached, the registers used by the scheduled instructions are allocated in the register files. Since the executions of loop iterations overlap, and the accessibility of the global register file is limited, the process of pipelining loop instructions becomes more complicated.
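The overlap of iterations under modulo scheduling may be illustrated by the following minimal Python sketch. It is an illustration under stated assumptions, not a scheduler: the function name and parameters are hypothetical, time is counted in blocks of II cycles, the number of stages is taken to be TL divided by II and rounded up, and the number of iterations is assumed to be at least the number of stages so that a kernel phase exists.

```python
import math

# Illustrative timeline of a modulo-scheduled loop; names are hypothetical.

def modulo_timeline(n_iters, tl, ii):
    """Return one (phase, active) entry per block of II cycles, where
    `active` lists the (iteration, stage) pairs executing in that block.

    Assumes n_iters >= ceil(tl / ii), so that a kernel phase exists.
    """
    stages = math.ceil(tl / ii)
    timeline = []
    for t in range(n_iters + stages - 1):
        # Iteration i starts at block i, so at block t it is in stage t - i.
        active = [(i, t - i) for i in range(n_iters) if 0 <= t - i < stages]
        if t < stages - 1:
            phase = "prologue"   # pipeline is still filling
        elif t < n_iters:
            phase = "kernel"     # every stage is occupied
        else:
            phase = "epilogue"   # no new iterations start; pipeline drains
        timeline.append((phase, active))
    return timeline


for phase, active in modulo_timeline(n_iters=4, tl=6, ii=2):
    print(phase, active)
```

With four iterations, TL = 6 and II = 2, each iteration has three stages; the first two blocks form the prologue, the next two the kernel in which three iterations execute concurrently, and the last two the epilogue.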