1. Field of the Invention
The present invention relates to a method for allocating registers for a processor, and more particularly, to a method for allocating registers for a Parallel Architecture Core (PAC) processor.
2. Description of the Related Art
Instruction-level parallelism (ILP) is increasingly deployed in high-performance digital signal processors (DSPs) with very long instruction word (VLIW) data-path architectures. Such DSPs usually have multiple functional units, and the number of read/write ports connecting register files increases with the number of functional units. The distributed register-file design is being adopted to reduce the amount of read/write ports in registers. The distributed register-file design includes features such as multi-cluster register files, multiple banks, and limited temporal connectivities such as ping-pong architectures. These architectures have been shown to be able to reduce the numbers of read/write ports in registers and reduce power consumption while sustaining high ILP in VLIW architectures.
FIG. 1 illustrates the architecture of a PAC processor utilizing a ping-pong architecture. The PAC processor 10 comprises a first cluster 12A and a second cluster 12B, wherein each cluster 12A and 12B comprises a first functional unit 20, a second functional unit 30, a first local register file 14 connected to the first functional unit 20, a second local register file 16 connected to the second functional unit 30, and a global register file 22 having a ping-pong structure formed by a first register bank B1 and a second register bank B2. Each register file includes a plurality of registers. Also, the PAC processor 10 comprises a third functional unit 40, which is placed independent of and outside the first cluster 12A and the second cluster 12B. A third local register file 18 is connected to the third functional unit 40. The first functional unit 20 is a load/store unit (M-Unit), the second functional unit 30 is an arithmetic unit (I-Unit), and the third functional unit 40 is a scalar unit (B-unit). The third functional unit 40 is in charge of branch operations and also capable of performing simple load/store and address arithmetic. The global register files 22 are used to communicate across clusters 12A and 12B; only the third functional unit 40, being able to access all global register files 22, is capable of executing such copy operations across clusters 12A and 12B. The first local register file 14, the second local register file 16, and the third local register file 18 are only accessible by the M-Unit 20, I-Unit 30, and B-Unit 40, respectively. Each global register file 22 has only a single set of access ports, shared by the M-Unit 20 and I-Unit 30. Each register bank B1 or B2 of the global register file 22 can only be accessed by either the first functional unit 20 or the second functional unit 30 in an operation cycle, so these two functional units 20, 30 can only access different banks B1 or B2 in each operation cycle. This is an access constraint of the ping-pong structure.
The presence of distributed register-file architectures featuring multi-bank register files, distributed register clusters, and limited temporal connectivities in embedded VLIW DSPs presents challenges for compilers attempting to generate efficient codes for multimedia applications. Research on compiler optimizations to address this issue first addressed issues related to cluster-based architectures. This includes partitioning register files to work with instruction scheduling, and loop partitions for clustered register files. To address more general issues dealing with multi-bank register files, distributed register clusters, and ping-pong architecture on embedded VLIW DSPs, Lee et al. (U.S. Pat. No. 7,650,598B2) provided a pioneering ping-pong-aware local-favorable (PALF) scheme, which uses a local-favorable method for bank assignments that applies graph partitioning for assigning multi-bank and multi-cluster register files. This scheme first performs partitioning, followed by register allocation and instruction scheduling phases.
However, the method provided by Lee et al. does not take the cycle information of the instructions into account. As a result, the register allocation provided accordingly often fails to reach a preferable result. Therefore, the present invention provides an improved method for allocating registers for a PAC processor utilizing the ping-pong architecture that achieves a better solution compared to the prior art method.