The basic operations carried out by a processor are reading data, computing, and writing data as specified by sequences of instructions in a program. The data read or written by a processor are usually stored in a variety of storage media such as disks, memory, cache, or registers. A significant portion of the instructions in the instruction set of a processor access data stored in registers, which are the default location for the most frequently used data.
A typical register file consists of an array of registers for storing a specific type of data, such as integer or floating-point values. The instructions executed by a processor use many on-chip resources, such as function units, registers, buses, and cache, to carry out the computations specified by the instruction sequences in the program. In a Very Long Instruction Word (VLIW) processor, a compiler (instead of hardware) statically schedules the instructions and keeps track of, and reserves, the resources used by the instructions. A VLIW consists of a set of instructions that can be issued in the same cycle for parallel execution, taking advantage of the instruction-level parallelism (ILP) in the program. In a clustered VLIW processor, the on-chip resources are divided into a number of clusters. In a typical clustered VLIW processor, each cluster may contain a subset of the function units, a local register file, and a local cache. Often the intermediate results of computation produced in a cluster are needed in the same cluster as well as in other clusters. Inter-cluster copy instructions are used when such data needs to be transferred to one or more different clusters. Such inter-cluster copy instructions make use of interconnect structures such as shared or point-to-point inter-cluster communication buses.
The register files used in a clustered processor are either partitioned or replicated register files. In the replicated register file scheme, each local register file shares the entire architected register name space, necessitating inter-cluster copy operations to maintain coherency among local register files. On the other hand, in a partitioned register file scheme, the register name space is partitioned and allocated to a set of physically separate register files associated with each cluster. Partitioned register files have been used as local register files in clusters, especially in clustered VLIW processors, for more than a decade, mainly for reducing the number of ports.
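The partitioned scheme described above can be illustrated with a minimal sketch. It assumes a hypothetical two-cluster machine with 16 registers per cluster and a contiguous split of the register name space; the names `home_cluster`, `Op`, and `count_copies` are illustrative and do not come from any of the cited architectures. The sketch naively counts one inter-cluster copy for every operand that is read from a remote cluster's register file.

```python
from dataclasses import dataclass

# Assumed layout (hypothetical): r0..r15 live in cluster 0, r16..r31 in cluster 1.
NUM_CLUSTERS = 2
REGS_PER_CLUSTER = 16


def home_cluster(reg: int) -> int:
    """Return the cluster whose local register file holds register `reg`."""
    return reg // REGS_PER_CLUSTER


@dataclass(frozen=True)
class Op:
    name: str
    cluster: int   # cluster whose function unit executes the operation
    dest: int      # destination register number
    srcs: tuple    # source register numbers


def count_copies(ops) -> int:
    """Count remote operand reads, each needing one inter-cluster copy.

    This is a naive model: it does not reuse a value already copied
    to a cluster, so it is an upper bound for this simple layout.
    """
    return sum(
        1
        for op in ops
        for src in op.srcs
        if home_cluster(src) != op.cluster
    )


example_ops = [
    Op("add", 0, 1, (2, 3)),     # both sources are local to cluster 0
    Op("mul", 1, 17, (1, 18)),   # r1 lives in cluster 0: one copy
    Op("sub", 0, 4, (17, 16)),   # r17 and r16 live in cluster 1: two copies
]

copies = count_copies(example_ops)
print(f"original ops: {len(example_ops)}, copies needed: {copies}")
```

Under these assumptions, every remote read adds one instruction to the program, which is where the code-size and execution-time costs of partitioned register files originate.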
The main advantages of using partitioned register files with a smaller number of ports, compared to a single centralized register file, are reductions in area, access delay, and power. However, all the advantages due to clustering, in particular when partitioned register files are used, come at the cost of reduced performance due to the following: 1. an increase in the execution time of programs due to the inter-cluster copy instructions that are needed to move data between partitioned register files, and 2. an increase in code size due to the extra inter-cluster copy instructions that must be inserted in the program.
A summary of the relevant related art in partitioned register files for VLIW processors is given below. The article by R. P. Colwell et al. entitled “A VLIW Architecture for a Trace Scheduling Compiler” in Proceedings of the Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS II), SIGPLAN Notices, vol. 22, no. 10, pp. 180-192, October 1987, describes a VLIW processor with partitioned register files.
Explicit inter-cluster copy operations are scheduled by the compiler for accessing registers from a remote cluster. A. Capitanio, N. Dutt and A. Nicolau, in their article “Partitioned Register Files for VLIWs: A Preliminary Analysis of Trade-offs” published in the proceedings of the 25th Annual International Symposium on Microarchitecture, pp. 292-300, December 1992, describe yet another clustered VLIW processor with limited connectivity among clusters, which also needs inter-cluster copy instructions to access registers from remote clusters.
A different type of partitioned register file with an attached caching register buffer structure is described in the Ph.D. thesis entitled “Microarchitectures and Compilation Support for Clustered Instruction-level Parallel Processors”, University of Maryland, College Park, published in March 2001 and authored by Kailas, and in a EuroPar 2002 conference paper entitled “A Partitioned Register File Architecture and Compilation Scheme for Clustered ILP Processors” by Kailas et al. Their technique reduces the number of inter-cluster copy operations by combining several inter-cluster copy instructions into a single new “sendb” instruction, which carries out a selective broadcast of a register value to the caching register buffers associated with the destination clusters. U.S. Pat. No. 6,282,585B1, issued on Aug. 28, 2001 in the name of Batten et al. and entitled “Cooperative interconnection for reducing port pressure in clustered microprocessors”, describes three techniques to reduce the port requirements of clustered processors: register file replication, duplicating the interconnect using multiple global move units, and splitting inter-cluster copy instructions into two sub-instructions. These techniques, however, do not solve the problem of the large number of copy instructions required for inter-cluster communication. U.S. Pat. No. 7,114,056, issued on Sep. 26, 2006 in the name of M. Tremblay and W. Joy and entitled “Local and global register partitioning in a VLIW processor”, describes a register file partitioning scheme for a VLIW processor in which each partitioned register file is further partitioned into global and local registers, such that the global registers are kept coherent across all function units by broadcasting the write operations.
While this scheme may help avoid explicit inter-cluster copy operations by applying the replicated register file approach to the global registers in each register file, it suffers from all the drawbacks of the replicated register file scheme, such as a large inter-cluster bandwidth requirement and a large number of interconnect paths.
From the above discussion, it follows that all of the prior art suffers from the drawbacks associated with partitioned register files used in clustered VLIW processors, such as an increase in code size due to the large number of inter-cluster copy instructions, and a loss of performance due to inter-cluster copy instructions stretching the critical paths in programs.