A significant problem with wide-issue load-store microprocessors is port pressure on the register file: the register file must support a large number of simultaneous accesses and must therefore have many ports. A fully-connected processor organization has execution units which each have full access to the entire register file. Predicate registers and lock files for both registers and predicates also require a correspondingly large number of ports. Since the number of ports can adversely impact the area, cost and maximum clock speed of the processor, it is generally desirable to keep the number of ports under some small number, such as 16 or 32. Execution units and register files may therefore be "clustered" in order to reduce the number of ports required for all simultaneously-utilized execution units.
A clustered organization, in contrast to a fully-connected organization, has groups, i.e., "clusters," of execution units, each with a portion of the register file. The portion of the register file associated with a given cluster may be referred to as "local" registers. The execution units in a given cluster have full access to the local registers, but limited access to the registers of other clusters. In a clustered organization, the degree of access one cluster has to the others' register files and the interconnection between clusters must be specified. The purpose of clustering is to reduce the register file port pressure. However, the need for some execution units to have global register file access keeps the typical cluster implementation from being truly scalable. In particular, load, store, and branch units, if shared between clusters, generally need global register file access. Register file ports can be shared among the units requiring access to them. In this case, techniques for arbitrating among those units, and for stalling a unit which is not allowed to use a port it has requested, generally must be provided.
Each type of execution unit in a processor needs a certain number of register file ports to support its operation. With the use of a technique such as virtual single-cycle execution, as described in U.S. patent application Ser. No. 09/080,787 filed May 18, 1998 and entitled "Virtual Single-Cycle Execution in Pipelined Processors," each unit also requires a certain number of ports on a file of lock registers, a logically separate entity. With predicated execution based on architecturally separate predicate registers, a certain number of ports are also required on the predicate file and the predicate lock file.
FIG. 1 summarizes the port requirements for the following types of conventional execution units: branch units, store units, load units, memory units and arithmetic logic units (ALUs). The instructions associated with each of these types of execution units will be described below. Branch units process conditional branch instructions of the form

$$[(p)]\ \text{branch to } r_x \text{ if } r_y \circ r_z,$$

where register $r_x$ contains an instruction address, and registers $r_y$ and $r_z$ contain the values to be compared using the operator $\circ$ (representing operators such as =, <, >, etc.). The branch instruction requires reads of $r_x$, $r_y$ and $r_z$, reads of the locks on $r_x$, $r_y$ and $r_z$, and a read of predicate $p$ and the lock on predicate $p$.
Store units process store instructions of the form

$$[(p)]\ \text{mem}[r_x + r_y] \leftarrow r_z.$$

The store instruction requires reads of $r_x$, $r_y$ and $r_z$, reads of the locks on $r_x$, $r_y$ and $r_z$, and a read of predicate $p$ and the lock on predicate $p$. It is assumed for this example that predicate values are never individually stored in memory; for spilling and context switches, a block store instruction should be provided, which would not be executed in parallel with other instructions.
Load units process load instructions of the form

$$[(p)]\ r_x \leftarrow \text{mem}[r_y + r_z].$$

The load instruction requires reads of $r_y$ and $r_z$, and a write of $r_x$. It requires reads of the locks on $r_x$, $r_y$, and $r_z$, and two writes of the lock on $r_x$, i.e., once to lock it, and once to unlock it. It also requires the read of predicate $p$ and the lock on predicate $p$. It is assumed for this example that predicate values are never individually loaded from memory; for filling and context switches, a block load instruction should be provided, which would not be executed in parallel with other instructions.
A memory unit can perform either a load or a store on each cycle. Therefore, it has the combined port requirements of a load and store unit. It may seem that the memory unit requires only three total register ports, since it cannot perform both a load and a store simultaneously. However, in a pipelined memory unit, a load followed by a number of stores will require four simultaneous register accesses during the load writeback. Conversely, a store followed by a load will use only two ports when the load is at register read. The average number of ports is three, but the peak is four.
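The peak-versus-average argument can be checked with a small cycle-by-cycle tally. The following is only an illustrative sketch: it assumes a hypothetical two-cycle distance between register read and writeback and one instruction issued per cycle, and the names `WRITEBACK_DELAY` and `register_ports_per_cycle` are introduced here for illustration rather than taken from the text.

```python
from collections import Counter

# Assumed (hypothetical) number of cycles between register read and writeback.
WRITEBACK_DELAY = 2


def register_ports_per_cycle(ops, delay=WRITEBACK_DELAY):
    """Return register file accesses per cycle for ops issued one per cycle."""
    usage = Counter()
    for cycle, op in enumerate(ops):
        if op == "load":
            usage[cycle] += 2          # read the two address registers
            usage[cycle + delay] += 1  # write the destination at writeback
        elif op == "store":
            usage[cycle] += 3          # read two address registers and the data
        else:
            raise ValueError(f"unknown op: {op}")
    if not usage:
        return []
    return [usage[c] for c in range(max(usage) + 1)]


load_then_stores = register_ports_per_cycle(["load", "store", "store", "store"])
store_then_load = register_ports_per_cycle(["store", "load"])
print(load_then_stores)  # the cycle where a store's reads meet the load writeback uses 4
print(store_then_load)   # the cycle where the load is at register read uses 2
```

Under these assumptions the load-then-stores sequence reaches four simultaneous accesses in the cycle where a store's register read coincides with the load's writeback, while the store-then-load sequence needs only two ports in the cycle where the load is at register read.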
Instructions processed by the ALU may be of the form

$$[(p)]\ r_x \leftarrow r_y \circ r_z,$$

where operator $\circ$ represents &, +, etc., and predicate $p$, if provided, indicates whether the instruction's results should be written back or annulled. These instructions require reads of registers $r_y$ and $r_z$ and a write of register $r_x$. They require reads of the locks on $r_x$, $r_y$, and $r_z$, and two writes of the lock on $r_x$, i.e., one to lock the register at register read, and one to unlock the register at register writeback. Two write ports are required on the lock file for any unit which writes to a register. Even though the first write to the lock (at register read) and the second (at register writeback) are displaced in time, two write ports must be dedicated to the unit in order to be able to issue an instruction to it on every cycle; if only one is given, the first write for a later instruction and the second write for an earlier instruction will contend for it.
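The contention between the lock write of a later instruction and the unlock write of an earlier one can be illustrated by counting lock-file writes per cycle. This is a minimal sketch under stated assumptions: one instruction is issued to the unit every cycle, and the writeback delay of three cycles is a hypothetical value chosen only for the example.

```python
from collections import Counter


def lock_writes_per_cycle(num_instructions, delay):
    """Count lock-file writes per cycle for one unit issuing every cycle."""
    writes = Counter()
    for issue_cycle in range(num_instructions):
        writes[issue_cycle] += 1          # lock the destination at register read
        writes[issue_cycle + delay] += 1  # unlock the destination at writeback
    return writes


writes = lock_writes_per_cycle(num_instructions=6, delay=3)
peak_lock_writes = max(writes.values())
print(peak_lock_writes)
```

Once the first instruction reaches writeback, each cycle carries both an unlock for an earlier instruction and a lock for a newly issued one, so a single lock write port would force a stall on every such cycle.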
The ALUs may also perform a predicate move instruction, having the form

$$[(p)]\ p_y \leftarrow p_z.$$

To support this form of an ALU instruction, each ALU requires two predicate read ports, one predicate write port, three predicate lock read ports and two predicate lock write ports. Another form of ALU instruction sets or clears a predicate, based on a comparison between registers, and may have the following form

$$[(p_x)]\ \text{set } p_y \text{ if } r_y \circ r_z \quad \text{or} \quad [(p_x)]\ \text{clear } p_y \text{ if } r_y \circ r_z,$$

where the operator $\circ$ represents =, <, etc. The number of ports already provided above will support this form of ALU instruction.
FIG. 2 shows the fully-connected port requirements for exemplary organizations O1 and O2, and a more general processor organization. Organization O1 has one branch unit, one memory unit, and four ALUs. O2 has two branch units, four memory units, and 32 ALUs. The general processor organization has b branch units, l load units, s store units, m memory units, and a ALUs. As noted previously, in a clustered organization, the register files and the set of execution units are partitioned into partially connected groups: each execution unit has full access to the register files in its local cluster, but limited access to the register files in any other cluster; the degree of access and the method of communication between clusters must be specified. A clustered organization with c clusters and e execution units in each cluster has a = ce total execution units in the clusters. An unclustered organization of the same size could be described either as having ce units in one cluster or as having c fully-connected clusters with e execution units in each. Using the latter definition, organization O1 has four ALUs in a single cluster, and organization O2 has 32 ALUs arranged as four ALUs in each of eight clusters. For these examples, it is assumed that branch, store, load, and memory units are global units, having access to all clusters' register files. In addition, the register files and predicate files can be treated separately. For example, an organization could have a unified, i.e., unclustered, predicate file and a clustered register file. It could even have both the predicate file and the register file clustered, but with different numbers of clusters. Lock files, on the other hand, are logically divided into the same number of clusters as the file they lock; a predicate file with c clusters, for example, has a corresponding predicate lock file with c clusters.
For simplicity of illustration, the examples will deal with register and predicate files partitioned into the same number of clusters.
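The fully-connected totals can be tallied from the per-unit requirements. The sketch below is illustrative only: FIG. 1 is not reproduced here, so the per-unit counts in `PORTS_PER_UNIT` are transcribed from the prose descriptions of each unit (reads plus writes, using the peak of four register ports for a memory unit), and the table and function names are assumptions made for this example.

```python
# Ports per unit (reads + writes), as read from the prose descriptions.
# Order: register file, register lock file, predicate file, predicate lock file.
PORTS_PER_UNIT = {
    "branch": (3, 3, 1, 1),
    "store": (3, 3, 1, 1),
    "load": (3, 5, 1, 1),
    "memory": (4, 5, 1, 1),  # peak register ports for a pipelined memory unit
    "alu": (3, 5, 3, 5),
}
FILES = ("register", "register lock", "predicate", "predicate lock")


def fully_connected_ports(unit_counts):
    """Sum the ports each file needs when every unit reaches every file."""
    totals = [0, 0, 0, 0]
    for unit, count in unit_counts.items():
        for i, ports in enumerate(PORTS_PER_UNIT[unit]):
            totals[i] += count * ports
    return dict(zip(FILES, totals))


o1 = fully_connected_ports({"branch": 1, "memory": 1, "alu": 4})
o2 = fully_connected_ports({"branch": 2, "memory": 4, "alu": 32})
print(o1)
print(o2)
```

With these per-unit counts, organization O2 comes to 118 register file ports, 186 register lock file ports, 102 predicate file ports and 166 predicate lock file ports, which agrees with the fully-connected figures quoted for O2 in the comparisons that follow.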
FIGS. 3 and 4 show the port requirements for the O1, O2 and general examples described above, for write-only cluster interconnection and read-only cluster interconnection, respectively. The terms "write-only" and "read-only" in this context generally refer to whether remote register files and predicate files can be written or read. Whether or not locks must be written or read is a consequence of register and predicate writing and reading. Write-only clustered interconnection allows writing to remote clusters' register files, but does not allow reading from remote register files. Communication takes place by writing values into other clusters. Register locks as in the above-noted virtual single-cycle execution technique may be used to prevent overwriting registers which are in use. Any ALU may still set the value of a predicate in any cluster, but may not read remote predicates. The ports required by the ALUs (the only non-global execution units) change as a result of the write-only restriction for remote clusters. Register read ports are only required for local ALUs. The lock file port requirements change, since only one lock read port is required for remote ALUs. Likewise, predicate register and predicate lock port requirements change.
As shown in FIG. 3, for the example organization O2, the write-only interconnection has reduced register file port requirements by 47% (from 118 to 62), register lock file port requirements by 30% (from 186 to 130), predicate file port requirements by 55% (from 102 to 46), and predicate lock file port requirements by 34% (from 166 to 110). These improvements have come at the expense of reduced connectivity, forcing the addition of move instructions in some circumstances.
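The write-only figures for O2 can be reproduced per cluster by separating global units, local ALUs and remote ALUs. This is a sketch under assumptions: the per-unit counts are read from the prose rather than from FIG. 1 or FIG. 3, and the remote-ALU row assumes one write port on each data file plus one lock read and two lock writes on each lock file, an inference that happens to match the quoted totals. All names are hypothetical.

```python
# Order: register file, register lock file, predicate file, predicate lock file.
GLOBAL_UNIT = {"branch": (3, 3, 1, 1), "memory": (4, 5, 1, 1)}
LOCAL_ALU = (3, 5, 3, 5)
# Assumed remote-ALU needs under write-only interconnection: one data write,
# and one lock read plus two lock writes on the corresponding lock file.
REMOTE_ALU_WRITE_ONLY = (1, 3, 1, 3)


def write_only_ports(branches, memories, clusters, alus_per_cluster):
    """Ports needed on each file of one cluster with write-only interconnection."""
    remote_alus = (clusters - 1) * alus_per_cluster
    totals = []
    for i in range(4):
        ports = branches * GLOBAL_UNIT["branch"][i]
        ports += memories * GLOBAL_UNIT["memory"][i]
        ports += alus_per_cluster * LOCAL_ALU[i]
        ports += remote_alus * REMOTE_ALU_WRITE_ONLY[i]
        totals.append(ports)
    return totals


o2_fully_connected = [118, 186, 102, 166]  # figures quoted in the text
o2_write_only = write_only_ports(branches=2, memories=4, clusters=8, alus_per_cluster=4)
reductions = [
    round(100 * (full - wo) / full)
    for full, wo in zip(o2_fully_connected, o2_write_only)
]
print(o2_write_only)
print(reductions)
```

Each file saves the same 56 ports in this tally, so the percentage reduction differs between files only because the fully-connected baselines differ.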
The read-only clustered interconnection allows reading from remote clusters' register files, but does not allow writing. Communication takes place by writing results to the local cluster's register file, and reading from remote clusters' register files. With a read-only interconnection, register and predicate file write ports are only required for local ALUs, not remote ALUs. This also lowers the requirements for lock files. FIG. 4 summarizes the port requirements. Compared to the fully-connected version of example organization O2, the read-only interconnection version of O2 has reduced register file port requirements by 24% (from 118 to 90), register lock file port requirements by 45% (from 186 to 102), predicate file port requirements by 27% (from 102 to 74), and predicate lock file port requirements by 51% (from 166 to 82). Again, these improvements come at the expense of reduced connectivity, forcing the addition of move instructions in some circumstances.
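The read-only figures for O2 can be checked by hand in the same way. The arithmetic below is a sketch under assumptions: FIG. 4 is not reproduced here, so it assumes each of the 28 remote ALUs needs two read ports on every file of a given cluster (two source registers, the locks on those sources, two predicates and their locks), while the two branch units, four memory units and four local ALUs keep the per-unit requirements described earlier. Each line is global units, plus local ALUs, plus remote ALUs.

```latex
\begin{align*}
\text{register file:}       &\quad (2\cdot 3 + 4\cdot 4) + 4\cdot 3 + 28\cdot 2 = 22 + 12 + 56 = 90 \\
\text{register lock file:}  &\quad (2\cdot 3 + 4\cdot 5) + 4\cdot 5 + 28\cdot 2 = 26 + 20 + 56 = 102 \\
\text{predicate file:}      &\quad (2\cdot 1 + 4\cdot 1) + 4\cdot 3 + 28\cdot 2 = 6 + 12 + 56 = 74 \\
\text{predicate lock file:} &\quad (2\cdot 1 + 4\cdot 1) + 4\cdot 5 + 28\cdot 2 = 6 + 20 + 56 = 82
\end{align*}
```

Under these assumptions the lock files benefit most, because a remote ALU no longer needs the lock read and two lock writes associated with a destination register or predicate.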
Although the above-described conventional write-only interconnection and read-only interconnection clustering techniques can provide a significant reduction in port pressure, further improvements are needed. A number of techniques have attempted to provide such improvements. For example, the Digital Equipment Corp. Alpha 21264 processor, as described in L. Gwennap, "Digital 21264 Sets New Standard," Microprocessor Report, Vol. 10, No. 14, Oct. 28, 1996, uses a form of register replication to reduce port pressure. However, this processor allows all execution units to use any register as a source or destination, replicates only registers, not predicates or locks, and accomplishes replication by writing results directly to both replicates of the register file. The number of ports required for replication in this technique is therefore a function of the total number of functional units, which limits scalability. Another known approach to reducing port pressure is found in Multiflow machines, which use clusters interconnected by busses, as described in, e.g., P. G. Lowney et al., "The Multiflow Trace Scheduling Compiler," The Journal of Supercomputing, Vol. 7, pp. 51-142, 1993. Unfortunately, these and other techniques suffer from a number of significant drawbacks, and have been generally unable to provide further substantial reductions in register port pressure.