1. Field of the Invention
The present invention relates to computers, and more particularly, to a processor having distributed register caches and a cache coherency protocol for maintaining coherency among the register values in the distributed register caches.
2. Description of the Related Art
Referring to FIG. 1, a pipelined processor according to the prior art is shown. The processor 10 includes, among other elements, an instruction cache 12, an instruction prefetch unit 14, an instruction buffer 16, a dispatch unit 18, a processing unit 20, a register scoreboard unit 22, and a memory hierarchy 24. The processing unit 20 includes one or more pipelines 26a through 26z. The memory hierarchy 24 includes, from top to bottom, a register file (RF) 28, a data cache 30, the instruction cache 12, main memory 32, disk storage 34, and typically external memory (not shown). In some processors, the dispatch unit 18 is capable of issuing multiple (i) instructions per cycle. State of the art processors today can issue up to four (i=4) instructions per cycle.
During each clock cycle, the dispatch unit 18 checks which of the pipelines 26 are available and ascertains the register values needed by the next (i) instructions in the instruction buffer 16 considered for dispatch. For those of the next (i) instructions for which pipeline resources are available, the dispatch unit 18 checks the register scoreboard 22 to determine whether any of the needed register values are currently being recomputed in one of the pipelines 26a through 26z. If a needed register value is immediately available by a bypassing operation, the register value is bypassed to the pipeline 26 that is going to execute the instruction that needs the register value. If the register value is not immediately available because it is being recomputed, the instruction that needs the register value may be stalled. If a needed register value is not in the pipelines 26, the dispatch unit 18 accesses the memory containing the register file 28. Those of the (i) instructions for which both pipeline resources and register values are available are dispatched. When an instruction has completed execution, its results are written back to the register file 28 and made available to subsequent instructions.
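The per-operand decision described above can be illustrated with a minimal sketch. It is not taken from any actual processor design: the names `OperandSource`, `locate_operand`, and `can_dispatch` are hypothetical, and the scoreboard is modeled, as an assumption, as a mapping from each in-flight destination register to a flag indicating whether its new value can already be bypassed.

```python
from enum import Enum


class OperandSource(Enum):
    BYPASS = "bypass"                # value forwarded from a pipeline
    STALL = "stall"                  # value still being recomputed
    REGISTER_FILE = "register_file"  # value read from the register file


def locate_operand(reg, scoreboard):
    """Decide where a source register value would come from.

    `scoreboard` (hypothetical model) maps each register that is being
    recomputed in a pipeline to True if its result can be bypassed now.
    """
    if reg not in scoreboard:
        # Not in any pipeline: the register file must be accessed.
        return OperandSource.REGISTER_FILE
    return OperandSource.BYPASS if scoreboard[reg] else OperandSource.STALL


def can_dispatch(source_regs, scoreboard, pipeline_free):
    """An instruction is dispatched only if a pipeline is free and
    none of its source operands would force a stall."""
    if not pipeline_free:
        return False
    return all(
        locate_operand(reg, scoreboard) is not OperandSource.STALL
        for reg in source_regs
    )


# Example: r1 is ready for bypass, r2 is still being recomputed.
scoreboard = {"r1": True, "r2": False}
print(can_dispatch(["r1", "r3"], scoreboard, pipeline_free=True))
print(can_dispatch(["r2", "r3"], scoreboard, pipeline_free=True))
```

In this sketch every operand classified as `REGISTER_FILE` corresponds to one read port access in the cycle, which is the quantity the worst case analysis below is concerned with.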
The characteristics of the register file 28 are dictated by the instruction set developed for the processor 10. In other words, the instruction set defines the type and size of registers in the register file 28 available to the programmer. For example, the SPARC instruction set V9, jointly developed by Sun Microsystems, Inc., Mountain View, Calif. and SPARC International, Menlo Park, Calif., defines an integer register file having a maximum of five hundred and twenty (520) registers and a separate floating point register file having up to thirty-two (32) registers, with each register being sixty-four (64) bits wide. (Note, for the sake of simplicity, FIG. 1 illustrates a "generic" register file 28, and does not show separate integer and floating point register files.)
State of the art processors, such as the UltraSPARC™ processor from Sun Microsystems, the Power PC™ from Motorola and IBM, and the Alpha™ chip from Digital Equipment Corporation, share a number of similarities. Each of these processors uses an on-chip static random access memory (SRAM) array for implementing its register files 28. To the best of the Applicants' knowledge, these processors all provide, for both the integer and floating point register files 28, a number of read ports and a number of write ports equal, respectively, to the maximum number of register reads and register writes that may be needed during a "worst case" cycle. For the sake of illustrating a worst case cycle, an example involving the UltraSPARC processor and the V9 instruction set is provided.
The UltraSPARC processor is a four issue processor (i=4) that includes eight (8) pipelines. The eight pipelines include two integer units; one load/store unit; two graphics units; one branch unit; one add floating point unit (FPU); one multiply FPU; and one divide/square root FPU. The V9 instruction set defines integer instructions that require up to two source register operands and one destination operand. Load/store instructions can specify either one, two or three source operands. Floating point instructions can specify up to two source operands and one destination register. Consider a cycle where the four instructions considered for dispatch include three integer operations, each requiring two source operands, and one load/store operation, requiring three source operands. Since the UltraSPARC processor has two integer units and one load/store unit, only the two oldest integer instructions and the load/store instruction can be dispatched in the cycle. Since the resources are not available to dispatch the third integer instruction, it is stalled until a later cycle. Under these conditions, a total of seven (7) register read ports and three (3) register write ports are required for the SRAM containing the integer register file 28 in UltraSPARC. Since no other possible combination of dispatched instructions could require more read ports or write ports, the above example represents a worst case cycle. Although not described herein, the SRAM containing the floating point register file 28 in UltraSPARC requires five (5) read ports and three (3) write ports. It is believed that the number of read and write ports for the SRAMs containing the register files in the Power PC, the Alpha chip, and other known processors is determined in a similar fashion using a worst case scenario.
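The worst case arithmetic for the integer register file can be checked with a short sketch. The names `UnitClass` and `worst_case_ports` are hypothetical, and two assumptions are made that the text does not state explicitly: that the load/store unit contributes at most one destination register write per cycle (which is consistent with the stated total of three write ports), and that read and write maxima can be taken independently per unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitClass:
    name: str
    count: int        # number of pipelines of this kind
    max_sources: int  # most source registers one instruction can read
    max_dests: int    # most destination registers one instruction can write


def worst_case_ports(units, issue_width):
    """Return (read_ports, write_ports) for the busiest possible cycle.

    At most `issue_width` instructions are dispatched per cycle, one per
    pipeline, so the most demanding pipelines are summed up to that limit.
    """
    reads, writes = [], []
    for unit in units:
        reads += [unit.max_sources] * unit.count
        writes += [unit.max_dests] * unit.count
    read_ports = sum(sorted(reads, reverse=True)[:issue_width])
    write_ports = sum(sorted(writes, reverse=True)[:issue_width])
    return read_ports, write_ports


# Pipelines that access the integer register file in the example above.
# The load/store max_dests value of 1 is an assumption, not from the text.
INTEGER_SIDE = [
    UnitClass("integer", count=2, max_sources=2, max_dests=1),
    UnitClass("load/store", count=1, max_sources=3, max_dests=1),
]

print(worst_case_ports(INTEGER_SIDE, issue_width=4))  # prints (7, 3)
```

The result matches the seven read ports (2 x 2 + 3) and three write ports (2 x 1 + 1) given for the integer register file. Adding pipelines or raising the issue width in this model increases the port counts directly, which is the scaling concern raised in the discussion that follows.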
A number of problems are associated with using multiport SRAMs to implement a register file 28 for a processor. For each cell in the SRAM array, a word line, two pass transistors and a differential bit line pair are needed for each read and write port in the array. As a result, the size or pitch of each memory cell is relatively large because of the number of word lines, bit lines, and pass transistors associated with each memory cell. The increased pitch of the individual cells means that the overall size of the memory array is larger and occupies a larger percentage of the area on the processor die. This detrimentally affects manufacturing yields of the processor, and drives up fabrication cost. The average time required to access a register value in the register file 28 is also adversely affected because of the longer word lines and bit lines that result from the overall larger size of the array. The number of pass transistors, word lines, and bit lines associated with each cell tends to increase the capacitive loading on each cell. The increased capacitive load on each cell makes it more difficult for the finite charge stored in each cell to drive the appropriate differential bit line pair. All the above problems are exacerbated by an increase in the number of pipelines in the processor and an increase in the maximum number of instructions that may be issued per cycle.
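To give a sense of scale, the per-cell overhead described above can be tallied with a short sketch. The function name `per_cell_overhead` is hypothetical, and the count simply applies the stated rule of one word line, two pass transistors, and one differential bit line pair (two bit lines) per port; it does not model area, capacitance, or timing.

```python
def per_cell_overhead(read_ports, write_ports):
    """Count the access structures attached to one multiport SRAM cell."""
    ports = read_ports + write_ports
    return {
        "word_lines": ports,            # one word line per port
        "pass_transistors": 2 * ports,  # two pass transistors per port
        "bit_lines": 2 * ports,         # one differential pair per port
    }


# Integer register file from the UltraSPARC example: 7 read, 3 write ports.
print(per_cell_overhead(read_ports=7, write_ports=3))
```

For the seven read ports and three write ports of the integer register file example, this gives ten word lines, twenty pass transistors, and twenty bit lines per cell, compared with one, two, and two for a single-ported cell.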
Several design trends are proliferating in the processor industry: larger issue processors; a greater number of pipelines in the processing unit; reduced cycle times; larger register files; and wider word widths. The implementation of the register file 28 in the SRAM memory array, with its complex read/write circuitry, relatively large size, and relatively slow access speeds, represents a substantial barrier to advancing each of these trends. In fact, the Applicants believe that the SRAM memory array as described above has created a design impediment that, in the next generation of processors, may discourage or even prevent further advancements in scalarity, an increase in word size, an increase in the size of the register file, and/or a reduction of cycle time.