1. Field of the Invention
This invention pertains generally to processor architecture, focusing on the register files used by execution units. More particularly, this invention is directed to an improved processor using a hierarchical register file architecture, where the hierarchical register files are visible at the macro-architecture level, facilitating improved performance and backwards compatibility in a processor instruction set.
2. The Prior Art
As reliance on computer systems has increased, so have demands on system performance. This has been particularly noticeable in the past decade, as both businesses and individual users have demanded far more than the simple character cell output on dumb terminals driven by the simple, non-graphical applications typically used in the past. With more sophisticated applications and internet use, the demands on the system, and in particular on the main processor, are increasing at a very high rate.
As is well known in the art a processor is used in a computer system, where the computer system as a whole is of conventional design using well known components. An example of a typical computer system is the Sun Microsystems Ultra 10 Model 333 Workstation running the Solaris v.7 operating system. Technical details of the example system may be found on Sun Microsystems' website.
A typical processor is shown in block diagram form in FIG. 1. Processor 100 contains a Prefetch And Dispatch Unit 122 which fetches and decodes instructions from main memory (not shown) through Memory Management Unit 110, Memory Interface Unit 118, and System Interconnect 120. In some cases, the instructions or their operands may be in non-local cache in which case Prefetch And Dispatch Unit 122 uses External Cache Unit 114 to access external cache RAM 116. Instructions that are decoded and waiting for execution may be stored in Instruction Cache And Buffer 124. Prefetch And Dispatch Unit 122 detects which type of instruction it has, and sends integer instructions to Integer Execution Unit 126 and floating point instructions to Floating Point Execution Unit 128. The instructions sent by Prefetch And Dispatch Unit 122 contain register addresses, typically two read locations and one write location, where the read locations are the values to be operated on and the write location is where the result will be stored.
FIG. 1 has one integer and one floating point execution unit. To improve performance parallel execution units were added. One parallel execution unit implementation is shown in FIG. 2. To avoid the confusion and surplus verbiage caused by the inclusion of non-relevant portions of the processor, FIG. 2 and the drawings following it show only the relevant portions of a processor. As will be appreciated by one of ordinary skill in the art, the portion of a processor shown is functionally integrated into the rest of a processor.
A register file, Integer Register File 200, is shown connected to Integer Execution Units 208 and 210 through Bypass Circuit 204. There may be any practicable number of additional integer execution units between Integer Execution Units 208 and 210. Another register file, Floating Point Register File 202, is shown connected to Floating Point Execution Units 212 and 214 through Bypass Circuit 206. As with the integer execution units, there may be any practicable number of additional floating point execution units between Floating Point Execution Units 212 and 214.
Bypass circuits are needed because one execution unit may attempt both to read a value from and to write a result to a particular register, or one execution unit may be reading a register in its corresponding register file while another is trying to write to the same register. Depending on the exact timing of the signals as they arrive over the data lines from one or both execution units, this can lead to indeterminate results. Bypass Circuits 204 and 206 detect this condition and arbitrate access. The correct value is sent to the execution unit executing a read, and the correct new value is written into the register.
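The arbitration described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the circuit of FIG. 2: the class name `BypassedRegisterFile` and its `cycle` method are hypothetical, and the model assumes that the "correct value" for a read colliding with a write in the same cycle is the newly written value, forwarded directly to the reader. The text does not specify the forwarding policy.

```python
class BypassedRegisterFile:
    """Hypothetical behavioral model of a register file behind a bypass circuit."""

    def __init__(self, num_registers=32):
        self.registers = [0] * num_registers

    def cycle(self, reads, writes):
        """Service one cycle of port traffic.

        reads:  list of register indices being read this cycle
        writes: dict mapping register index -> value arriving this cycle
        Returns the values delivered to the reading execution units.
        """
        # Assumption: a read that collides with a same-cycle write receives
        # the incoming value (forwarded) rather than the stale stored one.
        delivered = [
            writes[r] if r in writes else self.registers[r]
            for r in reads
        ]
        # The new values are then committed to the register file.
        for r, value in writes.items():
            self.registers[r] = value
        return delivered


rf = BypassedRegisterFile()
rf.registers[3] = 7
# One unit reads r3 and r4 while another writes 99 to r3 in the same cycle.
print(rf.cycle(reads=[3, 4], writes={3: 99}))
```

In hardware, every read port must be compared against every write port to detect such collisions, which is one way to see why the circuitry grows quickly as ports are added.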
The circuitry needed to do this is complex for more than one execution unit, being dependent on the number of register ports attached to one register file. Generally, the complexity of the bypass circuitry rises as the square of the number of register ports a register file has; for n register ports on a register file, the complexity of the bypass circuitry rises as n^2.
In addition to the complexity associated with the number of attached execution units and bypass circuitry, a primary bottleneck on the size of register files is the number of ports that must be made available to read and write the registers. The complexity associated with the number of ports is proportional to the square of the total number of ports on a register file. Since there are typically two read operations for every write operation (i.e., most instructions read two values from a register file and write a resulting value), register files typically have two read ports for every write port. If a register file has 8 read ports and 4 write ports, its relative order of complexity would be on the order of (8+4)^2 = 144 with 12 ports, when compared to other register files with other numbers of ports. Using the same register file and trying to increase its throughput by increasing the number of ports, as an example increasing the number of read ports by 4 and the number of write ports by 2, yields a relative order of complexity of (12+6)^2 = 324 with 18 ports. As an alternative, adding a duplicate of the original register file yields a relative order of complexity of (8+4)^2 + (8+4)^2 = 288 with 24 ports. Thus, using more register files with fewer ports per register file adds less complexity with more ports (for more throughput) than trying to increase the number of ports on a single register file.
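The port arithmetic in this comparison can be checked with a few lines of code. The sketch below assumes only the square-of-total-ports model stated in the text; the function names `relative_complexity` and `total_complexity` are illustrative and not part of any described implementation.

```python
def relative_complexity(read_ports, write_ports):
    """Relative complexity of one register file: (total ports) squared."""
    return (read_ports + write_ports) ** 2


def total_complexity(register_files):
    """Sum the relative complexity over a list of (read, write) port pairs."""
    return sum(relative_complexity(r, w) for r, w in register_files)


configurations = {
    "original file (8R + 4W)": [(8, 4)],
    "widened file (12R + 6W)": [(12, 6)],
    "two original files": [(8, 4), (8, 4)],
}

for label, files in configurations.items():
    ports = sum(r + w for r, w in files)
    print(f"{label}: {ports} ports, relative complexity {total_complexity(files)}")
```

Under this model the two-file arrangement provides 24 ports at a relative complexity of 288, while the single widened file provides 18 ports at 324.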
In addition to the complexity just discussed, there are other considerations that limit the size of register files. One problem is physically adding more address and data lines, and the extra length and longer propagation times associated with the extra length. This is a concern since a register file is usually doubled in size with each increase. The accompanying increase in the number of address and data lines, and the increase in individual lengths and associated propagation delays, run directly counter to the need to increase clock speeds in the processor.
Another problem is addressing the individual registers. To address each of 32 registers in a typical register file requires 5 bits. An example of this addressing may be found in Sun Microsystems' UltraSPARC II processor, technical details being available on Sun's website. Each instruction typically has addresses for two values to be read and operated on, and one address to write the resulting value into. Thus, for register files having 32 registers, a total of 15 bits (5 per address) must be allocated per instruction out of a limited number of bits available in each instruction. To add larger register files, for example to make the register files in an UltraSPARC II processor 64 registers long instead of 32 registers, requires that additional bits in each instruction be permanently allocated for addressing. In the case of register files with 64 registers, an additional address bit per address field is needed over register files with 32 registers, for a total of 3 additional bits per instruction. This is a real problem when improvements are being made to an existing architecture. Typically, each word in the existing instruction set is full (all the bits are in use), so no more bits can be allocated to addressing. Even if some instructions have unused bits, the extra address bits must be available in all instructions. If they are not, this causes other problems, such as adding considerable complexity to the microcode and losing backward compatibility.
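The addressing cost can be worked through numerically. The following sketch assumes the three-address instruction format described in the text (two source registers, one destination); the helper names `address_bits` and `operand_bits` are illustrative only.

```python
def address_bits(num_registers):
    """Bits needed to give each of num_registers registers a distinct address."""
    return max(1, (num_registers - 1).bit_length())


def operand_bits(num_registers, address_fields=3):
    """Bits consumed per instruction by the register address fields."""
    return address_fields * address_bits(num_registers)


for size in (32, 64):
    print(f"{size} registers: {address_bits(size)} bits per address, "
          f"{operand_bits(size)} bits per instruction")

extra = operand_bits(64) - operand_bits(32)
print(f"extra bits per instruction when going from 32 to 64 registers: {extra}")
```

Each doubling of the register file adds one bit per address field, so a three-address instruction loses three more bits of its fixed-width encoding to register addressing.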
For the reasons just discussed, adding register file space by increasing the size of the register file is not practical.
In spite of the problems just discussed, the increased parallelism achieved by connecting multiple execution units to one register file has added pressure to increase the number of registers available. Each execution unit may need one or more registers, depending on the instructions and operands it is using. This leads to contention for register space between the execution units, and limits the number of execution units that can be connected before there are diminishing returns due to the lack of registers available.
Thus, there are restrictions that necessitate keeping register files at their current size, yet there is a tremendous need for more locally available registers as well.
It is therefore a goal of this invention to provide a method and system for increasing the throughput of execution units connected to register files by increasing the number of locally available registers. This increase in the number of locally available registers must be achieved without increasing the size of the register files currently in use.