1. Field of the Invention
This invention pertains generally to processor architecture, and in particular to execution units. More particularly, this invention is directed to an improved processor using clustered groups of execution units visible at the macro-architecture level, facilitating improved parallelism and backwards compatibility in a processor instruction set.
2. The Prior Art
As reliance on computer systems has increased so have demands on system performance. This has been particularly noticeable in the past decade as both businesses and individual users have demanded far more than the simple character cell output on dumb terminals driven by simple, non-graphical applications typically used in the past. Coupled with more sophisticated applications and internet use, the demands on the system and in particular the main processor are increasing at a very high rate.
As is well known in the art, a processor is used in a computer system, where the computer system as a whole is of conventional design using well known components. An example of a typical computer system is the Sun Microsystems Ultra 10 Model 333 Workstation running the Solaris v.7 operating system. Technical details of the example system may be found on Sun Microsystems' website.
A typical processor is shown in block diagram form in FIG. 1. Processor 100 contains a Prefetch And Dispatch Unit 122 which fetches and decodes instructions from main memory (not shown) through Memory Management Unit 110, Memory Interface Unit 118, and System Interconnect 120. In some cases, the instructions or their operands may be in non-local cache in which case Prefetch And Dispatch Unit 122 uses External Cache Unit 114 to access external cache RAM 116. Instructions that are decoded and waiting for execution may be stored in Instruction Cache And Buffer 124. Prefetch And Dispatch Unit 122 detects which type of instruction it has, and sends integer instructions to Integer Execution Unit 126 and floating point instructions to Floating Point Execution Unit 128. The instructions sent by Prefetch And Dispatch Unit 122 to Integer Execution Unit 126 contain register addresses, typically two read locations and one write location, where the read locations are the values to be operated on and the write location is where the result will be stored.
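The type-based steering described above can be illustrated with a minimal sketch. It is a software model only, not the circuit of FIG. 1; the `Instruction` fields, the `FLOAT_OPCODES` set, and the function names are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    opcode: str  # hypothetical mnemonic, e.g. "add" or "fadd"
    src1: int    # first read register address
    src2: int    # second read register address
    dest: int    # write register address


# Hypothetical opcode set, assumed only for this illustration.
FLOAT_OPCODES = {"fadd", "fsub", "fmul", "fdiv"}


def steer(instr):
    """Pick an execution unit from the instruction type alone."""
    return "floating_point" if instr.opcode in FLOAT_OPCODES else "integer"


def dispatch(stream):
    """Sort a stream of decoded instructions into per-unit queues."""
    queues = {"integer": [], "floating_point": []}
    for instr in stream:
        queues[steer(instr)].append(instr)
    return queues
```

Each instruction carries two read addresses and one write address, matching the typical register addressing described in the text.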
FIG. 1 has one integer and one floating point execution unit. To improve performance parallel execution units were added. One parallel execution unit implementation is shown in FIG. 2. To avoid the confusion and surplus verbiage caused by the inclusion of non-relevant portions of the processor, FIG. 2 and subsequent drawings show only the relevant portions of a processor. As will be appreciated by one of ordinary skill in the art, the portion of a processor shown is functionally integrated into the rest of a processor.
Integer Register File 200 is used by Integer Execution Units 208 and 210, as well as any other integer execution units that could be connected. Floating Point Register File 202 is used by Floating Point Execution Units 212 and 214, as well as any other floating point execution units that could be connected. Also shown are Bypass Circuits 204 and 206. Bypass circuits are needed because one execution unit can attempt both a read and a write to a particular register, or one execution unit may be reading a register in its corresponding register file while another is trying to write to the same register. Depending on the exact timing of the signals as they arrive over the data lines from one or both execution units, this can lead to indeterminate results. Bypass Circuits 204 and 206 detect this condition and arbitrate access. The correct value is sent to the execution unit executing a read, and the correct new value is written into the register. The circuitry needed to do this becomes complex when more than one execution unit is attached.
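The role of a bypass circuit can be sketched in software as follows. This is a simplified model under a stated assumption: the text does not say which value counts as "correct" for a conflicting read, so the sketch assumes the reader receives the value being written in the same cycle. The function name and data layout are hypothetical.

```python
def cycle_with_bypass(registers, writes, reads):
    """Model one cycle of register access with bypassing.

    registers: list of current register values (updated in place)
    writes:    dict mapping register address -> new value
    reads:     list of register addresses read in this cycle
    Returns the values delivered to the readers, in order.
    """
    # Assumption: a read that collides with a write receives the new value
    # directly, rather than whatever happens to be in the register.
    delivered = [writes[addr] if addr in writes else registers[addr]
                 for addr in reads]
    # The new values are then committed to the register file.
    for addr, value in writes.items():
        registers[addr] = value
    return delivered
```

For example, with registers `[10, 20, 30]`, a write of 99 to address 1, and reads of addresses 0 and 1, the readers receive `[10, 99]` and the register file ends the cycle as `[10, 99, 30]`.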
Additional execution units need additional register ports to read and write the register files. The complexity of the bypass circuitry rises as the square of the number of register ports attached; for n register ports on a register file, the complexity of the bypass circuitry rises as n^2. Thus, having too many execution units attached to a register file will slow performance due to the additional complexity of the register file's support circuitry.
Referring now to complexity in general, complexity is an abstract metric of the cost of implementing a given mechanism or feature. Complexity translates most directly into the size of the needed circuits. Higher complexity also correlates with higher latency in the circuitry for most circuits, and higher latency means decreased performance. This means it is generally critical to keep complexity to a minimum; otherwise performance begins to decrease which almost always defeats the purpose of the added circuitry.
In addition to the complexity associated with the number of attached execution units and bypass circuitry, a primary bottleneck on the size of register files is the number of ports that must be made available to read and write the registers. The complexity associated with the number of ports is proportional to the square of the total number of ports on a register file. Since there are typically two read operations for every write operation (i.e., most instructions read two values from a register file and write a resulting value), register files typically have two read ports for every write port. If a register file has 8 read ports and 4 write ports, its relative complexity would be on the order of (8+4)^2 = 144 with 12 ports, when compared to other register files with other numbers of ports. Using the same register file but trying to increase its throughput by increasing the number of read ports by 4 and the number of write ports by 2 yields a relative complexity of (12+6)^2 = 324 with 18 ports. As an alternative, adding a duplicate of the original register file yields a relative complexity of (8+4)^2 + (8+4)^2 = 288 with 24 ports. Thus, using more register files with fewer ports per register file adds less complexity with more ports (for more throughput) than trying to increase the number of ports on a single register file.
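The port arithmetic above can be checked with a short calculation. The function name `port_complexity` is hypothetical, and the measure is the relative, unitless figure used in the text (total ports squared, summed over register files), not a physical cost model.

```python
def port_complexity(read_ports, write_ports, copies=1):
    """Relative complexity: (total ports per file)^2, summed over all files."""
    return copies * (read_ports + write_ports) ** 2


single = port_complexity(8, 4)                # 12 ports -> 144
widened = port_complexity(12, 6)              # 18 ports -> 324
duplicated = port_complexity(8, 4, copies=2)  # 24 ports -> 288
```

The duplicated configuration offers more ports than the widened one (24 against 18) at a lower relative complexity (288 against 324).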
The desirable goal of making more registers visible to the programmer and/or compiler is also difficult. In addition to other complexity considerations, the complexity of any register file grows linearly as the number of visible registers grows. To address additional visible registers, more bits in each instruction are needed. This is often not possible given the limited encoding space (field size) of existing instruction set architectures, or is prohibitively expensive in terms of complexity and cost for new instruction sets.
A new architecture was introduced to address some of the complexity issues associated with the need for increased throughput of the register files. It is based on the principle that many ports can be physically implemented with multiple smaller register files. Each smaller register file has the same number of total write ports the single register file implementation would have, but a smaller number of read ports. When an implementation uses more than one physical register file, all the register files that take the place of the single register file are copies of one another. Since the register files are all copies of one another, a write to any one location in one register file is actually performed as a parallel write to all the small register files. Thus, the number of write ports would stay roughly the same when compared to a large register file. However, the number of read ports may be reduced, as only local execution units would read from a given register file rather than all the execution units. This reduces the number of reads going through any given register file, requiring fewer read ports per register file, and therefore fewer total read ports, when compared to a single large register file. This is an additional complexity savings over that already discussed. Continuing with the example started in the paragraph before last, a single 8-read, 4-write register file would not actually be replaced with one 12-read, 6-write register file or with two 8-read, 4-write register files; rather, it would be replaced with two 4-read, 4-write register files. The complexity measure of the two smaller register files would now be 2*(4+4)^2 = 128. Compare this with a complexity rating of (12+6)^2 = 324 for the widened register file, or 2*(8+4)^2 = 288 for the duplicated register file. In this example, using two smaller register files adds register ports at the lowest complexity of the three options. It is important to remember that the smaller register files function like a single register file from the perspective of the programmer or compiler.
Thus, multiple smaller register files do not address other issues such as the complexity associated with making more visible registers available to a user.
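The replicated register file scheme described above can be sketched as a small software model. The class and method names are hypothetical; the sketch only shows the two properties the text relies on, namely that every write is broadcast to all copies and that each execution unit reads only its local copy.

```python
class ReplicatedRegisterFile:
    """Several physical copies that behave as one logical register file."""

    def __init__(self, num_registers, num_copies):
        self.copies = [[0] * num_registers for _ in range(num_copies)]

    def write(self, addr, value):
        # A write to one location is performed in parallel on every copy,
        # so all copies always hold identical contents.
        for copy in self.copies:
            copy[addr] = value

    def read(self, copy_index, addr):
        # An execution unit reads only from the copy it is attached to.
        return self.copies[copy_index][addr]


rf = ReplicatedRegisterFile(num_registers=8, num_copies=2)
rf.write(3, 42)
values_seen = [rf.read(0, 3), rf.read(1, 3)]  # both copies return 42
```

Because the copies are identical, the programmer or compiler still sees a single register file with the same number of visible registers.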
The bypass circuitry can be made hierarchical at the granularity of the replicated register files to reduce its complexity as well. However, in this case the complexity reduction comes with a potential performance penalty. If there are any dependencies between instructions running on different execution units the processor may stall waiting for a completion instead of being able to bypass values.
Generally, instructions are steered to an execution unit by the hardware based on the type of instruction it is (e.g., integer or floating point). The programmer or the compiler, given the view of a single uniform register file, has no control over the steering of instructions.
Making maximum use of the above results, processors were designed with multiple register files coupled to multiple execution units. This architecture is shown in FIG. 3. A series of register files is implemented, divided into two groups. Group one is shown starting at Register File 300 and ending with Register File 304; group two is shown starting at Register File 320 and ending with Register File 324. A plurality of register files exist between Register Files 300 and 304, and between Register Files 320 and 324. Each of the two groups of register files is assigned to one type of execution unit.
Group one, comprising Register Files 300 through 304, is connected to integer execution units. Integer Execution Units 308 and 310 are shown connected to Register File 300. There will typically be more integer execution units implemented between Integer Execution Units 308 and 310, all connected to Register File 300. Bypass Circuit 302 handles contention and data integrity issues arising from multiple simultaneous accesses to the same location in the address space of Register File 300.
For each register file between and including Register Files 300 and 304, there will be a bypass circuit and a set of integer execution units, as explained in the paragraph above.
The second group of register files, shown as Register Files 320 and 324 and including further register files between them, are each connected to a number of floating point execution units.
Register File 320 is shown connected to Floating Point Execution Units 330 and 332. There may be further floating point execution units implemented between the two shown. Bypass Circuit 322 handles the contention and data integrity issues by detecting attempted simultaneous reads/writes to the same address in Register File 320, arbitrating among all the floating point execution units to which Register File 320 is attached.
The functional unit just described containing Register File 320, Bypass Circuit 322, and at least Floating Point Execution Units 330 and 332, is duplicated a number of times. The last functional unit is shown as Register File 324, Bypass Circuit 326, and at least Floating Point Execution Units 334 and 336. There will ordinarily be more of these functional units between the first and last just described.
It must be emphasized that all the integer register files function like a single integer register file when viewed from outside the processor, and that all the floating point register files function like a single floating point register file when viewed from outside the processor. The visible external difference between processors implementing an architecture exemplified by FIG. 3 and an architecture as exemplified in FIG. 1 is better throughput; the architectural differences (multiple register files, multiple execution units) are not seen.
From FIG. 1, Prefetch And Dispatch Unit 122 loads the same values in the same relative locations in all integer register files, Register Files 300 and 304 in FIG. 3. Prefetch And Dispatch Unit 122 likewise loads the same values in the same relative locations in all floating point register files, Register Files 320 and 324. The two register file groups are different because values for different instructions are sent to each group: all integer values to one and all floating point values to the other. The dotted-line boxes outline the register files that are copies of each other and the execution units that are of the same type. Dotted-line Box 340 encloses the integer register files while Dotted-line Box 342 encloses the floating point register files. Similarly, all the execution units within Dotted-line Box 350 are integer execution units, and those in Dotted-line Box 352 are all floating point execution units. As viewed from outside the processor, Dotted-line Boxes 340 and 350 function like the single Register File 138 and single Execution Unit 126 from FIG. 1, and Dotted-line Boxes 342 and 352 function like the single Register File 136 and single Execution Unit 128 from FIG. 1.
Although this provides increased parallelism by allowing more execution units to operate in parallel at the instruction level, the addition of register files within Dotted-line Boxes 340 or 342 and the addition of execution units within Dotted-line Boxes 350 or 352 are invisible at the macro-architecture level. Anything not inherently parallel at the instruction level cannot make use of any additional execution units or register files.
The prior art methods used to increase throughput by increasing parallelism have reached a limit. The size of individual register files is at an upper bound due to the complexity discussed above as well as the problem of adding addressing bits within an instruction, which would be required for larger register files; the number of execution units that can be connected to each register file is at an upper limit due to limits on throughput and connectivity complexity; and the number of register file/execution unit combinations reaches an upper bound due to diminishing returns on adding parallelism that can only be exercised on a per instruction level.
Given the ever increasing demand to increase system throughput and therefore processor throughput, there is an urgent need to identify and make useful any additional parallelism that can be found within the processor, at the instruction level and process level. There is an additional need to make this increase in throughput available to both legacy software and new software.
It is therefore a goal of this invention to provide a method and system for increasing processor throughput by increasing the parallelism available within a processor. It is a further goal of this invention to make the improved parallelism available to both legacy and new software.
The present invention significantly increases parallelism in a processor by implementing a new architectural feature called a register domain. A register domain is a single logical register file and the execution units coupled to it, where the execution units may be of mixed types (integer and floating point). Each register domain's logical register file is an independent set of registers from all other logical register files. In a significant departure from prior art processors, register domains are visible to the user, who may direct individual instructions or instruction streams to the register domain of the user's choice. The combination of the register domains and their direct controllability by users greatly increases parallelism within the processor as well as the parallelism available to the user as compared to traditional processors.
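A rough software analogy may help picture the register domain concept. The sketch below is illustrative only and is not the claimed implementation: the `RegisterDomain` class, its `execute` method, and the way a domain is selected are all assumptions made for this example. It models only the two properties stated above, that each domain has its own independent registers and that the user chooses the domain for each instruction.

```python
import operator


class RegisterDomain:
    """One logical register file plus the execution units coupled to it."""

    def __init__(self, num_registers):
        # Each domain owns a set of registers independent of all other domains.
        self.registers = [0] * num_registers

    def execute(self, op, src1, src2, dest):
        # Mixed unit types: the same domain accepts integer or floating
        # point operations (modeled here as ordinary Python callables).
        self.registers[dest] = op(self.registers[src1], self.registers[src2])


# Two hypothetical domains; the caller picks the domain per instruction.
domains = [RegisterDomain(4), RegisterDomain(4)]
domains[0].registers[0:2] = [2, 3]
domains[1].registers[0:2] = [1.5, 2.0]

domains[0].execute(operator.add, 0, 1, 2)  # integer add in domain 0
domains[1].execute(operator.mul, 0, 1, 2)  # floating point multiply in domain 1
```

The same register address (here, register 2) holds a different value in each domain, which is the opposite of the replicated register files of the prior art, where every copy holds identical contents.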