1. Technical Field
The present invention generally relates to register files, and more particularly, to methods and apparatus for employing multi-bit register file cells for simultaneous multi-threading (SMT) thread groups. Moreover, methods and apparatus are provided for generating specifiers based on thread groups for use with the preceding methods and apparatus, in microprocessors with and without register renaming.
2. Description of the Related Art
Microprocessor technology has relied aggressively on the use of instruction-level parallelism (ILP) techniques, in the form of deep pipelining and parallel superscalar instruction issue to increase the performance of microprocessor systems. Typically, additional mechanisms such as out-of-order execution and register renaming are also supplied in many implementations to further increase the performance of microprocessors and exploit the ILP potential offered by deep pipelining and wide superscalar issue capability.
However, despite the gains from additional techniques such as out-of-order execution and register renaming, the execution capabilities made available by ILP techniques such as deep pipelining and parallel superscalar issue far exceed what any single application thread can exploit, leaving a large number of resources unused.
To better utilize the resources available in a high ILP processor, hardware-based multithreading schemes were introduced. For example, in one prior art hardware-based multi-threading scheme, two hardware threads can be alternated to cover long latency events in threads, such as cache misses.
A more advanced hardware-multithreading scheme is simultaneous multithreading (SMT), as implemented by the IBM POWER5 microprocessor, and described by Sinharoy et al. in “POWER5 SYSTEM MICROARCHITECTURE”, IBM Journal of Research and Development, Vol. 49, Issue 4/5, July, 2005 (hereinafter referred to as “Sinharoy”), the disclosure of which is incorporated by reference herein. In SMT-based implementations, instructions from multiple threads can be issued simultaneously.
Thus, hardware-based multi-threading has become an important performance enhancer for microprocessors by exploiting underutilized resources. While supporting multiple threads is advantageous, it also requires a significant increase in the state storage of microprocessors, to hold register values and remap tables for a plurality of threads. This increase in resources increases both the latency and the area of a microprocessor design.
Turning to FIG. 1A, a register file in accordance with the prior art is indicated generally by the reference numeral 100. The register file 100 includes write ports 110, at least one storage element 120, and read ports 130. In one embodiment of a microprocessor with hardware-based multi-threading, this structure is used to support more than one thread by increasing the number of storage elements, while maintaining the architecture of read and write ports.
Turning to FIG. 1B, a two-threaded storage array of conventional storage cells for supporting multiple threads, in accordance with the prior art, is indicated generally by the reference numeral 150. It is to be appreciated that while some of the elements of FIG. 1B, as well as other figures herein, are described as sets while showing only one member, such a set may have more than one member, while maintaining the spirit of the present principles.
The two-threaded storage array 150 includes a first set of write ports 152, a first set of storage elements 154, and a first set of read ports 156, used to store and access data corresponding to a first thread (thread 0).
The two-threaded storage array 150 also includes a second set of write ports 158, a second set of storage elements 160, and a second set of read ports 162, used to store and access data corresponding to a second thread (thread 1).
Turning to FIG. 2, an arrangement of six read ports for a register file, in accordance with the prior art, is indicated generally by the reference numeral 200.
The arrangement of read ports 200 corresponds to one implementation of register file 100 shown in FIG. 1A. A plurality of bit (storage) cells 210 corresponding to a bit position (shown as bit 0 in exemplary fashion) of registers R00 to R31 are connected to read-multiplexers 220 of read ports (6 read ports shown in exemplary fashion), each being implemented with selection logic 230 to select data from the plurality of bit cells 210. This design is flexible, and allows each read port to select any register for a read, but leads to significant wiring needs which increase the register file area.
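The arrangement described above can be sketched behaviorally as follows. This is a minimal illustration only, assuming one bit position of 32 registers and 6 read ports as in the exemplary figure; the function names and the simple connection count are assumptions made for this sketch and are not taken from the cited design.

```python
NUM_REGISTERS = 32   # registers R00 to R31
NUM_READ_PORTS = 6   # exemplary number of read ports

def read_mux(bit_cells, select):
    """One read multiplexer: return the bit cell of the selected register."""
    if not 0 <= select < len(bit_cells):
        raise ValueError("register select out of range")
    return bit_cells[select]

def read_all_ports(bit_cells, selects):
    """Every read port selects independently from the same set of bit cells."""
    return [read_mux(bit_cells, s) for s in selects]

def mux_inputs_per_bit(num_registers=NUM_REGISTERS,
                       num_read_ports=NUM_READ_PORTS):
    """Rough count of cell-to-multiplexer connections for one bit position."""
    return num_registers * num_read_ports

# Bit 0 of each register; the alternating pattern is arbitrary test data.
bit0_cells = [r % 2 for r in range(NUM_REGISTERS)]
port_values = read_all_ports(bit0_cells, [0, 1, 2, 3, 30, 31])
```

Under these assumptions each bit position needs 32 x 6 = 192 multiplexer inputs, which gives a feel for why the fully flexible arrangement is wiring-intensive.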
Turning to FIG. 3, a register rename architecture for multiple threads for use in conjunction with a hardware multi-threaded microprocessor, in accordance with the prior art, is indicated generally by the reference numeral 300. Thread number 310 and instruction-specified register number 320 are used to rename a specific register with register mapper 330 to any of a plurality of physical registers in a unified physical register file 340, wherein any given register can be allocated to any thread. A set of read ports 350 is used to select for each renamed register the specified physical register based on renaming by mapper 330 from any of the registers in the unified physical register file.
It will be understood that the phrase “unified physical register file” as used herein describes the unified architecture with respect to storing data from multiple threads, and not to ISA characteristics, such as whether a specific architecture supports separate register files for different data types.
One prior art approach involves a multi-threaded memory for a microprocessor, wherein the write interface and the read interface of a multi-threaded memory cell select among the per-thread contents held inside a register cell, based at least in part on an identified thread.
Turning to FIG. 4, a two-threaded storage cell with a multi-threaded read port, in accordance with the prior art, is indicated generally by the reference numeral 400. In accordance with the design of FIG. 4, storage cells have independent write ports 402 and 406 for writing to two storage elements 404 and 408 that store respective data for a first and a second thread. Thread select logic 410 selects between storage elements corresponding to a first and second thread, and provides the data to a group of read ports 414.
Advantageously, this design reduces the routing resources required, since only one signal wire is needed for two storage cells under the control of thread select logic 410. However, this design also disadvantageously limits the read ports such that all reads correspond to a single thread. While this has not been a limitation for all prior art microprocessors with hardware multi-threading, this type of design optimization has not been applicable to simultaneous multithreading systems, wherein a first instruction may read from storage element 404 and a second instruction may require read access to data in storage element 408. Disadvantageously, this design also cannot be used in conjunction with traditional register renaming architectures, such as those in accordance with FIG. 3.
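The single-thread read restriction can be illustrated with a small behavioral model. This is a sketch under stated assumptions: the class name, the 32-register size, and the method signatures are hypothetical and chosen only for illustration; the only property taken from the text is that one thread select value governs every read port.

```python
class ThreadSelectedRegisterFile:
    """Each register holds two storage elements, one per thread (as in FIG. 4)."""

    def __init__(self, num_registers=32):
        # cells[reg][thread]: per-thread storage elements of one register
        self.cells = [[0, 0] for _ in range(num_registers)]

    def write(self, thread, reg, value):
        """Independent write path for each thread's storage element."""
        self.cells[reg][thread] = value

    def read(self, thread_select, regs):
        """All read ports share one thread select, so every value returned
        in a given access comes from the same thread."""
        return [self.cells[r][thread_select] for r in regs]

rf = ThreadSelectedRegisterFile()
rf.write(0, 5, 100)   # thread 0, register 5
rf.write(1, 5, 200)   # thread 1, register 5
rf.write(1, 6, 300)   # thread 1, register 6

thread0_reads = rf.read(0, [5, 6])
thread1_reads = rf.read(1, [5, 6])
```

In this model there is no way to obtain thread 0's register 5 and thread 1's register 6 in the same access, which is the limitation noted above for simultaneous issue from different threads.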
Turning to FIG. 5A, a method for generating register file addresses in a processor with hardware-multi-threading that does not implement register renaming and in conjunction with a conventional register file, in accordance with the prior art, is indicated generally by the reference numeral 500. The method starts with step 510.
In step 510, a thread identifier is used in conjunction with a per-thread register number to generate a processor-wide unique register number, and control is passed to step 520. In accordance with one implementation of this step, concatenation of the thread identifier and register specifiers is performed, as expressed by the following VHDL:
    FRA_ADDR <= TID & FRA_FIELD;
    FRB_ADDR <= TID & FRB_FIELD;
    FRC_ADDR <= TID & FRC_FIELD;
    FRT_ADDR <= TID & FRT_FIELD;
In the example, and in accordance with an exemplary implementation of the Power Architecture, the FRA_FIELD, FRB_FIELD, FRC_FIELD, FRT_FIELD variables correspond to the per-thread fields extracted from exemplary 5-bit operand fields in the instruction word, or microcode ROM, or generated by instruction cracking, or otherwise obtained. The thread identifier (TID) variable furthermore corresponds to the currently active thread's thread ID, e.g., an exemplary 2-bit vector, and the FRA_ADDR, FRB_ADDR, FRC_ADDR, and FRT_ADDR vectors are 7-bit vectors uniquely specifying an entry in a 128-entry register file.
In step 520, the processor-wide register number is used to perform a read access from and/or a write access to a register file capable of storing registers for a plurality of threads, and the method is terminated.
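The address generation of steps 510 and 520 can be sketched in software as follows. This is an illustrative model only, assuming the exemplary 2-bit thread identifier and 5-bit operand fields; the function name `register_address` and the list used as a register file are assumptions of this sketch.

```python
TID_BITS = 2     # exemplary 2-bit thread identifier
FIELD_BITS = 5   # exemplary 5-bit operand field

def register_address(tid, field):
    """Concatenate TID (high bits) and a register field (low bits),
    mirroring the VHDL expression TID & FRx_FIELD."""
    if not 0 <= tid < (1 << TID_BITS):
        raise ValueError("thread identifier out of range")
    if not 0 <= field < (1 << FIELD_BITS):
        raise ValueError("register field out of range")
    return (tid << FIELD_BITS) | field

# A 7-bit address selects one of 128 entries.
register_file = [0] * (1 << (TID_BITS + FIELD_BITS))

# Step 510: form processor-wide register numbers for thread 2.
fra_addr = register_address(2, 3)
frt_addr = register_address(2, 31)

# Step 520: write to, then read from, the shared register file.
register_file[frt_addr] = 42
value_read = register_file[frt_addr]
```

Because the thread identifier occupies the high-order bits, each thread's 32 registers occupy a contiguous, non-overlapping block of the 128-entry file.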
Turning to FIG. 5B, a method for generating register file addresses in a processor with hardware-multi-threading that implements register renaming and in conjunction with a conventional register file, in accordance with the prior art, is indicated generally by the reference numeral 550. The method 550 starts with step 560.
In step 560, a thread identifier is used in conjunction with a per-thread register number to generate a processor-wide unique register number, and control is passed to step 570. In accordance with one implementation of this step, concatenation of the thread identifier and register specifiers is performed, as expressed by the following VHDL:
    FRA_ADDR <= TID & FRA_FIELD;
    FRB_ADDR <= TID & FRB_FIELD;
    FRC_ADDR <= TID & FRC_FIELD;
    FRT_ADDR <= TID & FRT_FIELD;
In the example, and in accordance with an exemplary implementation of the Power Architecture, the FRA_FIELD, FRB_FIELD, FRC_FIELD, FRT_FIELD variables correspond to the per-thread fields extracted from exemplary 5-bit operand fields in the instruction word, or microcode ROM, or generated by instruction cracking, or otherwise obtained. The TID variable furthermore corresponds to the currently active thread's thread ID, e.g., an exemplary 2-bit vector, and the FRA_ADDR, FRB_ADDR, FRC_ADDR, and FRT_ADDR vectors are 7-bit vectors uniquely specifying one of 128 logical registers, corresponding to the architected state of 4 threads.
In step 570, the exemplary 128 logical registers are renamed in accordance with a rename method, generating a unique physical register name in a register file having more than 128 entries, and control is passed to step 580. In accordance with this implementation, a register mapper is not cognizant of the threaded nature of the processor, and can dynamically assign any physical register to hold a logical register from any of the four threads.
In step 580, the processor-wide physical register number is used to perform at least one of a read and write access to a physical register file capable of storing registers for a plurality of threads, and the method is terminated.
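The three steps above can be sketched as a small behavioral model. Several details here are assumptions made for illustration and do not come from the text: the physical register count of 160, the reset mapping in which logical register i starts in physical register i, the in-order free list, and the omission of any policy for returning physical registers to the free list.

```python
FIELD_BITS = 5       # exemplary 5-bit operand field
NUM_LOGICAL = 128    # 4 threads x 32 architected registers
NUM_PHYSICAL = 160   # assumption: any size above 128 would serve

def logical_register(tid, field):
    """Step 560: concatenate TID and operand field (TID & FRx_FIELD)."""
    return (tid << FIELD_BITS) | field

class ThreadUnawareMapper:
    """Step 570: maps logical names to physical names with no notion of threads."""

    def __init__(self):
        # Assumed reset state: logical register i lives in physical register i.
        self.table = list(range(NUM_LOGICAL))
        self.free = list(range(NUM_LOGICAL, NUM_PHYSICAL))

    def rename_source(self, logical):
        """Look up the physical register currently holding a logical register."""
        return self.table[logical]

    def rename_target(self, logical):
        """Allocate a fresh physical register for a newly written logical register."""
        if not self.free:
            raise RuntimeError("no free physical register")
        physical = self.free.pop(0)
        self.table[logical] = physical
        return physical

mapper = ThreadUnawareMapper()
physical_file = [0] * NUM_PHYSICAL

# Step 580: thread 0 and thread 3 each write their own register 3.
t0_target = mapper.rename_target(logical_register(0, 3))
t3_target = mapper.rename_target(logical_register(3, 3))
physical_file[t0_target] = 11
physical_file[t3_target] = 33
```

In this sketch the two writes land in adjacent physical registers even though they belong to different threads, reflecting that the mapper may place any thread's logical register in any physical register.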
Turning to FIG. 14A, a prior art instruction scheduling method commonly used in conjunction with the prior art register file of FIG. 4, is indicated generally by the reference numeral 1400. The method begins with test 1402.
In test 1402, the thread number is tested. If the thread number corresponds to a first thread number, control passes to step 1404. Otherwise, control passes to step 1406.
In step 1404, a first instruction is issued corresponding to thread 0 to a first issue slot if a ready instruction is available for thread 0, and control is passed to step 1405.
In step 1405, a second instruction is issued corresponding to thread 0 to a second issue slot if a ready instruction is available for thread 0, and the method is terminated.
In step 1406, a first instruction is issued corresponding to thread 1 to a first issue slot if a ready instruction is available for thread 1, and control is passed to step 1407.
In step 1407, a second instruction is issued corresponding to thread 1 to a second issue slot if a ready instruction is available for thread 1, and the method is terminated.
Those skilled in the art will understand the limitations and disadvantages in requiring a first and second instruction to be from the same thread. Those skilled in the art will also understand the limitations of this approach due to a lack of register renaming capability.
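The same-thread restriction of this scheduling method can be illustrated with a short sketch. The function name, the representation of ready instructions as per-thread lists (oldest first), and the instruction labels are assumptions made for illustration.

```python
def issue_single_thread(thread_number, ready):
    """Fill two issue slots using only instructions of the selected thread.

    ready maps a thread number to its list of ready instructions.
    An empty slot is reported as None.
    """
    queue = list(ready.get(thread_number, []))  # copy; leave the input intact
    first = queue.pop(0) if queue else None     # first issue slot
    second = queue.pop(0) if queue else None    # second issue slot
    return [first, second]

ready = {0: ["t0_add"], 1: ["t1_load", "t1_mul"]}

slots_for_thread0 = issue_single_thread(0, ready)
slots_for_thread1 = issue_single_thread(1, ready)
```

When thread 0 is selected in this example, the second slot stays empty even though thread 1 has ready instructions, which is the kind of lost issue opportunity the same-thread requirement causes.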
Turning to FIG. 14B, a prior art instruction scheduling method commonly used for SMT processing in conjunction with the register file of FIG. 1A, and where instruction operands have been renamed in accordance with the method of FIG. 5B, is indicated generally by the reference numeral 1410.
The method starts with step 1412.
In step 1412, a first instruction is issued to a first issue slot if an instruction is ready for any thread, and control is passed to step 1413.
In step 1413, a second instruction is issued to a second issue slot if an instruction is ready for any thread, and the method is terminated.
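Steps 1412 and 1413 can be sketched as follows. The text does not state how a thread is chosen when several have ready instructions, so the policy used here (scan threads in ascending order and take the first ready instruction) is purely an assumption, as are the function name and the instruction labels.

```python
def issue_any_thread(ready, num_slots=2):
    """Fill each issue slot with a ready instruction from any thread.

    ready maps a thread number to its list of ready instructions.
    Returns one (thread, instruction) pair per slot, or None if idle.
    """
    queues = {t: list(q) for t, q in ready.items()}  # copy; leave input intact
    slots = []
    for _ in range(num_slots):
        issued = None
        for thread in sorted(queues):       # assumed policy: lowest thread first
            if queues[thread]:
                issued = (thread, queues[thread].pop(0))
                break
        slots.append(issued)
    return slots

ready = {0: ["t0_add"], 1: ["t1_load", "t1_mul"]}
mixed_slots = issue_any_thread(ready)
```

With the same ready instructions that left a slot idle under single-thread issue, both slots are filled here, one from each thread, at the cost of requiring register file ports that can reach any thread's registers.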
While these methods allow the use of unmodified register file and register rename structures, they require support for arbitrary combinations of register file accesses from each port to any of the registers. While this affords flexibility and allows the use of thread-unaware register files and register mappers, it leads to wasteful designs with large area and delay.