1. Field of the Invention
This invention relates to computing systems, and more particularly, to increasing processor throughput by decreasing a loop critical path.
2. Description of the Relevant Art
The demand for ever-increasing processor throughput, measured as the number of instructions retired per clock cycle (IPC), has been addressed with different techniques. For a given clock frequency, one approach to increasing processor throughput is superscalar processing, which allows multiple instructions to be processed in the same pipeline stage per clock cycle. Generally speaking, assuming instructions do not experience data hazards or other pipeline stalls, a particular processor architecture that is able to dispatch, decode, issue, execute, and retire 3 instructions per clock cycle has triple the throughput of a processor that does not implement superscalar processing. In actual operation, instructions do experience pipeline stalls. Therefore, the actual throughput will vary depending on the microarchitecture of the processor and the software application(s) being executed.
In addition to out-of-order issue of instructions to execution units within a superscalar microarchitecture, register renaming is another method that increases processor throughput. Register renaming dynamically renames register destination and source operands via the hardware. Register renaming reduces name dependences and allows a higher level of parallelization in code execution.
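As an illustration only, register renaming can be sketched in a few lines of Python. The function name `rename`, the tuple encoding of instructions, and the unbounded supply of physical registers are assumptions made for this sketch; a hardware implementation would use a finite free list and reclaim registers at retirement.

```python
def rename(instructions, num_arch_regs=4):
    """Map architectural registers to fresh physical registers.

    Each instruction is a hypothetical (dest, src1, src2) tuple of
    architectural register indices. Returns the same instructions
    expressed in physical register indices.
    """
    # Initially, architectural register r maps to physical register r.
    mapping = {r: r for r in range(num_arch_regs)}
    next_phys = num_arch_regs  # assumption: unlimited physical registers
    renamed = []
    for dest, src1, src2 in instructions:
        # Sources read the current mapping, which preserves true
        # (read-after-write) dependences.
        p1, p2 = mapping[src1], mapping[src2]
        # The destination gets a fresh physical register, which removes
        # write-after-write and write-after-read name dependences.
        mapping[dest] = next_phys
        renamed.append((next_phys, p1, p2))
        next_phys += 1
    return renamed


program = [
    (1, 2, 3),  # r1 = r2 op r3
    (2, 1, 0),  # r2 = r1 op r0  (true dependence on r1, reuses name r2)
    (1, 3, 0),  # r1 = r3 op r0  (reuses name r1)
]
renamed_program = rename(program)
```

In this sketch the third instruction writes a physical register different from the one written by the first, so it no longer has to wait for the first two instructions merely because they share the name r1.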
Further, increasing the clock frequency that synchronizes sequential elements on the processor die increases processor throughput. As the clock frequency increases, the processor's power consumption and temperature also increase. Therefore, design techniques such as clock gating may be utilized on the die to limit power. Still, outside of power consumption concerns, the processor clock frequency cannot increase beyond a physical limit set by the time signals take to traverse the processor between sequential elements and through combinatorial logic. A signal path that limits the clock cycle of a processor in this way is referred to as a critical path. Typically, critical paths are determined during pre-silicon timing analysis when setup time violations are noted.
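The relationship between path delay and clock period can be sketched as follows. This is a simplified model, not a timing-analysis tool: the path names and delay values are hypothetical, delays are in integer picoseconds, and effects such as clock skew and clock-to-output delay are ignored.

```python
def min_clock_period_ps(path_delays_ps, setup_ps=0):
    """Return (critical_path_name, minimum_clock_period_ps).

    path_delays_ps maps a path name to the delay, in picoseconds,
    between its launching and capturing sequential elements.
    """
    # The critical path is the slowest register-to-register path.
    name = max(path_delays_ps, key=path_delays_ps.get)
    # The clock period must cover that delay plus the setup time of
    # the capturing sequential element.
    return name, path_delays_ps[name] + setup_ps


# Hypothetical delays for three paths in a design.
paths = {"decode": 420, "rename": 550, "x87_translate": 710}
critical_name, period_ps = min_clock_period_ps(paths, setup_ps=40)
max_frequency_ghz = 1000.0 / period_ps  # 1000 ps per nanosecond
```

Under these assumed numbers, shortening any path other than the slowest one leaves the minimum clock period unchanged, which is why timing work concentrates on the critical path.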
Each generation of a superscalar processor design may increase the instruction issue width, for example by issuing 4 instructions out-of-order to execution units in a single clock cycle rather than 3. Also, the clock period may be reduced. Along with noise, area, power, hold time, and other design criteria, critical paths need to be resolved in order to satisfy these design requirements. One solution includes moving segments of combinatorial logic of a critical path to a previous or subsequent clock cycle corresponding to a pipeline stage that has more allowable computation time. However, more sequential elements may be required to save a new intermediate state, which increases clock power and reduces available real estate on the die.
Even if the solution described above is viable, it will not resolve a loop critical path. A loop critical path begins with a particular sequential element, such as a flip-flop, traverses through wire routes and combinatorial logic of the path, and terminates at the same particular sequential element. Splitting this path with a second sequential element involves adding a costly pipeline stage to the design. In addition, a loop critical path may experience incorrect operation due to the second sequential element. The first half of a split path may not receive the correct output signals from corresponding flops, which are now receiving a cycle-delayed output from the second half of the split path. In order to avoid incorrect operation, a stall may need to be inserted in the pipeline, and the loop delay then grows to two costly clock cycles. For processor performance, it may be desirable to maintain this loop delay within the one originally predetermined clock cycle.
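The effect of splitting a loop path can be illustrated with a small cycle-level model. The functions `f1` and `f2` are arbitrary, hypothetical stand-ins for the two halves of the loop logic; only the structure of the three variants matters.

```python
def f1(state, x):
    """Hypothetical first half of the loop logic."""
    return state + x


def f2(mid):
    """Hypothetical second half of the loop logic."""
    return (mid * 3) % 1000


def run_single_cycle(inputs, state=0):
    """Unsplit loop: the full path f2(f1(...)) completes every cycle."""
    cycles = 0
    for x in inputs:
        state = f2(f1(state, x))
        cycles += 1
    return state, cycles


def run_split_no_stall(inputs, state=0):
    """Loop split by a second register `mid`, with no stall inserted."""
    mid, cycles = 0, 0
    for x in inputs:
        # Both registers update on the same clock edge, so f1 sees a
        # state that does not yet include the update still held in mid.
        mid, state = f1(state, x), f2(mid)
        cycles += 1
    state = f2(mid)  # drain the last value held in mid
    return state, cycles + 1


def run_split_with_stall(inputs, state=0):
    """Loop split by a second register, with one stall per input."""
    cycles = 0
    for x in inputs:
        mid = f1(state, x)  # first cycle: first half of the path
        state = f2(mid)     # second cycle: second half of the path
        cycles += 2
    return state, cycles


inputs = [1, 2, 3, 4]
reference = run_single_cycle(inputs)
no_stall = run_split_no_stall(inputs)
with_stall = run_split_with_stall(inputs)
```

In this model the stalled variant reproduces the reference result but takes twice as many cycles, while the variant without a stall finishes sooner and produces a different final state.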
An example of a loop critical path is the translation of stack-relative legacy x87 register specifier values. In a microarchitecture supporting execution of an x86 instruction set architecture (ISA), prior to a pipeline stage that performs superscalar register renaming of floating-point operands, translation of stack-relative x87 register specifiers is performed for floating-point instructions. Briefly, the x87 floating-point unit (FPU) uses an 8-entry table, which holds relative offsets with reference to a top-of-stack (TOS) value. The changes, or effects, that a particular instruction has on the translate-table depend both on the operation of the particular instruction and on the effects of a prior instruction. Therefore, translation may become a serial process.
The logic for this process may consist of N identical cascaded copies of logic, where N is the number of instructions to be translated and whose operations affect the placement order of the contents within the translate-table. Each copy of translate logic performs the translation for one instruction based on both incoming current translate-table values and the particular operation of the instruction. Each copy of logic creates new translate-table values at its output, which are then used as input values to a subsequent copy of logic. The critical path through the entire cascaded translate logic is therefore approximately N times the delay through one copy of logic.
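The cascade described above can be sketched in software. This is a deliberately reduced model under stated assumptions: the table is an 8-entry tuple, the only table-changing operation modeled is an exchange of entry 0 with entry i (in the spirit of the x87 FXCH instruction), and the names `translate_one`, `translate_group`, and `cascade_delay_ps` are hypothetical. TOS updates from pushes and pops are not modeled.

```python
def translate_one(table, op):
    """One copy of translate logic: apply `op` and return the new table.

    `op` is a hypothetical (kind, index) pair. Only "fxch" changes the
    table in this sketch; any other kind leaves it unchanged.
    """
    kind, i = op
    new_table = list(table)
    if kind == "fxch":
        # Exchange the entry for stack position 0 with position i.
        new_table[0], new_table[i] = new_table[i], new_table[0]
    return tuple(new_table)


def translate_group(table, ops):
    """Chain one copy of translate logic per instruction.

    Returns the final table and the table each instruction observed.
    Copy k cannot start until copy k-1 has produced its output.
    """
    seen = []
    for op in ops:
        seen.append(table)
        table = translate_one(table, op)
    return table, seen


def cascade_delay_ps(num_instructions, stage_delay_ps):
    """Serial delay of the cascade: N copies times the per-copy delay."""
    return num_instructions * stage_delay_ps


initial = tuple(range(8))
ops = [("fxch", 1), ("fxch", 2), ("other", 0)]
final_table, tables_seen = translate_group(initial, ops)
total_delay = cascade_delay_ps(len(ops), 200)  # 200 ps is an assumed value
```

Because each call to `translate_one` consumes the output of the previous call, the loop in `translate_group` cannot be reordered, which mirrors the serial dependence between the cascaded copies of logic.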
The total amount of delay described above may not fit within a desired processor clock cycle when a design increases the width of the x87 floating-point translation logic from N to N+1 or the design decreases its clock cycle duration. Dividing the total path by placement of sequential elements within the path adds an undesirable and costly pipeline stage. However, not increasing the width from N to N+1 limits the throughput of subsequent rename, issue, and retire pipeline stages.
In view of the above, efficient methods and mechanisms for increasing the throughput of processors by decreasing loop critical path delay are desired.