1. Field of the Invention
The present invention generally relates to reducing power consumption in microprocessor functional modules inherently requiring dedicated memory, buffer or queue resources. Such power reduction is achieved without appreciable performance loss, as measured using instructions per cycle (IPC), or significant additional hardware.
Specifically, dynamic statistics are gathered for the functional module activity and, depending on need or activity, the module is adaptively sized down or up in units of a predetermined block or chunk size. An exemplary embodiment addresses an out-of-order issue queue in high-end superscalar processors.
2. Description of the Related Art
Increasing power dissipation has become a major constraint for future performance gains in the design of microprocessors. This invention is a solution to the problem of reducing power consumption without appreciable IPC-centric performance loss, while at the same time providing the opportunity for a module-specific clock speed increase or, alternatively, a module-specific power reduction via voltage scaling. The specific embodiment herein described addresses an out-of-order issue queue of a superscalar processor, but the approach lends itself to other specific functional modules.
A non-pipelined scalar processor is one that issues and processes one instruction at a time. Each such instruction operates on scalar data operands, and each such operand is a single or atomic data value or number. Pipelining within a scalar processor introduces concurrency in processing multiple instructions in a given clock cycle, while preserving the single-issue paradigm. In contrast, a vector processor, also known as an array or SIMD (single instruction, multiple data) processor, is one that can perform an operation on an entire array of numbers in a single architectural step or instruction. Cray supercomputers are vector processors. They have machine instructions that can, for example, add each entry of array A to the corresponding entry of array B and store the result in the corresponding entry of array C.
A superscalar processor is in between the scalar processor and vector processor architectures. It accepts instructions like those of a scalar processor but has the ability to fetch, issue and execute multiple instructions in a given machine cycle. In addition, each instruction execution path is usually pipelined to enable further concurrency. IBM's Power/PowerPC™ processors, Intel's PentiumPro (P6)™ processor family, Sun's Ultrasparc™ processors, Compaq's Alpha™ family and HP's PA-RISC™ processors are all examples of superscalar processors.
In the superscalar processor architecture, the purpose of the issue queue is to receive instructions from the dispatch stage and forward “ready instructions” to the execution units. Such an issue queue in high-end superscalar processors, such as the current IBM Power4™ machine, typically consumes a large amount of power, because: (a) the queue is implemented using a continuously clocked array of flip-flops and latches; and (b) instructions are issued out of order with respect to the original sequential program semantics in a given cycle, so the required issue logic is quite complex, requiring run-time dependence checking logic. Also, as discussed in greater detail below, “holes” in the queue left by instructions that have been issued are filled up in the subsequent cycle by a “shift and compact” strategy, which is quite an energy-consuming task. In addition, since the queue size is fixed and the latches are continuously clocked, such an implementation inherently wastes a lot of energy by not exploiting the dynamically changing requirements of the executing workload.
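As a rough illustration of the “shift and compact” strategy just mentioned, the following Python sketch models the queue as a list in which issued instructions leave holes (represented by None). The function name compact_queue, the list representation, and the use of a moved-entry count as a crude stand-in for switching activity are assumptions made for illustration only; they are not taken from any particular processor design.

```python
def compact_queue(entries):
    """Shift surviving entries toward the head so that holes (None) are filled.

    Returns a tuple (compacted, moved): the compacted queue, padded with None
    at the tail to keep its length, and the number of entries that had to
    change slots (an illustrative proxy for the work done by compaction).
    """
    compacted = []
    moved = 0
    for position, entry in enumerate(entries):
        if entry is None:
            continue  # hole left by an instruction that has already issued
        if position != len(compacted):
            moved += 1  # this entry shifts to a lower-numbered slot
        compacted.append(entry)
    holes = len(entries) - len(compacted)
    return compacted + [None] * holes, moved


# Hypothetical queue after two instructions have issued.
queue_after_issue = ["i0", None, "i2", None, "i4"]
compacted_queue, entries_moved = compact_queue(queue_after_issue)
```

In this model every entry behind a hole must be rewritten each time compaction runs, which is one way to see why the strategy is costly in a continuously clocked latch array.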
Two schemes by Gonzalez et al. at UPC, Barcelona, Spain, addressed the problem of power and/or complexity reduction in issue queues without significantly impacting the IPC performance (see R. Canal and A. Gonzalez, “A low-complexity issue logic”, Proc. ACM Int'l. Conference on Supercomputing (ICS), pp. 327-335, Santa Fe, N.M., June, 2000; and D. Folegnani and A. Gonzalez, “Reducing the power consumption of the issue logic”, Proc. ISCA Workshop on Complexity-Effective Design, June, 2000). The first scheme reduces the complexity of the issue logic by employing an additional separate “ready queue” which holds only instructions with operands that are determined to be fully available at decode time. Thus, instructions can be issued “in-order” from this “ready queue” at reduced complexity without associative lookup. A separate “first-use” table is used to hold instructions, indexed by unavailable operand register specifiers.
Only those instructions that are first-time consumers of these pending operands are stored in this table. Instructions that are deeper in the dependence chain simply stall or are handled separately through a separate issue queue. The dependence link information connecting multiple instances of the same instruction in the “first-use” table is updated after each instruction execution is completed. At the same time, if a given instruction is deemed to be “ready” it is moved to the in-order ready queue. Since none of the new structures require associative lookups or run-time dependence analysis and yet instructions are able to migrate to the ready queue as soon as the operands become available, this scheme significantly reduces the complexity of the issue logic. However, it is not clear whether the net energy consumed is reduced by using this method. In fact, there could be an overall increase in power consumption due to the additional queues.
The second approach relies on static scheduling. Here the main issue queue only holds instructions with pre-determined availability times of their source operands. Since the queue entries are time-ordered due to known availabilities, the issue logic can use simple, in-order semantics. Instructions with operands which have unknown availability times are held in a separate “wait queue” and get moved to the main issue queue only when those times become definite.
In both approaches, the emphasis is on reduction of the complexity of the issue control logic. The added, or augmented, support structures in these schemes may actually cause an increase of power, in spite of the simplicity and elegance of the control logic. In the second scheme, a major focus is purportedly on power reduction. The issue queue is designed to be a circular queue structure with head and tail pointers, and the effective size is dynamically adapted to fit the ILP content of the workload during different periods of execution.
In both schemes, Gonzalez et al. show that the IPC loss is very small with the suggested modifications to the issue queue structure and logic. Also, in the second scheme, a trace-driven power-performance simulator, based on the model by Cai (G. Cai, “Architectural level power/performance optimization and dynamic power estimation”, in Proceedings of the CoolChips Tutorial, in conjunction with Micro-32, 1999), is used to report substantial power savings on dynamic queue sizing. However, a detailed circuit-level design and simulation of the proposed implementations are not reported in either approach. Without such analysis, it is difficult to gauge the cycle-time, i.e., clock frequency, impact or the extra power/complexity of the augmented design.
In the second scheme, the main focus is indeed power reduction, and the approach employed is similar in spirit to our invention in that the queue size is dynamically altered. However, the method used is completely different from the approach in this invention, and the developers do not describe any circuit-level implementation. It is more of a concept paper, limited to microarchitectural design concepts. Also, in this design a circular queue structure based on flip-flops is implied, whereas the current invention uses a CAM/RAM-based design as the building block.
The prior work by Albonesi et al. (“Dynamic IPC/Clock Rate Optimization”, D. H. Albonesi, Proc. ISCA-25, pp. 282-292, June/July, 1998; “The Inherent Energy Efficiency of Complexity-Adaptive Processor”, D. H. Albonesi, Proc. ISCA Workshop on Power-Driven Microarchitecture, June, 1998) is based on the principle of dynamic adaptation, but applied specifically to cache and memory hierarchy design. In that work, there is also reference to possibly applying adaptation techniques to other structures, like instruction queues, but without describing any implementation details. In the paper titled “Dynamic IPC/Clock Rate Optimization,” there is no description of the hardware control mechanism used to make reconfiguration decisions or actual size and geometry changes. In terms of the suggested implementation, the work reported is based upon the use of repeater circuits that are used to dynamically resize the cache array. This is quite different in content and context from the out-of-order issue queue design of the present invention, to be discussed shortly. In concept, the reconfiguration decision in this paper by Albonesi is based on the metric TPI (time per instruction), computed as the cycle time divided by IPC. In particular, for the reconfigurable cache design, the metric monitored during adaptation was the average TPI due to cache misses. U.S. Pat. No. 6,205,537 to Albonesi further discusses the content of that paper.
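The TPI metric described above can be made concrete with a short Python sketch. Only the relation TPI = cycle time / IPC comes from the text; the function names, the configuration labels, and all numeric values below are hypothetical and chosen purely for illustration.

```python
def time_per_instruction(cycle_time_ns, ipc):
    """TPI = cycle time divided by IPC (lower is better)."""
    if ipc <= 0:
        raise ValueError("IPC must be positive")
    return cycle_time_ns / ipc


def lowest_tpi_configuration(configs):
    """Return the name of the configuration with the smallest TPI.

    configs maps a configuration name to a (cycle_time_ns, ipc) pair.
    """
    return min(configs, key=lambda name: time_per_instruction(*configs[name]))


# Hypothetical numbers: the larger structure has a slower clock but a higher IPC.
example_configs = {
    "small": (1.0, 2.0),  # TPI = 0.50 ns per instruction
    "large": (1.2, 2.5),  # TPI = 0.48 ns per instruction
}
preferred = lowest_tpi_configuration(example_configs)
```

The example shows why TPI, rather than IPC or clock rate alone, is the natural quantity to compare: a configuration that lengthens the cycle time is still preferable if its IPC gain more than compensates.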
By contrast, in the present invention, as will be explained shortly, the primary decision logic used in adapting the issue queue size is based on monitoring the activity of the issue queue during a cycle window. The IPC (instructions per cycle) is monitored as a guard mechanism to override the primary reconfiguration decision to avoid large performance shortfalls. Thus, in our invention, the reconfiguration decision is based on a combination of activity measurement and IPC history.
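The combination of an activity-based primary decision and an IPC guard, as outlined above, might be sketched as follows. This is a minimal illustration under stated assumptions, not the circuit or control logic of the invention: the function name next_queue_size, the use of average active entries as the activity measure, the shrink condition, and the 5% IPC drop tolerance are all hypothetical.

```python
def next_queue_size(current_size, chunk, max_size, avg_active_entries,
                    ipc_now, ipc_prev, ipc_drop_tolerance=0.05):
    """Pick the queue size for the next cycle window.

    Sizes change in units of `chunk`. All thresholds are illustrative
    assumptions rather than values taken from the source text.
    """
    # Guard: if IPC fell noticeably versus the previous window, override
    # the primary decision and grow the queue by one chunk.
    if ipc_prev > 0 and ipc_now < ipc_prev * (1.0 - ipc_drop_tolerance):
        return min(current_size + chunk, max_size)
    # Primary decision: shrink by one chunk when the measured activity
    # would fit in the smaller queue.
    if current_size > chunk and avg_active_entries <= current_size - chunk:
        return current_size - chunk
    # Otherwise keep the current size.
    return current_size


# Hypothetical windows for a 32-entry queue resized in 8-entry chunks.
shrunk = next_queue_size(32, 8, 32, avg_active_entries=10.0, ipc_now=2.0, ipc_prev=2.0)
regrown = next_queue_size(24, 8, 32, avg_active_entries=10.0, ipc_now=1.5, ipc_prev=2.0)
```

In the second call the activity measure alone would favor shrinking again, but the IPC drop triggers the guard and the queue grows instead, which is the override behavior the paragraph describes.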
The paper by Albonesi (“Selective Cache Ways: On Demand Cache Resource Allocation”, D. Albonesi, 32nd International Symposium on Microarchitecture, November, 1999) is again focused on cache design; but this paper proposes the disabling of inactive cache ways to save power in a set-associative design. The partitioning of the cache array and directory into “ways” is a natural part of set-associative cache design. This is quite unlike an issue queue design, which is normally not partitioned into substructures. Albonesi's power-efficient design proposal, in this case, leverages an existing partitioned array organization and adds the necessary logic needed to selectively disable sub-arrays, when the resultant performance loss is deemed to be within a predefined acceptance level. More explicitly, the disable/enable decisions in this design are based on a metric termed PDT or performance degradation threshold. The PDT signifies the allowable margin of performance loss in a disablement decision. Exact logic and circuit implementations to use PDT in a reconfiguration decision are not discussed in this paper.
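Since the cited paper does not give the logic for applying the PDT, the following Python sketch is offered only to make the notion of an allowable performance-loss margin concrete. The function name, the idea of a per-way-count performance table, and the sample numbers are hypothetical and should not be read as Albonesi's mechanism.

```python
def fewest_ways_within_pdt(perf_by_ways, pdt):
    """Return the smallest number of enabled ways whose performance loss,
    relative to all ways enabled, does not exceed the threshold `pdt`.

    perf_by_ways maps ways enabled -> a performance figure where higher is
    better (for example IPC). pdt is a fraction, e.g. 0.02 for 2%.
    """
    full_ways = max(perf_by_ways)
    baseline = perf_by_ways[full_ways]
    for ways in sorted(perf_by_ways):
        loss = (baseline - perf_by_ways[ways]) / baseline
        if loss <= pdt:
            return ways
    return full_ways


# Hypothetical performance estimates for a 4-way set-associative cache.
estimated_perf = {1: 1.50, 2: 1.90, 3: 1.97, 4: 2.00}
ways_at_2_percent = fewest_ways_within_pdt(estimated_perf, 0.02)
ways_at_10_percent = fewest_ways_within_pdt(estimated_perf, 0.10)
```

A looser threshold permits more ways to be disabled, trading a bounded performance loss for lower power.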
In the paper “Memory Hierarchy Reconfiguration . . . ” (R. Balasubramonian, D. H. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas, “Memory Hierarchy Reconfiguration for Energy and Performance in General-Purpose Processor Architectures”, 33rd International Symposium on Microarchitecture, December, 2000), these authors again propose the use of repeater insertion technology to adapt cache and TLB (translation lookaside buffer) structures within a processor. The reconfiguration decision is based on phase change detection in an application along with hit and miss tolerance metrics. Like the original Albonesi paper on “Dynamic IPC/Clock Rate Optimization”, therefore, this paper is quite different in scope and concept from the present invention.