1. Field of the Invention
The present invention generally relates to multi-threaded processors and more particularly to reducing power consumption in a Simultaneous MultiThreaded (SMT) processor or microprocessor.
2. Background Description
Semiconductor technology and chip manufacturing advances have resulted in a steady increase of on-chip clock frequencies, the number of transistors on a single chip and the die size itself. Thus, notwithstanding the decrease of chip supply voltage, chip power consumption has increased as well. At both the chip and system levels, cooling and packaging costs have escalated as a natural result of this increase in chip power. At the low end for small systems (e.g., handhelds, portable and mobile systems), where battery life is crucial, it is important to reduce net power consumption without having performance degrade to unacceptable levels. Thus, the increase in microprocessor power consumption has become a major stumbling block for future performance gains. Pipelining is one approach to maximizing processor performance.
A scalar processor fetches and issues/executes one instruction at a time. Each such instruction operates on scalar data operands. Each such operand is a single or atomic data value or number. Pipelining within a scalar processor introduces what is known as concurrency, i.e., processing multiple instructions at different pipeline stages in a given clock cycle, while preserving the single-issue paradigm.
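By way of illustration only, the concurrency described above may be sketched as follows. The five stage names and the function name pipeline_occupancy are assumptions chosen for the example and are not taken from any particular processor; stalls and hazards are ignored.

```python
# Illustrative sketch only: an idealized single-issue pipeline with no stalls.
# The stage names below are assumed for the example.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_occupancy(num_instructions, stages=STAGES):
    """Return one dict per clock cycle mapping stage name -> instruction index."""
    total_cycles = num_instructions + len(stages) - 1
    timeline = []
    for cycle in range(total_cycles):
        occupied = {}
        for depth, stage in enumerate(stages):
            # The instruction in this stage entered the pipeline `depth` cycles ago.
            instr = cycle - depth
            if 0 <= instr < num_instructions:
                occupied[stage] = instr
        timeline.append(occupied)
    return timeline


# Print which instruction occupies which stage in each cycle.
for cycle, occupied in enumerate(pipeline_occupancy(4)):
    print(cycle, occupied)
```

In this sketch only one instruction enters the first stage per cycle (single issue), yet once the pipeline fills, several instructions are in flight at different stages in the same cycle.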
A superscalar processor can fetch, issue and execute multiple instructions in a given machine cycle, each in a different execution path or thread. Each instruction fetch, issue and execute path is usually pipelined for further, parallel concurrency. Examples of superscalar processors include the Power/PowerPC processors from IBM Corporation, the Pentium processor family from Intel Corporation, the UltraSPARC processors from Sun Microsystems and the Alpha and PA-RISC processors from Hewlett Packard Company (HP). Front-end instruction delivery (fetch and dispatch/issue) accounts for a significant fraction of the energy consumed in a typical state of the art dynamic superscalar processor. For high-performance processors, such as IBM's POWER4™, the processor consumes a significant portion of chip power in the instruction cache (ICACHE) during normal access and fetch processes. Of course, when the fetch process stalls temporarily (e.g., due to instruction buffer fill-up, or cache misses), that portion of chip power falls off dramatically, provided the fetch process is stalled also.
Unfortunately, other factors (e.g., chip testability, real estate, yield) tend to force a trade of power for control simplification. So, in prior generation power-unaware designs, one may commonly find processors architected to routinely access the ICACHE on each cycle, even when the fetched results may be discarded, e.g., due to stall conditions. Buffers and queues in such processor designs have fixed sizes and, depending on the implementation, consume power at a fixed rate, irrespective of actual cache utilization or workload demand. For example, for a typical state of the art instruction fetch unit (IFU) in a typical state of the art eight-issue superscalar processor, executing a class of commercial benchmark applications, only about 27% of the cycles result in useful fetch activity. Similarly, idle and stalled resources of a front-end instruction decode unit (IDU) pipe waste significant power. Further, this front-end starvation keeps back-end execute pipes even more underutilized, which impacts processor throughput.
By contrast, in what is known as an energy-aware design, the fetch and/or issue stages are architected to be adaptive, to accommodate workload demand variations. These energy-aware designs adjust the fetch and/or issue resources to save power without appreciable performance loss. For example, Buyuktosunoglu et al. (Buyuktosunoglu I), “Energy efficient co-adaptive instruction fetch and issue,” Proc. Int'l. Symp. on Computer Architecture (ISCA), June 2003 and Buyuktosunoglu et al. (Buyuktosunoglu II), “Tradeoffs in power-efficient issue queue design,” Proc. ISLPED, August 2002, both discuss such energy-aware designs. In particular, Buyuktosunoglu I and II focus on reconfiguring the size of issue queues, in conjunction (optionally) with an adjustable instruction fetch rate. In another example, Manne et al., “Pipeline Gating: Speculation Control for Energy Reduction,” Proc. 25th Int'l. Symp. on Computer Architecture (ISCA), 1998, teaches using the processor branch mis-prediction rate in the instruction fetch to effectively control the fetch rate for power and efficiency. Unfortunately, monitoring the branch prediction accuracy requires additional, significant and complex on-chip hardware that consumes both valuable chip area and power.
This problem is exacerbated in multithreaded machines, where multiple instruction threads may or may not be in the pipeline at any one time. For example, Karkhanis et al., “Saving energy with just-in-time instruction delivery,” Proc. Int'l. Symp. on Low Power Electronics and Design (ISLPED), August 2002, teaches controlling instruction fetch rate by keeping a count of valid, downstream instructions. U.S. Pat. No. 6,212,544 to Borkenhagen et al. (Borkenhagen I), entitled “Altering thread priorities in a multithreaded processor,” and U.S. Pat. No. 6,567,839 to Borkenhagen et al. (Borkenhagen II), entitled “Thread switch control in a multithreaded processor system,” both assigned to the assignee of the present invention and incorporated herein by reference, teach designing efficient thread scheduling control for boosting performance and/or reducing power in multithreaded processors. In yet another example, Seng et al., “Power-Sensitive Multithreaded Architecture,” Proc. Int'l. Conf. on Computer Design (ICCD) 2000, teaches an energy-aware multithreading design.
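The general idea of controlling the fetch rate from a count of valid, downstream instructions may be sketched, by way of illustration only, as follows. This is not the mechanism of any cited reference; the class FetchThrottle, the threshold max_in_flight and the helper simulate are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class FetchThrottle:
    """Hypothetical count-based throttle: fetch only while few instructions are in flight."""
    max_in_flight: int      # assumed threshold, not taken from any reference
    in_flight: int = 0      # running count of valid downstream instructions

    def may_fetch(self):
        return self.in_flight < self.max_in_flight

    def on_fetch(self, n=1):
        self.in_flight += n

    def on_retire(self, n=1):
        # Instructions leaving the pipeline reduce the count (never below zero).
        self.in_flight = max(0, self.in_flight - n)


def simulate(throttle, retired_per_cycle):
    """Return the per-cycle fetch decision for a given retirement pattern."""
    decisions = []
    for retired in retired_per_cycle:
        throttle.on_retire(retired)
        fetch = throttle.may_fetch()
        if fetch:
            throttle.on_fetch()  # one instruction fetched this cycle
        decisions.append(fetch)
    return decisions


print(simulate(FetchThrottle(max_in_flight=2), [0, 0, 0, 1, 0]))
```

With the assumed threshold of two, fetch is suppressed in any cycle in which two instructions are already in flight and resumes once an instruction leaves the pipeline.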
State of the art commercial microprocessors (e.g., Intel's NetBurst™ Pentium™ IV or IBM's POWER5™) use a mode of multithreading that is commonly referred to as Simultaneous MultiThreading (SMT). In each processor cycle, an SMT processor simultaneously fetches and/or dispatches instructions for different threads that populate the back-end execution resources. Fetch gating in an SMT processor refers to conditionally blocking the instruction fetch process. Thread prioritization involves assigning priorities in the order of fetching instructions from a mix of different workloads in a multi-threaded processor. Some of the above energy-aware design approaches have been applied to SMT. For example, Luo et al., “Boosting SMT Performance by Speculation Control,” Proc. Int'l. Parallel and Distributed Processing Symposium (IPDPS), 2001, teaches improving performance in energy-aware SMT processor design. Moursy et al., “Front-End Policies for Improved Issue Efficiency in SMT Processors,” Proc. HPCA 2003, focuses on reducing the average power consumption in SMT processors by sacrificing some performance. By contrast, Knijnenburg et al., “Branch Classification for SMT Fetch Gating,” Proc. MTEAC 2002, focuses on increasing performance without regard to complexity. These energy-aware approaches require complex variable instruction fetch rate mechanisms and control signals necessitating significant additional logic hardware. The additional logic hardware dynamically calculates complex utilization, prediction rates and/or flow rate metrics within the processor or system. However, the verification logic of such control algorithms adds overhead in complexity, area and power that is not amenable to a low cost, easy implementation for high performance chip designs. This overhead just adds to both escalating development costs and spiraling power dissipation costs.
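The two terms defined above, fetch gating and thread prioritization, may be illustrated together by the following minimal sketch. It is not the policy of any cited reference; the function select_fetch_thread, the integer priority encoding (larger value means higher priority) and the tie-break on the lower thread identifier are assumptions made for the example.

```python
def select_fetch_thread(priorities, gated):
    """Pick the thread to fetch from in this cycle, or None if every thread is gated.

    priorities: dict mapping integer thread id -> integer priority (assumed encoding).
    gated: set of thread ids whose fetch is conditionally blocked this cycle.
    """
    eligible = [t for t in priorities if t not in gated]
    if not eligible:
        return None  # all fetch is gated; no ICACHE access is requested
    # Highest priority wins; ties go to the lower thread id (assumed tie-break).
    return max(eligible, key=lambda t: (priorities[t], -t))


priorities = {0: 1, 1: 3, 2: 3}
print(select_fetch_thread(priorities, gated=set()))      # thread 1
print(select_fetch_thread(priorities, gated={1}))        # thread 2
print(select_fetch_thread(priorities, gated={0, 1, 2}))  # None
```

The condition that places a thread in the gated set is deliberately left outside the sketch, since that condition is precisely where the approaches discussed above differ in hardware cost.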
Unfortunately, many of these approaches have achieved improved performance only at the cost of increased processor power consumption. Others have reduced power consumption (or at least net energy usage) by accepting significantly degraded performance. Still others have accepted complex variable instruction fetch rate mechanisms that necessitate significant additional logic hardware.
Thus, there is a need for a processor architecture that minimizes power consumption without impairing processor performance and without requiring significant control logic overhead or power.