Power, energy, and environmental problems are attracting growing concern. In IT infrastructure, thousands of computing devices, such as mobile phones, desktops, and servers, run around the world every day. However, relevant statistics show that the resource utilization efficiency of these computing devices is quite low. For example, the utilization efficiency of x86 servers, as represented by Intel, stands at only 10% to 15%.
People have been tackling power consumption at different levels, e.g., by designing low-power chipsets with advanced technologies or by adjusting the operating voltage and frequency according to workload.
All modern processors, i.e. CPUs, are fabricated with CMOS technologies. CMOS power consumption is divided into dynamic power consumption and static power consumption. Static power consumption is the standby power consumed even when a device performs no action, and is mainly caused by the leakage currents of transistors. Dynamic power consumption is mainly the power consumed by switching actions while transistors are operating. Dynamic power consumption is the main source of CPU power consumption, so reducing it is the key to reducing the total CPU power consumption. As is well known in the art, dynamic power consumption is given by formula (1):
$$P_{DYNAMIC} = C_L \cdot N_{SW} \cdot V_{DD}^2 \cdot f \qquad (1)$$
in which $P_{DYNAMIC}$ is the dynamic power consumption, $C_L$ is the total equivalent capacitance, $N_{SW}$ is the ratio of switching activity to the system clock, $V_{DD}$ is the supply voltage, and $f$ is the operating frequency. As is clear from formula (1), dynamic power consumption is proportional to the total equivalent capacitance. If the total equivalent capacitance can be reduced, dynamic power consumption is reduced linearly, and the total CPU power consumption decreases accordingly. In chip design, the equivalent capacitance corresponds to the transistors that carry out operations. Therefore, reducing the number of transistors that carry out operations in a chip amounts to reducing the equivalent capacitance, and power consumption decreases accordingly.
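As a brief illustration of formula (1), the following sketch evaluates the dynamic power for a baseline and for two variations. All numeric values are illustrative assumptions and are not taken from any real processor; the function and variable names are likewise hypothetical.

```python
def dynamic_power(c_load, n_sw, v_dd, freq):
    """Dynamic power per formula (1): P = C_L * N_SW * V_DD^2 * f."""
    return c_load * n_sw * v_dd ** 2 * freq


# Illustrative values only (assumed, not measured on any device).
baseline = dynamic_power(c_load=1e-9, n_sw=0.2, v_dd=1.2, freq=3.2e9)

# Halving the total equivalent capacitance halves the dynamic power.
half_cap = dynamic_power(c_load=0.5e-9, n_sw=0.2, v_dd=1.2, freq=3.2e9)

# Halving the supply voltage quarters it, because V_DD enters squared.
half_volt = dynamic_power(c_load=1e-9, n_sw=0.2, v_dd=0.6, freq=3.2e9)

print(f"half capacitance: {half_cap / baseline:.2f} x baseline")
print(f"half voltage:     {half_volt / baseline:.2f} x baseline")
```

The linear dependence on the equivalent capacitance is what makes switching off unused transistors attractive: the saving scales directly with the share of capacitance that no longer toggles.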
Modern integrated processors, such as microprocessors and digital signal processors, are commonly designed using a complex pipelined architecture. In a CPU, the pipeline and the data/instruction feeding logic are the most active components and account for a fairly large part of the total CPU power consumption. According to statistics, the pipeline accounts for about ⅓ of the total power consumption. Therefore, decreasing the power consumption of the pipeline can effectively reduce the power consumption of processors.
The concept of a pipeline borrows from the assembly line in industrial production. Specifically, in a CPU, several (e.g. six) circuit units with different functions form an instruction processing pipeline, and an instruction is divided into six steps, each implemented by one of these circuit units. By overlapping the execution of successive instructions, one instruction can be completed in each CPU clock cycle, so the CPU computing speed is enhanced. Typically, these circuit units with different functions on a pipeline are called processing stages (or simply “stages”). Each stage executes a specific function and transfers its processing result to the next stage.
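The benefit of overlapping instructions can be sketched with a simple cycle count. The sketch assumes an idealized pipeline in which every stage takes exactly one clock cycle and no stalls occur; the function names are hypothetical.

```python
def cycles_unpipelined(num_instructions, num_stages):
    """Each instruction passes through all stages before the next one starts."""
    return num_instructions * num_stages


def cycles_pipelined(num_instructions, num_stages):
    """Idealized pipeline: the first instruction needs num_stages cycles to
    fill the pipeline, then one instruction completes per cycle."""
    if num_instructions == 0:
        return 0
    return num_stages + num_instructions - 1


n, k = 1000, 6  # 1000 instructions on a six-stage pipeline (assumed values)
print(cycles_unpipelined(n, k))  # 6000
print(cycles_pipelined(n, k))    # 1005
```

For a long instruction stream the ratio of the two counts approaches the number of stages, which is why, in the ideal case, about one instruction completes per clock cycle.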
In an early pipelined structure, there are only pipeline stages that fulfill basic functions. The most classical one is a five-stage pipelined structure whose functions comprise: instruction fetch (IF), instruction decoding (ID), execution (EX), memory access (MEM) and write back (WB). Among them, instruction fetch IF fetches an instruction from memory, such as random access memory (RAM), at the address held in the program counter (PC), and outputs the instruction to the next stage; instruction decoding ID decodes the fetched instruction; execution EX usually includes an arithmetic logic unit (ALU) for executing the decoded instruction; memory access MEM accesses a memory to acquire operation data; and write back WB writes the result of execution to a register or memory for later use. With these pipeline stages fulfilling basic functions, a CPU can execute computing tasks. This design methodology, typical of ARM-like CPUs, is defined herein as “design for efficiency” (DFE).
With the development of integrated circuit production processes and the continuing reduction of transistor dimensions, more transistors can be integrated on a single chip, while the requirements imposed on computing performance become more and more stringent. Therefore, functions for improving computing performance have been added to CPU pipeline designs, and the number of pipeline stages has gradually increased, for example to 13 stages, 19 stages, or even more than 30 stages. The five-stage pipelined structure outlined above is elaborated: the five basic functions are distributed over more stages, and the function fulfilled in each stage becomes increasingly complicated. The additional functions for improving computing performance may include, for example, superscalar execution, hazard detection, branch prediction, register renaming, issue selection, reorder buffering, data forwarding, speculative execution, and dynamic scheduling. Among them, superscalar execution processes multiple instructions simultaneously by building in a plurality of pipelines. In other words, superscalar execution carries out multiple tasks using more transistors, i.e. it trades space for time. Hazard detection, also called out-of-order execution, means that a plurality of instructions are not dispatched to the respective pipeline stages in the order specified by the program. Instead, analysis and judgment are first made according to the operating state of each pipeline stage and whether an instruction to be processed can be executed in advance; then, according to the result, instructions that can be executed in advance are sent to the corresponding pipeline stages for execution. In this way idle pipeline stages are fully used, the internal circuits of the CPU operate at full load, and the speed at which the CPU runs programs is improved.
Branch prediction involves CPU dynamic execution techniques. When there is no conditional branch in the instructions, they are executed in their original order; when there is a conditional branch, a decision is made, according to the result of processing the instructions in the pipelines, as to whether the instructions continue to be executed in their original order. Branch prediction means predicting, before the result of the previous instruction is produced, whether a branch will be taken, so that the corresponding instructions can be executed in advance. As a result, pipelines are prevented from waiting idle, and the CPU computing speed is enhanced. The other additional functions are likewise added for the purpose of performance enhancement. This design methodology, typical of Pentium-like CPUs, is defined herein as “design for performance” (DFP).
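To make the idea of predicting a branch before its outcome is known more concrete, the sketch below uses a two-bit saturating counter. This is a common textbook scheme chosen here purely as an assumption for illustration; the text above does not specify any particular prediction mechanism, and the class and function names are hypothetical.

```python
class TwoBitPredictor:
    """Two-bit saturating counter (illustrative scheme, not from the text).

    States 0 and 1 predict "not taken"; states 2 and 3 predict "taken".
    """

    def __init__(self):
        self.state = 2  # start in the weakly "taken" state (arbitrary choice)

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_correct(outcomes):
    """Count how many branch outcomes a fresh predictor guesses correctly."""
    predictor = TwoBitPredictor()
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct


# A loop branch that is taken nine times and then falls through once.
loop_branch = [True] * 9 + [False]
print(count_correct(loop_branch), "of", len(loop_branch), "predicted correctly")
```

Each correct prediction lets the pipeline keep fetching along the right path, while each misprediction costs the cycles spent on instructions that must be discarded. The predictor itself is additional logic that consumes power on every branch.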
FIG. 1a and FIG. 1b illustrate a typical pipelined architecture having 19 stages in modern processors, in which, in addition to the basic functions requisite for carrying out tasks mentioned above, there are various complex functions used for performance enhancement and some pipeline stages include multiple modules.
As is clear from the architecture design of modern processors described above, the addition of a large number of additional functions, although bringing higher performance and faster speed to processor computation, is implemented with more transistors, thereby occupying a larger chip area and consuming more power. For example, the mainstream PC configuration in 1995 was a 486DX2-66 CPU with a power consumption of 2.5 W; at present, the Pentium 4 (Prescott) has a peak power of 255 W and an idle power of 120 W. This represents an average increase of more than 60 times.
Further, we have noted that there is severe resource waste in existing computing devices: almost all CPUs of computing devices are underutilized or have an extremely low utilization efficiency. For example, assume that a desktop CPU can operate at a high frequency of 3.2 GHz, but that such high computing power, i.e. its peak performance, is needed during only 5% of the time in a day, while a low-frequency CPU with simple pipelines could fulfill the computing tasks during the remaining 95% of the time. However, because of the complicated pipeline design used by the CPU, the aforesaid pipeline stages and/or modules for improving computing performance are still invoked during that 95% of the time, even though no particularly high computing performance is needed. The use of these pipeline stages and/or modules therefore consumes a large amount of energy without any meaningful output, which wastes power. Furthermore, with the advancement of computer technologies, users are increasingly concerned with which mode of operation saves the most electricity, i.e. has the lowest power consumption while still meeting the performance requirements.
It can be seen from the above analysis that, since these computing devices operate at a low load most of the time, there is large room for energy saving in them. If the power consumption of computing devices can be reduced without impairing performance, or with only minimal performance loss, energy and environmental problems will be greatly alleviated.
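The scale of the potential saving can be estimated with a simple duty-cycle calculation. The 5%/95% split comes from the example above; the power figures for each mode are hypothetical placeholders chosen only to show the arithmetic, and the function name is an assumption.

```python
def daily_energy_wh(power_by_mode_w, fraction_by_mode, hours=24.0):
    """Energy in watt-hours over `hours`, given per-mode power (W) and the
    fraction of time spent in each mode (fractions must sum to 1)."""
    if abs(sum(fraction_by_mode.values()) - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(
        power_by_mode_w[mode] * fraction * hours
        for mode, fraction in fraction_by_mode.items()
    )


# Time split taken from the example: peak performance is needed 5% of the day.
fractions = {"peak": 0.05, "light": 0.95}

# Hypothetical power figures (W), for illustration only.
complex_always = {"peak": 100.0, "light": 60.0}     # complex pipeline always active
simple_when_light = {"peak": 100.0, "light": 20.0}  # simple pipeline under light load

e_complex = daily_energy_wh(complex_always, fractions)
e_simple = daily_energy_wh(simple_when_light, fractions)
print(f"saving: {100 * (1 - e_simple / e_complex):.0f}% of daily energy")
```

Because the light-load mode dominates the day, nearly all of the saving comes from lowering the power drawn during that 95% of the time, not from the peak mode.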
In the Proceedings of the 28th Annual International Symposium on Computer Architecture (ISCA'01), IEEE, 2001, a paper entitled “Power and Energy Reduction Via Pipeline Balancing” (hereinafter referred to as reference document 1) proposes a pipeline balancing (PLB) technique. This technique adjusts the width of pipelines based on the amount of parallelism within a program, and reduces power consumption by adopting relatively wide pipelines when the amount of parallelism is relatively high and relatively narrow pipelines when the amount of parallelism is relatively low. In essence, it is a superscalar technique. This technique reduces power consumption to a certain extent, but it has limitations: it decides the width of pipelines based only on the amount of parallelism within a program and treats a whole pipeline as the unit of control, and is thus a relatively coarse power management technique. Therefore, this technique is not applicable in at least two cases. The first case is when the amount of parallelism within a program is very high but the requirement on performance is not. For example, when browsing a web page, a user naturally reads from the top down according to his or her reading habit. If the content of the lower portion is displayed slightly later than the content of the upper portion, the user's browsing experience is not affected and the delay is acceptable. According to the PLB technique, however, relatively wide pipelines are adopted and all content on the web page is presented to the user at the same time. Although the performance is very high, the user's experience is no better, and more power is consumed. The second case is the processing of programs that cannot be processed in parallel, such as audio playback; according to the PLB technique, only relatively narrow pipelines can be adopted. Although power consumption is reduced, performance cannot be improved.
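The limitation described above can be sketched with a width selection rule that looks only at measured parallelism. This is a simplified illustration of the general idea, not the algorithm of reference document 1; the available widths, the parallelism values, and the function name are all assumptions.

```python
def choose_width(measured_parallelism, widths=(2, 4, 8)):
    """Pick the narrowest available pipeline width that covers the observed
    parallelism (instructions that could issue together per cycle).

    Note that the rule has no input for how much performance the user
    actually needs, which is the limitation discussed above.
    """
    for width in sorted(widths):
        if measured_parallelism <= width:
            return width
    return max(widths)


# Case 1: highly parallel page rendering with a modest performance need.
# The rule still selects the widest (most power-hungry) configuration.
print(choose_width(6.5))  # 8

# Case 2: a mostly serial workload such as audio playback.
# The rule selects the narrowest configuration regardless of demand.
print(choose_width(1.2))  # 2
```

In both cases the decision is driven by a single program property, so the selected width cannot track the performance level that is actually required.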
The prior art outlined above reduces CPU power consumption to a certain extent, but only from one aspect; its application is rather limited, and it does not achieve an optimal balance between performance and power consumption. Therefore, there is a need for a mechanism that can effectively reduce power consumption while meeting given performance requirements, and that can be applied to CPUs in various environments.