Field of the Invention
The invention is generally related to systems and methods for reducing processor power consumption during the instruction fetch stage of processing. More specifically, the disclosure is related to predicting the location of instructions within the instruction cache.
Related Art
Processor designers are continually attempting to improve the performance of processors. Performance can be measured in many different ways. For example, processor designers strive to reduce the power used by the processor. This can be done by turning off parts of the processor that are not currently handling tasks. For example, processor designers have broken up the area of memory that stores instructions, known as the instruction cache, into multiple pieces or ways, such as 2 ways, 4 ways, etc. This allows the processor designer to power down ways that do not contain the instructions that are currently being fetched.
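The power benefit of partitioning the instruction cache into ways can be illustrated with a minimal sketch. The model below is illustrative only and is not taken from the disclosure: the function names are hypothetical, and it assumes that a fetch powers exactly one way when the way holding the instruction is known in advance, and powers every way otherwise.

```python
from typing import Iterable, Optional


def ways_powered(num_ways: int, known_way: Optional[int]) -> int:
    """Return how many ways must be powered for a single fetch.

    Hypothetical model: if the way holding the instruction is known
    ahead of time, only that way is powered; otherwise all ways are.
    """
    if known_way is None:
        return num_ways
    if not 0 <= known_way < num_ways:
        raise ValueError("known_way is outside the cache's range of ways")
    return 1


def total_way_activations(num_ways: int,
                          fetches: Iterable[Optional[int]]) -> int:
    """Sum the way activations over a sequence of fetches.

    Each entry in `fetches` is the known way index, or None when the
    way is not known before the fetch.
    """
    return sum(ways_powered(num_ways, way) for way in fetches)


if __name__ == "__main__":
    # Four fetches into a 4-way cache; the way is unknown for the third.
    print(total_way_activations(4, [0, 2, None, 1]))
    # The same four fetches with no way information at all.
    print(total_way_activations(4, [None] * 4))
```

Under these assumptions, the number of way activations serves as a rough proxy for fetch power: the more often the way is known before the fetch, the fewer ways need to be powered.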
In another example, processor designers may increase the speed of the processor, either in terms of the clock speed at which the processor runs or the number of instructions the processor can complete in a given time period, e.g., in one second. In order to increase the speed at which processors can execute instructions, processor designers have broken down the steps of executing an instruction into discrete stages that together form a pipeline. This allows designers to design processors that can execute multiple instructions in parallel, one for each stage. For example, the processor can be broken down into 5 pipeline stages: Instruction Fetch, Instruction Decode, Execution, Memory Access, and Register Write Back. During execution, this allows a processor to fetch an instruction while the previous instruction is being decoded, the instruction before that is being executed, the instruction before that is accessing memory, and the instruction before that is writing its results back into the register file. FIG. 1 illustrates this performance increase. FIG. 1A illustrates how a processor without any pipelined execution would process three instructions. In FIG. 1A, each of the instructions takes 5 time units to execute. Since there is no pipelining, each instruction cannot begin until the previous instruction finishes. Therefore, it takes 15 time units to complete these three instructions.
In FIG. 1B, the same three instructions are illustrated being executed on a pipelined processor. Because each instruction can begin a given stage as soon as the previous instruction has finished with that stage, the processor is able to process these three instructions in seven time units: five for the first instruction, plus one additional time unit for each of the two instructions that follow.
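The timing comparison of FIGS. 1A and 1B can be reproduced with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and it assumes that every stage takes exactly one time unit and that the pipeline never stalls.

```python
def unpipelined_time(num_instructions: int, num_stages: int) -> int:
    """Time units when each instruction must finish before the next starts."""
    return num_instructions * num_stages


def pipelined_time(num_instructions: int, num_stages: int) -> int:
    """Time units on an ideal pipeline with one-unit stages and no stalls.

    The first instruction needs num_stages units to drain through the
    pipeline; each later instruction completes one unit after the
    instruction ahead of it.
    """
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


if __name__ == "__main__":
    # Three instructions on a five-stage processor, as in FIGS. 1A and 1B.
    print(unpipelined_time(3, 5))  # 15
    print(pipelined_time(3, 5))    # 7
```

As the instruction count grows, the ideal pipelined time approaches one time unit per instruction, whereas the unpipelined time remains five time units per instruction.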
In more complex processor architectures, these stages can be broken down further. For example, the Instruction Fetch stage can be broken down into Instruction Prefetch (IP), Instruction Tag Read (IT), Instruction Predict (ID), and Instruction Select (IS). During the IP stage, the processor can determine the address of the next instruction to be fetched. During the IT stage, the processor can read the tags and data from the cache at that address. During the ID stage, the processor can determine the way in which the next instruction will be located, for example by reading the tags.
As the clock cycle time decreases, the number of pipeline stages must increase to accomplish the required task. For example, if it took 40 ns to complete the instruction fetch stage, but the clock cycle needs to be decreased to 10 ns, the instruction fetch stage may be broken down as described above. Unfortunately, this has the consequence of requiring sacrifices in other areas of performance. For example, there is not enough time to predict in which way the next instruction will be located before it is needed. Therefore, the power savings that were introduced by partitioning the instruction cache into 2, 4, or more ways are lost.
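The relationship between clock cycle time and stage count in the example above can be sketched as a simple calculation. The function name below is hypothetical, and the sketch assumes that the work of a stage can be divided evenly and that splitting a stage adds no latching overhead, neither of which is stated in the disclosure.

```python
import math


def substages_needed(stage_latency_ns: float, clock_period_ns: float) -> int:
    """Minimum number of sub-stages so that each fits in one clock period.

    Assumes the stage's work divides evenly and ignores any overhead
    added by the registers between sub-stages.
    """
    if stage_latency_ns <= 0 or clock_period_ns <= 0:
        raise ValueError("latency and clock period must be positive")
    return math.ceil(stage_latency_ns / clock_period_ns)


if __name__ == "__main__":
    # A 40 ns instruction fetch stage with a 10 ns clock cycle.
    print(substages_needed(40, 10))  # 4
```

Under these assumptions, a 40 ns fetch stage with a 10 ns clock cycle requires four sub-stages, which matches the four sub-stages named above (IP, IT, ID, and IS).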