The memory subsystem has always been a major component of microprocessors, contributing to both their performance and their energy consumption. Various memory structures in microprocessors, such as the register file and primary memory, help bridge data latency and bandwidth gaps but consume a significant amount of energy. This energy consumption is disproportionate relative to that of other microprocessor components. In particular, the energy consumed in a memory subsystem by instruction references is typically higher than that consumed by data references. A typical dynamic instruction-to-data reference mix for a typical commercial reduced instruction set computing (RISC) instruction set architecture (ISA) is about three to one. Hence, reducing instruction fetch energy can be critical to reducing the total energy consumed by the microprocessor.
Many emerging software applications, especially in the digital signal processing and embedded space, are loop-intensive and/or nested-loop in nature and spend a large fraction of their execution time in small to medium-sized program loops. Specifically, traditional embedded software applications, such as paging, fax, and gaming, as well as traditional floating-point applications, exhibit this behavior, and the base of such software applications is growing.
The instruction fetch energy for these software applications can be reduced by using a small instruction buffer, often referred to as a loop buffer, to capture and manage repetitive loop streams. The loop buffer can be built out of fast, low-power, low-density memory arrays. When instructions are captured and repetitively handled through such a loop buffer, the larger primary instruction memory unit can be shut down, leading to substantial savings in dynamic power dissipation.
The control and operation of such a loop buffer can be handled statically or dynamically with the aid of software or hardware techniques. Software approaches that rely on the compiler normally require profile-driven tasks to identify the loop ranges of the program code and to determine when to capture instructions into the loop buffer. Software-directed approaches, and even hardware approaches, that rely on static decisions for a priori code placement in the loop buffer before program execution are limited in their applicability and in their ability to reduce fetch energy. In addition, such approaches raise issues with legacy code that negatively impact a new computer processor's ability to remain backwards compatible with an existing computer processor line.
A hardware technique that dynamically adjusts to program execution behavior in detecting and managing loop constructs is more portable and more widely applicable. Such a hardware technique requires a mechanism for dynamically identifying, capturing, and managing loops during program execution.
Identifying loops during program execution can be accomplished in a variety of ways. In some software and hardware approaches, special instructions identifying the start of a loop are added to the ISA, so that detecting one of these instructions in a program signifies the start of a loop. For ISAs with no such special hint instructions, a branch instruction with a backward displacement target generally identifies a loop, with the displacement target address being the start of the loop. One approach to identifying a loop is to monitor decoded instructions for branch operation codes with a negative displacement. Relying on this operation code monitoring approach is expensive, since loop identification can be delayed by a couple of cycles while waiting for information from the decode stage.
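The backward-branch test described above can be sketched as follows. This is a minimal illustration only, not the mechanism of any particular processor: the `Branch` record, its field names, and the rule that the target equals the branch address plus the displacement are all assumptions made for the example (many ISAs compute the target relative to the next instruction instead).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Branch:
    pc: int            # address of the branch instruction (hypothetical field)
    displacement: int  # signed offset to the branch target (hypothetical field)


def loop_start_for(branch: Branch) -> Optional[int]:
    """Return the candidate loop start address, or None for a forward branch.

    Assumes target = pc + displacement; a negative displacement marks a
    backward branch, whose target is taken to be the start of the loop.
    """
    if branch.displacement < 0:
        return branch.pc + branch.displacement
    return None
```

A forward branch yields no candidate, which mirrors the text: only a branch with a backward displacement target is treated as identifying a loop.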
Understanding loops and how they are formed is necessary for developing an approach to identifying them. FIG. 1a illustrates a backwards branch instruction target used to identify a loop. For an ISA machine that executes two delay slots following a branch, the “bct” loop body 20 in this example consists of 13 instructions, I0 to I12, because the execution flow includes instructions I11 and I12 as part of the loop. The instruction I0, labeled “Loop_Start”, and the instruction I12, labeled “Loop_End”, can be taken to define the range parameters of the “bct” loop body. For an ISA machine that does not implement delay slots, the “bct” loop body consists of only 11 instructions, with Loop_Start being instruction I0 and Loop_End being instruction I10, which is the “bct” instruction itself.
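The instruction counts in this example follow from simple index arithmetic, sketched below. The function name and its parameters are hypothetical and are introduced only to reproduce the two cases in the text, with instructions numbered by index (I0 is index 0, the “bct” branch is index 10).

```python
def loop_range(start_index: int, branch_index: int, delay_slots: int):
    """Return (Loop_Start, Loop_End, length) for a backward-branch loop.

    Delay-slot instructions that follow the branch execute on every
    iteration, so they extend Loop_End past the branch itself.
    """
    if branch_index < start_index:
        raise ValueError("branch must not precede the loop start")
    end_index = branch_index + delay_slots
    length = end_index - start_index + 1
    return start_index, end_index, length


# Two delay slots: I0..I12, 13 instructions.
print(loop_range(0, 10, 2))
# No delay slots: I0..I10, 11 instructions.
print(loop_range(0, 10, 0))
```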
There have been numerous attempts at reducing energy consumption in cache memory systems through the use of small instruction buffers placed between an execution pipe and a larger primary cache memory. Such attempts were described by N. Bellas, I. Hajj and C. Polychronopoulos, in “A New Scheme for I-Cache Energy Reduction In High-Performance Processors,” Power Driven Microarchitecture Workshop, held in conjunction with ISCA 98, Barcelona, Spain, Jun. 28, 1998; K. Ghose and M. B. Kamble, in “Energy Efficient Cache Organizations for Superscalar Processors,” Power Driven Microarchitecture Workshop, held in conjunction with ISCA 98, Barcelona, Spain, Jun. 28, 1998; J. Kin, M. Gupta and W. Mangione-Smith, in “The Filter Cache: An Energy Efficient Memory Structure,” Proc. Int'l Symposium on Microarchitecture, pp. 184–193, Dec. 1997; and C. Su and A. M. Despain, in “Cache Design Tradeoffs for Power and Performance Optimization: A Case Study,” Proc. Int'l Symposium on Low Power Design, pp. 63–68, 1995.
Energy consumption can be reduced by not accessing the primary cache memory when the requested instructions are found in the instruction buffer. This general approach, however, may incur processor cycle penalties on instruction buffer misses. Even where the primary memory is accessed in the same cycle, and only on a buffer miss, the approach works at the expense of longer cycle times. Moreover, these approaches do not explicitly capture frequently and repetitively executed loops.
Lee et al. identified this problem and proposed a dynamically controlled hardware loop cache solution. See L. H. Lee, W. Moyer and J. Arends, “Instruction Fetch Energy Reduction Using Loop Caches for Embedded Applications with Small Tight Loops,” Proc. Int'l Symp on Low Power Design, 1999; and L. H. Lee, J. Scott, W. Moyer and J. Arends, U.S. patent pending, “Data Processor System”. In that solution, a loop cache controller fills the cache after detecting a simple loop, defined by any short backwards branch instruction. The approach monitors the decoded instruction stream for branch instructions that have short target displacements. The fill occurs by copying the dynamic instruction stream into the loop cache. Once the fill is complete and the short backwards branch is taken again, the loop cache controller seamlessly switches instruction fetching to the loop cache, thereby logically shutting down the primary cache memory unit. The controller conservatively aborts the fill, or exits loop cache fetching, if a jump is taken within the loop, since a taken jump means either that the entire loop may not be filled or that the system is actually leaving the loop. The loop cache controller uses a simple wraparound counter for indexing into the cache and requires no tag.
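The fill-then-switch behavior described above can be pictured as a three-state controller. The sketch below is a simplified illustration written for this discussion and is not taken from Lee et al.: the class, state, and method names are hypothetical, addresses are counted in instruction units, and the wraparound index counter is omitted for brevity.

```python
from enum import Enum


class State(Enum):
    IDLE = 0    # fetching from the primary cache, no loop detected
    FILL = 1    # fetching from the primary cache while copying the loop
    ACTIVE = 2  # fetching from the loop cache


class LoopCacheController:
    def __init__(self, capacity: int):
        self.capacity = capacity  # loop cache size in instructions (assumed unit)
        self.state = State.IDLE
        self.loop_start = None
        self.loop_end = None

    def step(self, pc: int, taken_target=None) -> str:
        """Process one executed instruction and return its fetch source.

        taken_target is the target address if this instruction is a taken
        branch or jump, otherwise None.
        """
        source = "loop_cache" if self.state is State.ACTIVE else "primary"
        if self.state is State.IDLE:
            # A taken short backward branch starts a fill, provided the loop fits.
            if taken_target is not None and 0 <= pc - taken_target < self.capacity:
                self.loop_start, self.loop_end = taken_target, pc
                self.state = State.FILL
        elif pc == self.loop_end:
            # Loop branch taken again: switch to (or stay in) the loop cache.
            # Not taken: the loop is exiting, so return to the primary cache.
            if taken_target == self.loop_start:
                self.state = State.ACTIVE
            else:
                self.state = State.IDLE
        elif taken_target is not None:
            # Any other taken jump conservatively aborts the fill or the fetching.
            self.state = State.IDLE
        return source
```

Under these assumptions, a three-instruction loop executed four times is fetched from the primary cache for its first two iterations (detection, then fill) and from the loop cache for the last two. The conservative abort on any internal taken jump is also visible in the last branch of `step`.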
While the loop cache solution reduces the average instruction fetch power, it suffers from some serious limitations. Some of the issues causing these limitations include the following:
    Some loops are not formed by a single backwards branch. In fact, loops may consist of several backwards branches, some pointing to the loop start.
    Many loops contain internal branches, like the “If-Then-Else” construct.
    The loop cache controller is not capable of capturing and handling multiple nested loops of any order.
    The loop cache approach is only able to capture loops that fit fully in the loop cache.
    Any loop that is structurally longer than the size of the loop cache cannot be captured.