Microprocessors (referred to herein simply as “processors”) consume energy/power during their operation. It is advantageous to reduce the amount of energy consumed, particularly in the case of devices that run off limited power supplies.
Various factors affect the amount of energy that a processor consumes. For example, the frequency at which the processor operates, the voltage level that powers the processor, and the load capacitances all affect processor energy consumption. Reducing the frequency of the processor or the supply voltage may decrease processor energy consumption; however, doing so may also adversely affect the performance of the processor.
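The relationship among these factors can be illustrated with the standard first-order CMOS switching-power approximation, P = a * C * V^2 * f. The sketch below is illustrative only and is not taken from this document: the function names, the activity factor, and the numeric values are assumptions, and leakage and short-circuit power are ignored.

```python
def dynamic_power(activity, capacitance_f, voltage_v, frequency_hz):
    """First-order switching power in watts: P = a * C * V^2 * f.

    Assumption: leakage and short-circuit power are ignored.
    """
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz


def task_energy(activity, capacitance_f, voltage_v, frequency_hz, cycles):
    """Energy in joules to execute a fixed number of cycles (power * run time)."""
    runtime_s = cycles / frequency_hz
    return dynamic_power(activity, capacitance_f, voltage_v, frequency_hz) * runtime_s


if __name__ == "__main__":
    # Hypothetical operating point, chosen only for illustration.
    base = dict(activity=0.2, capacitance_f=1e-9, voltage_v=1.2, frequency_hz=200e6)
    cycles = 1_000_000

    print("baseline power (W):", dynamic_power(**base))
    print("baseline energy (J):", task_energy(**base, cycles=cycles))

    # Lower load capacitance reduces power without changing run time.
    low_cap = dict(base, capacitance_f=0.5e-9)
    print("half-capacitance energy (J):", task_energy(**low_cap, cycles=cycles))
```

In this simplified model, lowering the frequency alone lowers power but lengthens run time, whereas lowering the voltage or the load capacitance lowers the energy needed for a fixed amount of work.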
Other techniques to reduce energy consumption by, for example, reducing load capacitances, may include changes to processor architectures and processor circuits. Some other techniques rely on modifying the application itself, or any other system layer, to improve energy efficiency.
An executable is a version of a software application that has been compiled from a programming language into a processor instruction set.
A source-level compiler transforms source codes into a sequence of instructions based on a processor instruction set.
Incorporating energy awareness into a source-level compiler is a very complex process; it could also negatively affect performance or some other design objective. This is due to the interactions between optimizations that target different objectives.
Additionally, not all source codes might be available for source-level compilation of an application, and therefore not all codes could be optimized in order to improve energy efficiency or reduce power consumption.
Moreover, the modifications would need to be incorporated in all compilers that aim to optimize energy consumption.
For a given processor there are typically many compilers available provided by many different vendors. These compilers have their own advantages and disadvantages.
This makes incorporating energy optimizations into source-level compilers a challenging task.
Accordingly, if energy-awareness is introduced at the executable-level instead, by transforming the executable itself, significant practical advantages could be achieved.
The goal would be to optimize executables that may have been fully optimized previously with a source-level compiler targeting a design aspect such as performance.
In general, such an executable-level, re-compilation-based approach to energy optimization could keep optimizations performed during source-level compilation largely intact, could provide access to and optimize all program codes, including static and dynamic libraries, and could be used on existing executable codes that have been generated with a variety of different compilers, potentially from different vendors.
The executable file itself provides a convenient interface between, for example, performance-oriented optimizations and energy-oriented optimizations. Moreover, one energy optimization tool or layer could be used with many different source-level compilers; this does away with the need to retrofit all source compilers to optimize energy consumption.
Another aspect of this invention relates to scalability. Reductions in energy consumption should also be scalable, meaning that they are implemented such that processors having different architectures and instruction sets can easily be targeted. An executable-level re-compiling approach could provide such scalability.
This aspect may include an energy-aware program representation that encapsulates information reconstructed from executables in an abstract and retargetable manner to achieve scalability.
Another aspect of this invention relates to how application parallelism can be exploited in processors without significantly increasing load capacitances. If parallelism is achieved but with an increase in load capacitances, due to hardware complexities, the advantage of improved performance is offset by the resulting, much higher power consumption. Current state-of-the-art solutions to expose parallelism are not energy efficient. As such, many of today's low-power processors are single-issue.
Incorporating compiler information to enable issuing multiple instructions in parallel with a Very Long Instruction Word (VLIW) format has been shown to be detrimental to energy consumption. The term VLIW refers to the size of each instruction that is executed by a processor. This instruction is very long in comparison to the instruction word size utilized by most current processors.
Energy inefficiency in a VLIW processor is often attributed to the fixed wide-issue instruction set format; in many applications or program sequences there is not enough parallelism to fill all the instruction slots available in a VLIW instruction.
In fact, on average, there is typically very little instruction-level parallelism (ILP) available. As noted in the literature, typical applications have an average ILP level of less than two; thus, a 4-way VLIW would have on average at least two of its instruction slots unused. The unused slots would contribute to unnecessary instruction fetches and instruction-memory energy consumption. Higher ILP levels are fundamentally limited by true data dependencies. While speculation-based techniques can improve ILP levels, runtime speculation has an energy cost that typically offsets the benefits of the higher ILP.
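The slot-utilization arithmetic above can be sketched as follows. This is a simplified illustration under the assumption that each issue slot holds one instruction and that every slot is fetched whether or not it is filled; the function names are hypothetical.

```python
def unused_slots(issue_width, average_ilp):
    """Average number of empty slots per fixed-width VLIW instruction.

    Assumption: at most `issue_width` instructions can be issued together,
    so any ILP above the issue width cannot be exploited.
    """
    if issue_width <= 0 or average_ilp < 0:
        raise ValueError("issue_width must be positive and average_ilp non-negative")
    return issue_width - min(average_ilp, issue_width)


def wasted_fetch_fraction(issue_width, average_ilp):
    """Fraction of fetched slot bits that carry no useful instruction."""
    return unused_slots(issue_width, average_ilp) / issue_width


if __name__ == "__main__":
    # With an average ILP of two, half of a 4-way VLIW word is empty.
    print(unused_slots(4, 2.0))           # 2.0
    print(wasted_fetch_fraction(4, 2.0))  # 0.5
```

An average ILP below two makes the wasted fraction larger than one half for the same 4-way format, which is why a fixed wide-issue format spends instruction-memory energy on slots that do no work.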
As noted in the literature, the energy consumed by the instruction memory, together with fetch energy, is a significant fraction of a processor's energy consumption. For example, a state-of-the-art ARM10 processor has been reported to consume 27% of its total energy in the instruction memory system.
Other systems, such as superscalar processors, attempt to discover parallelism at runtime with significant hardware support. This support reduces the energy benefits obtained with parallel execution. Simply put, the performance benefits are more than offset by the increase in load capacitances, which increases power consumption.
In one aspect, the present invention solves the problem of exposing parallelism without requiring significant hardware support, such as is required in superscalar designs, and without having a fixed wide-issue instruction format, such as in VLIW designs. The solution is adaptive and compiler driven.
It works by incorporating control bits into the binary to issue instructions in parallel on only selected sequences of instructions, on compiler demand. The approach could be limited to sequences where there is enough parallelism and where parallel issue is considered or estimated to be good for energy efficiency. Thus, parallelism encoding can be limited to critical program paths, typically a relatively small fraction of the instructions in a binary, to improve energy efficiency.
In addition, if other compiler-managed optimizations are incorporated, such as for energy reduction purposes in the memory system, the added instruction bits for the various optimizations could be encapsulated into one or more new control instructions or control data. If the control is implemented with instructions, either regular instructions that are extensions to the regular ISA or co-processor instructions can be used.
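One way such a selection could be approached is sketched below. This is a hypothetical illustration, not the selection method of this document: the `Sequence` record, the ILP threshold, and the minimum share of dynamic instructions are all assumed for the example, and the ILP and execution counts are presumed to come from static analysis or profiling.

```python
from dataclasses import dataclass


@dataclass
class Sequence:
    """Hypothetical summary of one candidate instruction sequence."""
    name: str
    instructions: int      # static length of the sequence
    estimated_ilp: float   # assumed to come from dependence analysis
    execution_count: int   # assumed to come from profiling or static estimates


def select_parallel_sequences(sequences, min_ilp=2.0, min_dynamic_share=0.05):
    """Return names of sequences worth marking for parallel issue.

    A sequence is selected only if it has enough parallelism and accounts
    for a meaningful share of executed instructions (a critical path).
    Both thresholds are illustrative assumptions.
    """
    total = sum(s.instructions * s.execution_count for s in sequences)
    selected = []
    for s in sequences:
        share = (s.instructions * s.execution_count / total) if total else 0.0
        if s.estimated_ilp >= min_ilp and share >= min_dynamic_share:
            selected.append(s.name)
    return selected


if __name__ == "__main__":
    candidates = [
        Sequence("loop_body", 8, 2.5, 10_000),   # hot and parallel
        Sequence("init", 20, 1.2, 1),            # cold, little parallelism
        Sequence("error_path", 12, 3.0, 1),      # parallel but rarely executed
    ]
    print(select_parallel_sequences(candidates))
```

Only the hot, sufficiently parallel sequence is marked in this example, so the control bits are spent on a small fraction of the binary.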
In one embodiment, a solution to incorporate control information is to use the co-processor interface, that is, without requiring changes to the processor's regular instruction set. The inserted instructions may be folded, i.e., extracted early, in the prefetch queue before entering the processor pipeline, in a manner somewhat similar to zero-cycle branches in some architectures, e.g., ARM10. Such a solution removes pipeline bubbles that might otherwise be caused by the control instructions. An advantage, therefore, of using co-processor instructions is that one could easily add such control to existing processor cores. In one embodiment that is implemented within an ARM10 design, each such control instruction would enable the encoding of 21 bits of control information.
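To make the 21-bit budget concrete, the sketch below packs several control fields into a single 21-bit payload. The field names and widths are purely hypothetical and are not specified by this document; the sketch also does not model the actual ARM co-processor instruction encoding, only the payload that such an instruction might carry.

```python
# Hypothetical field layout, most significant field first. Names and widths
# are assumptions chosen only so that the widths sum to 21 bits.
FIELDS = (("issue_mask", 8), ("sequence_length", 5), ("cache_hint", 8))
CONTROL_BITS = 21
assert sum(width for _, width in FIELDS) == CONTROL_BITS


def pack_control(**values):
    """Pack named field values into one 21-bit control payload."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack_control(word):
    """Recover the named fields from a 21-bit control payload."""
    fields = {}
    for name, width in reversed(FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields


if __name__ == "__main__":
    payload = pack_control(issue_mask=0b10110000, sequence_length=12, cache_hint=3)
    print(f"{payload:021b}")
    print(unpack_control(payload))
```

Sharing one payload among several fields reflects the idea that a single control instruction can carry the bits for more than one optimization.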
Control information may be added per a sequence of instructions, such that the code dilution overhead of static control could be amortized across several different optimizations and for several instructions in the sequence. The sequence where optimization is applied can be determined with static analysis or other means such as profiling. A control instruction is typically inserted before the controlled instruction sequence.
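The amortization argument can be illustrated with simple arithmetic. The sketch below assumes, for illustration only, that each control instruction has the same size as a regular instruction and that its cost is split evenly among the optimizations it serves; the function names are hypothetical.

```python
def code_dilution(sequence_length, control_instructions=1):
    """Static code-size overhead of the control instructions for one sequence.

    Assumption: control and regular instructions have the same size.
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    return control_instructions / sequence_length


def overhead_per_optimization(sequence_length, optimizations, control_instructions=1):
    """Dilution attributed to each optimization sharing the control bits.

    Assumption: the cost is split evenly among the optimizations.
    """
    if optimizations <= 0:
        raise ValueError("optimizations must be positive")
    return code_dilution(sequence_length, control_instructions) / optimizations


if __name__ == "__main__":
    # One control instruction ahead of an 8-instruction sequence.
    print(code_dilution(8))                  # 0.125
    # The same instruction ahead of a 16-instruction sequence.
    print(code_dilution(16))                 # 0.0625
    # Shared by two optimizations over the 16-instruction sequence.
    print(overhead_per_optimization(16, 2))  # 0.03125
```

Longer controlled sequences and more optimizations per control instruction both reduce the overhead attributable to any single optimization.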
Energy increase due to the extra fetches can be minimized with compiler-driven instruction memory optimizations, for example, by almost always fetching control bits from more energy efficient smaller cache partitions, driven by compiler decisions. One aspect of this invention demonstrates such capability.
Due to the compiler-driven nature of the solution, the impact of control overhead can be kept very small. In our experience, in one embodiment, such control energy overhead could be kept below 1%-2% if instruction memory energy optimizations are included, while providing energy optimizations in the range of 30%-68% if several techniques in different processor domains/components are included.