This invention relates generally to reducing power consumption and improving performance of a microprocessor, and particularly to a method and apparatus for dynamic optimization of arithmetic operations using operand-value-based detection to initiate clock gating of execution units and the combination of multiple narrow-width operations for parallel execution.
Power consumption and performance of a central processing unit (CPU) depend upon several factors, including the number of bits processed and the number and type of operations performed. Software applications are another significant driver of power consumption, because their increased addressing needs drive processor designs to 64-bit words or larger. Typically, microprocessor designs only allow for full 64-bit addressing and equivalent operations; however, not all of the available 64 bits are necessary at the time of execution. In fact, the inventors of the present invention have discovered that over half of the integer operations performed by a 64-bit processor require only 16 bits of processing or less, leaving at least 48 bits unnecessary but still consuming power. The inventors have identified that every execution containing such unnecessary bits represents an opportunity to save power by disabling the unnecessary bits, or to improve microprocessor performance by using the unnecessary bits to exploit the full capability or bitwidth of the processor at execution time.
In the past, unsuccessful attempts to reduce power consumption have included compile-time, opcode-based clock gating for only a limited number of operations. In these systems a clock is used to gate latches preceding the execution unit in an attempt to reduce the number of bits at execution time for certain operations. An opcode such as an “add_byte” instruction is an example, because only the lower portion of an adder unit is required during execution.
One limitation in past systems is that opcode-based optimization is not available in many of these systems due to a lack of restricted-precision opcodes such as an “add_byte”. Even if these precision opcodes are available, they can only be used at compile time. As a result, opcode-based clock gating can only deliver marginal reductions in power consumption and no improvement in performance.
Another limitation in past systems is that narrow-width operands must be identified at compile time, and additional code must be generated to initiate subword parallelism, which is the use of multiple 8-bit or 16-bit operations by a 64-bit functional unit. Reliance on programmers to identify instructions with narrow-width operands at compile time severely limits the opportunities to improve microprocessor performance by means of unnecessary bit utilization. Further, compilers cannot automatically generate the instructions needed by the processor at compile time.
Accordingly, the present invention addresses and solves the long-felt need for significant reductions in microprocessor power consumption and increased performance by efficiently detecting unnecessary bits and disabling or utilizing these bits during execution. In stark contrast to these prior systems, the present invention uses operand-value-based techniques to significantly reduce power consumption or increase performance. Unlike systems in the past, operand-value-based techniques according to the present invention exploit every opportunity at execution time to disable unnecessary bits or to pack the unnecessary bit space, thereby reducing power consumption by as much as 60% in the execution unit, or increasing processor performance by up to 10%.
The present invention is directed to a method and apparatus to reduce power consumption and increase performance of a microprocessor by optimizing the processing of narrow-width data when the higher order or uppermost bits of an operand are not necessary for execution.
In one embodiment of the present invention, run-time circuitry is provided for the detection of unnecessary higher order bits of an operand so that they may be disabled by clock gating prior to every execution in the processor. In contrast to prior opcode based systems, the present invention is the first to identify and use operand-value-based detection at run-time during every execution. Further, the unnecessary bits of an operand can also be effectively exploited by subword parallelism (operation packing) without programmer intervention or compiler support.
In another embodiment of the present invention, clock gating is used when a bit detect unit detects the condition of a pre-determined number of unused bits in an operand. Upon detection, a condition detect signal is generated and received by gating logic which initiates a gated clock signal. Latching circuitry receives the gated clock signal and disables the pre-determined number of bits of the operand. In another embodiment of the present invention, the gated clock signal disables pre-charge circuitry in the microprocessor, instead of the latching circuitry, to prevent the execution of unnecessary higher order bits.
In one aspect of the present invention, the bit detect unit detects zeros in the pre-determined number of bits of the operand. In another aspect of the present invention, the bit detect unit detects a one in the bits of the operand. A combination of zeros and ones could also be the condition detected by the bit detect unit for a pre-determined number of bits of the operand.
In another aspect of the present invention, the pre-determined number of bits is the uppermost 48-bits of the operand. In another aspect of the present invention, the pre-determined number of bits is the uppermost 31-bits of the operand. A combination of both 48-bits and 31-bits could also be the pre-determined number of bits of the operand. Other aspects of the present invention could include other bitwidths (bitfields), or other combinations of bitfields, as the pre-determined number of bits of the operand.
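The detection conditions and bitfields described above can be illustrated with a short software model. This is only a sketch, not the circuitry of the invention: the function names detect_upper_bits and narrowest_field are hypothetical, and the sketch assumes that the detected conditions are "all zeros" and "all ones" in the uppermost bits of a 64-bit operand.

```python
WORD_BITS = 64
WORD_MASK = (1 << WORD_BITS) - 1


def detect_upper_bits(operand, upper_bits=48):
    """Model of a bit detect unit (hypothetical name).

    Returns "zeros" if the uppermost `upper_bits` bits are all 0,
    "ones" if they are all 1, and None otherwise.
    """
    low_bits = WORD_BITS - upper_bits
    upper = (operand & WORD_MASK) >> low_bits
    if upper == 0:
        return "zeros"
    if upper == (1 << upper_bits) - 1:
        return "ones"
    return None


def narrowest_field(operand, candidates=(48, 31)):
    """Try the widest gateable bitfield first (assumed policy).

    Returns (upper_bits, condition) for the first bitfield whose
    condition is detected, or None if no bitfield qualifies.
    """
    for upper_bits in sorted(candidates, reverse=True):
        condition = detect_upper_bits(operand, upper_bits)
        if condition is not None:
            return upper_bits, condition
    return None
```

For example, under these assumptions the value 17 reports "zeros" for the uppermost 48 bits, while a value with bit 20 set only qualifies for the 31-bit bitfield.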
In another embodiment of the present invention, clock gating also includes an integer functional unit (execution unit), which executes the operand and creates a result. This embodiment also includes a multiplexer, which transfers the condition of the pre-determined number of bits onto a pre-determined number of bits of the result. The pre-determined number of bits of the result are the same, higher order bits, as those detected by the bit detection logic.
In another aspect of the present invention, the bit detect unit detects the condition of a pre-determined number of bits of an operand contained in a register. In other aspects of the present invention, the bit detect unit detects the condition of a pre-determined number of bits of the result, or detects the condition of the pre-determined number of bits of the result in combination with detecting the condition of the pre-determined number of bits of the operand in the register. Other aspects of the invention could detect the condition of a pre-determined number of bits of an operand after register fetch of the operand, or during execution of the operand by condition detection of an intermediate carry-out bit within the 64-bit execution. Other aspects of the invention could include any combination of the above condition detects.
In another embodiment of the present invention, clock gating is performed when the condition detect signal signifies the same condition for each of the two operands being executed. In other embodiments of the present invention, clock gating is performed when the condition detect signal signifies different conditions for the two operands being executed.
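The clock-gated execution path described above, including the multiplexer that transfers the detected condition onto the upper bits of the result, can be modeled in software as follows. This is a minimal sketch under stated assumptions: gated_add and upper_is_zero are hypothetical names, only the all-zeros condition on both operands is modeled, and a carry out of the lower 16 bits simply falls back to the full-width path, which is a simplification chosen for the sketch rather than a mechanism stated in the text.

```python
WORD_MASK = (1 << 64) - 1
LOW_BITS = 16                      # bits that stay enabled when gated
LOW_MASK = (1 << LOW_BITS) - 1


def upper_is_zero(x):
    """True when the uppermost 48 bits of a 64-bit value are all zeros."""
    return (x & WORD_MASK) >> LOW_BITS == 0


def gated_add(a, b):
    """Return (result, gated) for a 64-bit add (hypothetical helper).

    When both operands have all-zero upper bits, only the lower 16 bits
    are "executed" and the detected zero condition is placed onto the
    upper 48 bits of the result. Otherwise the full-width add is used.
    """
    if upper_is_zero(a) and upper_is_zero(b):
        low_sum = (a & LOW_MASK) + (b & LOW_MASK)
        if low_sum >> LOW_BITS == 0:          # no carry out of bit 15
            upper_fill = 0                    # condition selected by the mux
            return (upper_fill << LOW_BITS) | low_sum, True
    # Assumed fallback: wide operands or a carry out use the full adder.
    return (a + b) & WORD_MASK, False
```

In this model gated_add(3, 4) takes the gated path, while gated_add(0xFFFF, 1) produces a carry out of the lower 16 bits and therefore uses the full-width path.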
In another embodiment of the present invention, microprocessor performance is improved by operation packing, when the bit detect unit detects the condition of the pre-determined number of bits of an operand and generates a condition detect signal. Issue logic receives the condition detect signal and initiates an operation packing signal. Multiplexers receive the operation packing signal and move data from a lowermost sub-word of the operand onto an upper sub-word of an execution source operand bus, creating a parallel sub-word operation.
In another aspect of the present invention, the parallel sub-word operation contains four sub-words, with each sub-word containing 16-bits. In other aspects of the invention, the parallel sub-word operation could contain three or two sub-words, with each sub-word containing various bitwidths.
In another embodiment of the present invention, operation packing additionally includes an integer functional unit, which executes the parallel sub-word operation and creates a sub-word result. This embodiment also includes a second set of multiplexers, which move data from upper sub-words of the sub-word result onto a lowermost sub-word of an operand result, and a multiplexer transferring a bit condition onto the pre-determined number of bits of the operand result.
In another embodiment of the present invention, operation packing is performed when the condition detect signal signifies the same condition for each of the pairs of operands executed in the parallel sub-word operation. In other embodiments of the present invention, operation packing is performed when the condition detect signal signifies different conditions for each of the pairs of operands executed in the parallel sub-word operation.
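The operation packing flow described above (moving narrow operands onto sub-words of the source operand bus, executing one parallel sub-word operation, and moving the sub-word results back) can be sketched as a software model. The names pack, subword_add, unpack, and packed_add are hypothetical. The sketch assumes four 16-bit sub-words, the all-zeros condition on every operand, and addition as the operation; a sum that overflows its 16-bit sub-word is truncated here, and handling of that case is not modeled.

```python
LANE_BITS = 16
LANES = 4
LANE_MASK = (1 << LANE_BITS) - 1
WORD_MASK = (1 << 64) - 1


def is_narrow(x):
    """True when the uppermost 48 bits of a 64-bit value are all zeros."""
    return (x & WORD_MASK) >> LANE_BITS == 0


def pack(values):
    """Place the lowermost sub-word of each value into its own lane."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MASK) << (i * LANE_BITS)
    return word


def subword_add(wa, wb):
    """Lane-wise add: carries are not propagated across lane boundaries."""
    result = 0
    for i in range(LANES):
        shift = i * LANE_BITS
        lane = ((wa >> shift) & LANE_MASK) + ((wb >> shift) & LANE_MASK)
        result |= (lane & LANE_MASK) << shift
    return result


def unpack(word):
    """Move each lane back to the lowermost sub-word of its own result.

    The upper 48 bits of every result are zero, matching the detected
    condition.
    """
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def packed_add(pairs):
    """Execute four narrow adds as one packed operation, or return None."""
    if len(pairs) != LANES:
        return None
    if not all(is_narrow(a) and is_narrow(b) for a, b in pairs):
        return None
    wa = pack([a for a, _ in pairs])
    wb = pack([b for _, b in pairs])
    return unpack(subword_add(wa, wb))
```

With these assumptions, packed_add([(1, 2), (300, 400), (65000, 535), (0, 0)]) yields the four results 3, 700, 65535, and 0 from a single modeled 64-bit operation, and any pair containing a wide operand prevents packing.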
In another embodiment of the present invention, thermal sensory data or programmed switching selects whether clock gating or operation packing is implemented for the operand execution.
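One way to picture this selection is the small model below. It is an illustrative sketch only: the function select_mode, the mode names, the 85.0 degree threshold, and the policy of preferring clock gating when the sensed temperature is high are all assumptions made for the example and are not taken from the text.

```python
MODES = ("clock_gating", "operation_packing")


def select_mode(temperature_c, threshold_c=85.0, programmed_mode=None):
    """Choose between clock gating and operation packing (hypothetical).

    A programmed switch, when present, overrides the thermal data.
    Otherwise a hot die favors saving power (clock gating) and a cool
    die favors performance (operation packing). Both the threshold and
    this policy are assumptions of the sketch.
    """
    if programmed_mode is not None:
        if programmed_mode not in MODES:
            raise ValueError(f"unknown mode: {programmed_mode!r}")
        return programmed_mode
    if temperature_c >= threshold_c:
        return "clock_gating"
    return "operation_packing"
```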