As semiconductor technology continues to inch closer to practical limitations in terms of increases in clock speed, architects are increasingly focusing on parallelism in processor architectures to obtain performance improvements. At the chip level, multiple processor cores are often disposed on the same chip, functioning in much the same manner as separate processor chips, or to some extent, as completely separate computers. In addition, even within cores, parallelism is employed through the use of multiple execution units that are specialized to handle certain types of operations. Pipelining is also employed in many instances so that certain operations that may take multiple clock cycles to perform are broken up into stages, enabling other operations to be started prior to completion of earlier operations. Multithreading is also employed to enable multiple instruction streams to be processed in parallel, enabling more overall work to be performed in any given clock cycle.
One area where parallelism continues to be exploited is that of execution units, e.g., fixed point or floating point execution units. Many floating point execution units, for example, are deeply pipelined. However, while pipelining can improve performance, pipelining is most efficient when the instructions processed by a pipeline are not dependent on one another, e.g., where a later instruction does not use the result of an earlier instruction. Whenever an instruction operates on the result of another instruction, typically the later instruction cannot enter the pipeline until the earlier instruction has calculated its result and exited the pipeline. The later instruction is said to be dependent on the earlier instruction, and the phenomenon of stalling the later instruction while it waits for the result of the earlier instruction is said to introduce “bubbles,” or cycles where no productive operations are being performed, into the pipeline.
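By way of illustration only, the effect of such dependencies can be sketched with a small Python model. The model is a simplification and is not taken from any particular design: it assumes a single in-order pipeline, at most one instruction issued per cycle, no result forwarding, and a result that becomes usable `depth` cycles after its producer issues. The names `issue_schedule` and `count_bubbles` are hypothetical.

```python
def issue_schedule(deps, depth):
    """Return the issue cycle of each instruction in a single in-order pipeline.

    deps[i] is the index of the earlier instruction whose result instruction i
    needs, or None if instruction i is independent.
    Assumption: a result is usable `depth` cycles after its producer issues.
    """
    issue = []
    for dep in deps:
        # In-order issue: no earlier than one cycle after the previous instruction.
        earliest = issue[-1] + 1 if issue else 0
        if dep is not None:
            # Stall until the producer has left the pipeline.
            earliest = max(earliest, issue[dep] + depth)
        issue.append(earliest)
    return issue


def count_bubbles(issue):
    """Count issue slots left empty between the first and last instruction."""
    return (issue[-1] - issue[0] + 1) - len(issue)


# Four independent instructions versus a chain in which each depends on the last.
independent = issue_schedule([None, None, None, None], depth=4)
chained = issue_schedule([None, 0, 1, 2], depth=4)
```

Under these assumptions, the four independent instructions issue on consecutive cycles with no bubbles, whereas the chain of four dependent instructions in a four-stage pipeline leaves nine empty issue slots.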
One technique that may be used to extract higher utilization from a pipelined execution unit and remove unused bubbles is to introduce multithreading. In this way, other threads are able to issue instructions into the unused slots in the pipeline, which drives up the utilization and hence the aggregate throughput. Another popular technique for increasing performance is to use a single instruction multiple data (SIMD) architecture, which is also referred to as “vectorizing” the data. In this manner, operations are performed on multiple data elements at the same time, and in response to the same SIMD instruction. A vector execution unit typically includes multiple processing lanes that handle different data points in a vector and perform similar operations on all of the data points at the same time. For example, for an architecture that relies on quad (4) word vectors, a vector execution unit may include four processing lanes that perform identical operations on the four words in each vector. The aforementioned techniques may also be combined, resulting in a multithreaded vector execution unit architecture that enables multiple threads to issue SIMD instructions to a vector execution unit to process “vectors” of data points at the same time. In addition, multiple execution units may be used to permit independent operations to be performed in parallel, further increasing overall performance.
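The manner in which multithreading fills otherwise unused slots can be sketched, again purely by way of example, with the following Python model. It assumes a worst case in which every instruction in a thread depends on that thread's previous instruction, a round-robin choice among ready threads, and a result that becomes usable `depth` cycles after issue. The function name `utilization` and the scheduling policy are hypothetical and are not drawn from any specific architecture.

```python
def utilization(num_threads, instrs_per_thread, depth):
    """Fraction of issue slots used when threads share one pipeline.

    Assumption: within a thread, each instruction depends on the previous one,
    so a thread may issue again only `depth` cycles after its last issue.
    """
    ready_at = [0] * num_threads              # cycle at which each thread may issue next
    remaining = [instrs_per_thread] * num_threads
    total = num_threads * instrs_per_thread
    issued = 0
    cycle = 0
    last_issue = 0
    next_thread = 0                           # round-robin starting point
    while issued < total:
        for k in range(num_threads):
            t = (next_thread + k) % num_threads
            if remaining[t] > 0 and ready_at[t] <= cycle:
                remaining[t] -= 1
                ready_at[t] = cycle + depth
                issued += 1
                last_issue = cycle
                next_thread = (t + 1) % num_threads
                break                         # at most one issue per cycle
        cycle += 1
    return total / (last_issue + 1)


single_thread = utilization(num_threads=1, instrs_per_thread=4, depth=4)
four_threads = utilization(num_threads=4, instrs_per_thread=4, depth=4)
```

In this model a single thread uses 4 of 13 issue slots, while four threads interleaved on a four-stage pipeline use every slot, since another thread is always ready whenever one thread is stalled.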
The various techniques for improving execution unit performance, however, do not come without a cost. Parallelism adds complexity, often requiring a greater number of logic gates, which increases both the size and the power consumption of such execution units. When these techniques are coupled with the general desire to increase performance through other techniques, such as increased switching frequency, the power consumption of complex, high performance execution units continues to increase, despite efforts to reduce such power consumption through process improvements. Excessive power consumption can present issues for portable or battery powered devices, but more typically, excessive power consumption presents issues for nearly all electronic circuits due to the generation of heat, which often requires elaborate cooling systems to ensure that a circuit does not overheat and fail.
Chip-wide control over power consumption is often used in electronic circuits such as those used in laptop computers or other portable devices, typically by throttling down the clock rate or frequency of the circuit to reduce power consumption and the generation of heat. In addition, power consumption may also be reduced in some instances by temporarily shutting down unused circuits on a chip. In all of these instances, however, throttling back the power consumption of the circuit usually results in lower performance in the chip. Furthermore, the circuit characteristics that define the overall power consumption of such circuits, e.g., cycle time, voltage, logic area, capacitance, etc., are most often designed to meet a maximum performance target.
One area where power is often needlessly consumed in an execution unit involves executing dependent instructions. In a typical pipelined execution unit, instructions are operated upon using a series of sequential steps, where the steps are implemented by “stages” of execution logic, so that each stage of the pipeline can be working on a step of a different instruction concurrently. For instance, a typical floating point execution pipeline might be performing a multiply operation for one instruction at the same time it is performing an add for an older instruction and rounding the final result for an even older instruction.
In a typical execution pipeline, the execution logic in each stage is specifically designed to complete the operations required to implement a step within the span of a single execution unit clock cycle, so that the output of the stage is available for use by the next stage in the pipeline when the execution unit advances to its next execution cycle. A set of latches is typically provided before the execution logic in each pipeline stage to latch the input operands (in the case of a first stage in the pipeline) or the results of the previous stage and provide that data as input to the associated execution logic. Therefore, during each cycle, the output of a previous stage is latched into the latches for the next stage, and the latched data is fed as input to the execution logic in the next stage. The output of the execution logic stabilizes prior to the next cycle, while the input data is held in the latches, and when the next cycle occurs, the output is latched by the latches in the next stage so that the data can be processed by the execution logic in the next stage.
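The latch-and-advance behavior described above can be approximated in software as follows. This is a minimal sketch and not a description of any actual circuit: it assumes a three-stage multiply, add, and round pipeline, represents each set of latches as one Python value (with `None` standing for the absence of a valid instruction), and treats each loop iteration as one clock cycle. All function names are hypothetical.

```python
def multiply(op):
    a, b, c = op
    return (a * b, c)          # pass the addend along to the next stage


def add(op):
    product, c = op
    return product + c


def round_result(value):
    return round(value)


def run_pipeline(operands, stages):
    """Clock operands through latched stages; return results in completion order."""
    latches = [None] * len(stages)     # latches[i] holds the input of stage i
    results = []
    # Feed one operand per cycle, then extra cycles to drain the pipeline.
    for incoming in list(operands) + [None] * len(stages):
        # During the cycle, each stage's logic settles on an output
        # computed from the data held in its input latches.
        outputs = [stage(held) if held is not None else None
                   for stage, held in zip(stages, latches)]
        if outputs[-1] is not None:
            results.append(outputs[-1])
        # At the next clock edge, every stage latches its predecessor's output.
        latches = [incoming] + outputs[:-1]
    return results


results = run_pipeline([(1.5, 2.0, 0.25), (2.0, 2.0, 0.75)],
                       [multiply, add, round_result])
```

Because all latches are updated together from the outputs computed in the preceding cycle, the second operand is being multiplied during the same cycle in which the first is being added, which mirrors the concurrency of stages in a hardware pipeline.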
As noted above, however, oftentimes some stages of a pipeline are not active due to the absence of a valid instruction at that stage of the pipeline. This may occur for a variety of reasons, but one dominant cause is inter-instruction dependencies, which can occur when a new instruction must be stalled from entering the execution pipeline because an older instruction is still working on producing a result to be used by the newer instruction. For these unused cycles, the associated circuitry/logic for that stage (for instance, an adder in a particular unused cycle) may still undergo a great deal of switching. Furthermore, from the standpoint of power consumption, it is often the latches in a pipeline stage that draw the most power. Consequently, the switching of latches and other circuitry in an invalid pipeline stage is often the source of unneeded power consumption.
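As a rough illustration of how many stage-cycles may be spent without a valid instruction, the following Python sketch counts idle stage-cycles for a given issue schedule. It assumes that every stage is clocked on every cycle whether or not it holds a valid instruction, and it uses the count of idle stage-cycles only as a proxy for unneeded switching; actual power depends on circuit and data activity that this model does not capture. The function name `idle_stage_cycles` and the example schedules are hypothetical.

```python
def idle_stage_cycles(issue_cycles, depth):
    """Count stage-cycles that hold no valid instruction.

    issue_cycles lists the cycle at which each instruction enters stage 0.
    The window runs from the first issue until the last instruction
    leaves the final stage.
    """
    occupied = set()
    for start in issue_cycles:
        for stage in range(depth):
            # An instruction issued at `start` occupies stage s in cycle start + s.
            occupied.add((start + stage, stage))
    first_cycle = issue_cycles[0]
    last_cycle = issue_cycles[-1] + depth - 1
    total = (last_cycle - first_cycle + 1) * depth
    return total - len(occupied)


# Four instructions in a four-stage pipeline:
# issued back to back, versus each stalled on its predecessor's result.
idle_independent = idle_stage_cycles([0, 1, 2, 3], depth=4)
idle_dependent = idle_stage_cycles([0, 4, 8, 12], depth=4)
```

In this example the back-to-back schedule leaves 12 of 28 stage-cycles idle (all attributable to filling and draining the pipeline), while the dependent schedule leaves 48 of 64 idle, so three quarters of the clocked stage-cycles perform no useful work.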
Therefore, a need continues to exist in the art for a manner of decreasing the power consumption of pipelined execution units, and in particular, for a manner of decreasing the power consumption of pipelined execution units resulting from inter-instruction dependencies.