As semiconductor technology continues to inch closer to practical limitations in terms of increases in clock speed, architects are increasingly focusing on parallelism in processor architectures to obtain performance improvements. At the chip level, multiple processing cores are often disposed on the same chip, functioning in much the same manner as separate processor chips, or to some extent, as completely separate computers. In addition, even within cores, parallelism is employed through the use of multiple execution units that are specialized to handle certain types of operations. Pipelining is also employed in many instances so that certain operations that may take multiple clock cycles to perform are broken up into stages, enabling other operations to be started prior to completion of earlier operations. Multithreading is also employed to enable multiple instruction streams to be processed in parallel, enabling more overall work to be performed in any given clock cycle.
These various techniques for improving processing unit performance, however, do not come without a cost. Parallelism adds complexity, often requiring a greater number of logic gates, which increases both the size and the power consumption of such processing units. When these techniques are coupled with the general desire to increase performance through other techniques, such as increased switching frequency, the power consumption of complex, high performance processing units continues to increase, despite efforts to reduce such power consumption through process improvements. Excessive power consumption can present issues for portable or battery powered devices, but more generally, it presents issues for nearly all electronic circuits due to the generation of heat, which often requires elaborate cooling systems to ensure that a circuit does not overheat and fail.
Due to these competing concerns, therefore, designers of microprocessors and other types of processing units often must balance the desire to incorporate sufficient logic circuitry to efficiently execute expected workloads with the need to minimize the amount of logic circuitry for power and cost concerns.
One area in which these competing concerns are often raised is that of non-pipelined instructions such as multiplies, divides, square roots, and other complicated math operations. Whereas most instructions in a processing unit are capable of being executed by pipelined execution logic, non-pipelined instructions typically must be executed serially, i.e., with only one instruction executed at a time rather than performing multiple stages of multiple instructions in parallel. It has been found that, in particular, the algorithms required to compute such complicated instructions are themselves complicated and typically must be broken down into iterative solutions. In addition, since a loop is often involved in the performance of such instructions, pipelining is often not feasible, as collisions would likely occur when the loop is attempted.
As a result, direct implementation of non-pipelined instructions in hardware often requires complex, dedicated execution logic involving relatively long latencies for completion. In fact, in many instances, the cost of implementing the instructions directly in hardware is too high from both a power and area point of view, resulting in many processor designs implementing non-pipelined instructions indirectly by running recursive loops through simpler and shorter sets of math operations that eventually produce the correct results. The recursive loops, however, require additional processor cycles to complete, thereby increasing the latency even beyond that of direct implementations.
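By way of illustration only, the iterative character of such instructions may be sketched in software using a conventional shift-and-subtract (restoring) division. The function name `restoring_divide` and the `width` parameter are hypothetical and chosen for this sketch; no particular processor design is being described. Each pass through the loop consumes the remainder produced by the preceding pass, which is why the same logic is reused serially rather than laid out as independent pipeline stages.

```python
def restoring_divide(dividend, divisor, width=32):
    """Unsigned shift-and-subtract division, one quotient bit per iteration.

    Returns (quotient, remainder, iterations). Illustrative sketch only;
    the names and the default width are assumptions, not taken from any
    specific hardware implementation.
    """
    if divisor <= 0:
        raise ValueError("divisor must be a positive integer")
    if not 0 <= dividend < (1 << width):
        raise ValueError("dividend must fit in the given width")

    remainder = 0
    quotient = 0
    iterations = 0
    # Walk the dividend from its most significant bit to its least.
    for bit in range(width - 1, -1, -1):
        # Bring down the next dividend bit into the partial remainder.
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        quotient <<= 1
        # The comparison depends on the remainder from the previous
        # iteration, so iterations cannot overlap one another.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
        iterations += 1
    return quotient, remainder, iterations
```

In this sketch the iteration count equals the operand width, so an 8-bit divide takes eight dependent steps and a 32-bit divide takes thirty-two, which is consistent with the relatively long latencies noted above.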
Regardless of whether non-pipelined instructions are implemented directly or indirectly, additional delays often result for subsequent instructions in an instruction stream. Thus, if an execution unit is currently executing a non-pipelined instruction, newer non-pipelined instructions typically must wait for the older instruction to finish before they can be issued. In some architectures, some non-pipelined instructions may even block any new instruction, even pipelined instructions, from being issued. In either case, this can cause serious performance degradation for many applications.
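The two stall behaviors described above may be modeled with a minimal sketch, assuming a simplified in-order processing unit that issues at most one instruction per cycle and contains a single instance of non-pipelined execution logic. The function `issue_cycles`, the `"P"`/`"N"` labels for pipelined and non-pipelined instructions, and the `latency` and `blocks_all` parameters are all hypothetical and are introduced only for illustration.

```python
def issue_cycles(stream, latency, blocks_all=False):
    """Return the cycle at which each instruction in `stream` issues.

    stream     -- sequence of "P" (pipelined) or "N" (non-pipelined)
    latency    -- cycles the non-pipelined logic stays busy per "N"
    blocks_all -- if True, a busy non-pipelined unit blocks every new
                  instruction; if False, only newer "N" instructions wait

    Assumed model: in-order, single issue, one non-pipelined unit.
    """
    cycle = 0        # earliest cycle at which the next instruction may issue
    busy_until = 0   # first cycle at which the non-pipelined unit is free
    issued = []
    for kind in stream:
        if kind not in ("P", "N"):
            raise ValueError("instruction kind must be 'P' or 'N'")
        # Wait for the non-pipelined unit if this instruction needs it,
        # or if the architecture blocks all issue while it is busy.
        if kind == "N" or blocks_all:
            cycle = max(cycle, busy_until)
        issued.append(cycle)
        if kind == "N":
            busy_until = cycle + latency
        cycle += 1   # in-order issue: the next instruction issues no earlier
    return issued
```

Under these assumptions, for the stream N, P, N, P with a latency of 10 cycles, the second non-pipelined instruction cannot issue until cycle 10 when only non-pipelined instructions are blocked, and the final instruction is delayed to cycle 21 when all instructions are blocked.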
One approach for addressing dependencies associated with non-pipelined instructions is to utilize multiple instances of non-pipelined execution logic within an execution unit of a processing unit to handle such instructions, such that if one instruction is being executed by one instance of the non-pipelined execution logic, subsequent instructions may be forwarded to other instances for execution. However, as noted above, the execution logic used to execute non-pipelined instructions is typically complex in nature, so incorporating multiple instances of such logic is usually not desirable, particularly where cost and power consumption are of concern.
Therefore, a significant need continues to exist in the art for a manner of quickly, efficiently and cost-effectively executing non-pipelined instructions in a processing unit.