As semiconductor technology continues to inch closer to practical limits on further increases in clock speed, architects are increasingly focusing on parallelism in processor architectures to obtain performance improvements. At the chip level, multiple processing cores are often disposed on the same chip, functioning in much the same manner as separate processor chips, or to some extent, as completely separate computers. In addition, even within cores, parallelism is employed through the use of multiple execution units that are specialized to handle certain types of operations. Pipelining is also employed in many instances so that certain operations that may take multiple clock cycles to perform are broken up into stages, enabling other operations to be started prior to completion of earlier operations. Multithreading is also employed to enable multiple instruction streams to be processed in parallel, enabling more overall work to be performed in any given clock cycle.
One area where parallelism continues to be exploited is in the area of execution units, e.g., fixed point or floating point execution units. Many floating point execution units, for example, are deeply pipelined. However, while pipelining can improve performance, it is most efficient when the instructions processed by a pipeline are not dependent on one another, e.g., where a later instruction does not use the result of an earlier instruction. Whenever an instruction operates on the result of another instruction, typically the later instruction cannot enter the pipeline until the earlier instruction has exited the pipeline and calculated its result. The later instruction is said to be dependent on the earlier instruction, and the phenomenon of stalling the later instruction while it waits for the result of the earlier instruction is said to introduce “bubbles,” or cycles where no productive operations are being performed, into the pipeline.
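By way of illustration only, the effect of dependencies on pipeline utilization may be sketched with a simplified model. The model below assumes a single-issue, in-order pipeline of a fixed depth with no result forwarding; the function name `count_bubbles`, the register names, and the representation of an instruction as a (destination, sources) pair are hypothetical and chosen solely for this example.

```python
def count_bubbles(program, depth):
    """Count idle issue cycles for an in-order, single-issue pipeline.

    program: list of (dest_register, source_registers) tuples.
    depth:   number of cycles between issue and result availability
             (an assumed, fixed pipeline depth).
    """
    ready_at = {}   # register -> first cycle at which its value can be read
    cycle = 0       # cycle at which the next instruction would like to issue
    bubbles = 0
    for dest, sources in program:
        # A dependent instruction must wait until every source is ready.
        earliest = max((ready_at.get(s, 0) for s in sources), default=0)
        if earliest > cycle:
            bubbles += earliest - cycle   # cycles with nothing issued
            cycle = earliest
        ready_at[dest] = cycle + depth
        cycle += 1
    return bubbles


independent = [("r1", ()), ("r2", ()), ("r3", ())]
dependent = [("r1", ()), ("r2", ("r1",)), ("r3", ("r2",))]

print(count_bubbles(independent, depth=4))
print(count_bubbles(dependent, depth=4))
```

In this model the three independent instructions issue in consecutive cycles, whereas each instruction in the dependent chain waits for its predecessor to leave the pipeline, leaving the intervening issue slots empty.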
One technique that may be used to extract higher utilization from a pipelined execution unit and remove unused bubbles is to introduce multi-threading. In this way, other threads are able to issue instructions into the unused slots in the pipeline, which drives up the utilization and hence the aggregate throughput. Another popular technique for increasing performance is to use a single instruction multiple data (SIMD) architecture, which is also referred to as ‘vectorizing’ the data. In this manner, operations are performed on multiple data elements at the same time, and in response to the same SIMD instruction. A SIMD or vector execution unit typically includes multiple processing lanes that handle different data points in a vector and perform similar operations on all of the data points at the same time. For example, for an architecture that relies on quad-word (four-word) vectors, a SIMD or vector execution unit may include four processing lanes that perform identical operations on the four words in each vector.
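The lane-wise behavior of a SIMD execution unit may be sketched in software as follows. This is a minimal model only, assuming four lanes to match the four-word vector example; the name `simd_execute` and the constant `LANES` are hypothetical, and the lanes are evaluated sequentially here whereas hardware lanes would operate at the same time.

```python
import operator

LANES = 4  # assumed lane count, matching a four-word vector


def simd_execute(op, a, b):
    """Apply the same two-operand operation to every lane of two vectors."""
    if len(a) != LANES or len(b) != LANES:
        raise ValueError("each operand must hold exactly %d words" % LANES)
    # One instruction (op) drives all lanes; each lane sees its own data.
    return [op(x, y) for x, y in zip(a, b)]


print(simd_execute(operator.add, [1, 2, 3, 4], [10, 20, 30, 40]))
print(simd_execute(operator.mul, [1, 2, 3, 4], [2, 2, 2, 2]))
```

A single call stands in for a single SIMD instruction: the operation is specified once and applied to each of the four data points.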
The aforementioned techniques may also be combined, resulting in a multi-threaded vector execution unit architecture that enables multiple threads to issue SIMD instructions to a SIMD execution unit to process “vectors” of data points at the same time.
In addition, it is also possible to employ multiple execution units in the same processor to provide additional parallelization. The multiple execution units may be specialized to handle different types of instructions, or may be similarly configured to process the same types of instructions.
Typically, a scheduling algorithm is utilized in connection with issue logic to ensure that each thread in a multi-threaded architecture is able to proceed at a reasonable rate, with the number of bubbles in the execution unit pipeline(s) kept at a minimum. In addition, when multiple execution units are used, the issuance of instructions to such execution units may be handled by the same issue unit, or alternatively by separate issue units.
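One simple scheduling policy of this kind is round-robin issue among threads, sketched below under stated assumptions: a single shared pipeline of fixed depth, one instruction issued per cycle, no result forwarding, and private registers for each thread. The function name `issue_round_robin` and the instruction representation are hypothetical, and actual issue logic may use a different policy.

```python
def issue_round_robin(threads, depth):
    """Return the number of bubbles when threads share one pipeline.

    threads: list of programs; each program is a list of
             (dest_register, source_registers) tuples.
    depth:   cycles between issue and result availability (assumed fixed).
    """
    n = len(threads)
    next_index = [0] * n                 # next instruction per thread
    ready_at = [{} for _ in range(n)]    # per-thread register readiness
    remaining = sum(len(t) for t in threads)
    last = n - 1                         # so that thread 0 is tried first
    cycle = 0
    bubbles = 0
    while remaining:
        issued = False
        # Try each thread once, starting after the last one that issued.
        for offset in range(1, n + 1):
            t = (last + offset) % n
            if next_index[t] >= len(threads[t]):
                continue                 # this thread has finished
            dest, sources = threads[t][next_index[t]]
            if all(ready_at[t].get(s, 0) <= cycle for s in sources):
                ready_at[t][dest] = cycle + depth
                next_index[t] += 1
                remaining -= 1
                last = t
                issued = True
                break
        if not issued:
            bubbles += 1                 # no thread had a ready instruction
        cycle += 1
    return bubbles


chain = [("r1", ()), ("r2", ("r1",)), ("r3", ("r2",))]
for count in (1, 2, 4):
    print(count, issue_round_robin([chain] * count, depth=4))
```

In this model, a single thread running a dependent chain leaves most issue slots empty, while a sufficient number of threads running the same chain can keep the pipeline full because another thread's instruction is available whenever one thread is waiting.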
Another technique that may be used to improve the performance of a processor is to employ a microcode unit or sequencer to automatically generate instructions for execution by an execution unit. A microcode unit or sequencer responds to commands, e.g., via dedicated instructions in an instruction set, and in response, outputs a sequence of instructions to be executed by the processor. In much the same way that a software procedure can be used to perform a repeatable sequence of steps in response to a procedure call in a software program, a microcode unit or sequencer can be triggered by a command or instruction to perform a repeatable operation.
Microcode units or sequencers are particularly useful for performing long latency operations, i.e., operations that take a relatively long time to perform, and in the case of pipelined execution units, often require multiple passes through an execution pipeline. Typically, a microcode unit or sequencer maps particular instructions in an instruction set architecture to a sequence of instructions so that, upon an issue unit receiving an instruction designated for the microcode unit (referred to herein as a microcode instruction), the issue unit will route the instruction to the microcode unit, which then temporarily stalls the issue unit and outputs the sequence of instructions to an execution unit.
The mapping of microcode instructions to sequences of instructions is typically maintained in a read only memory (ROM) or hard coded into the microcode unit. As a result, microcode units are typically custom designed for particular applications.
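The routing and expansion performed by a conventional microcode unit may be sketched as a lookup in a fixed table. The sketch below is illustrative only: the opcode names (`DIV`, `RECIP_EST`, `NEWTON_STEP`, `MUL`), the table contents, and the function name `issue` are hypothetical, and the read-only mapping merely stands in for a ROM or hard-coded logic.

```python
from types import MappingProxyType

# Read-only table standing in for the ROM: each microcode instruction maps
# to a fixed sequence of ordinary instructions (hypothetical opcodes).
MICROCODE_ROM = MappingProxyType({
    "DIV": ("RECIP_EST", "NEWTON_STEP", "NEWTON_STEP", "MUL"),
})


def issue(stream, rom=MICROCODE_ROM):
    """Return the instructions sent to the execution unit, in order."""
    to_execution_unit = []
    for opcode in stream:
        if opcode in rom:
            # Microcode instruction: normal issue is held off while the
            # microcode unit outputs its whole sequence.
            to_execution_unit.extend(rom[opcode])
        else:
            to_execution_unit.append(opcode)
    return to_execution_unit


print(issue(["ADD", "DIV", "SUB"]))
```

Because the table is fixed when the unit is built, supporting a different microcode instruction or correcting a sequence in this model would require a different table, which corresponds to a different ROM or different hard-coded logic in hardware.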
However, as computers and other programmable electronic devices continue to be integrated deeper and deeper into every aspect of society, and as programmable chips such as microprocessors, microcontrollers, Application Specific Integrated Circuits (ASICs) and the like continue to increase in complexity and power while costs decrease, the design, verification and testing of such programmable chips have become a significant contributor to the overall costs of such chips. For this reason, design reuse is employed whenever possible so that portions of a programmable chip, such as particular processing core designs, functional units, and other logic blocks, which have previously been designed, tested and verified, do not need to be recreated from scratch.
Nonetheless, increasing specialization of processor designs often limits the ability to reuse components in different designs. From the perspective of a microcode unit, for example, limits on the size of the unit often limit the number of microcode instructions that can be supported. Furthermore, instruction sets are often limited in size, so allocating a large number of the available instructions in an instruction set to a microcode unit limits the other types of instructions that can be supported. Consequently, conventional microcode units are typically relatively small and limited in scope, and optimized for handling a few specialized instruction sequences. Different processor designs intended for different applications, which might otherwise utilize very similar hardware circuitry, may nonetheless require different microcode units in order to support those different applications.
Another shortcoming of conventional microcode units is that since the units are designed to support a specific set of microcode instructions and instruction sequences, any faults in the designs are essentially fabricated as hard coded logic in the processor chips themselves, so there is typically no way to correct any such faults in any manufactured chips.
Therefore, a significant need continues to exist in the art for a manner of facilitating the development of application-specific programmable chips and electronic devices incorporating the same, particularly with regard to the microcode units therein.