1. Field of the Invention
This invention relates to the field of microprocessing. More specifically, the present invention relates to an apparatus and method for executing microprocessor instructions grouped in dependency chains.
2. Description of the Related Art
Most current processors belong to a category of processors called superscalar processors. This category is further divided into RISC (Reduced Instruction Set Computer) and CISC (Complex Instruction Set Computer) processors. These processors comprise multiple internal processing units with circuitry for dispatching multiple instructions to these processing units. Superscalar processors fetch a sequence of instructions in program order. In this architecture, each instruction is a single operation. The dispatching circuitry of superscalar processors allows the execution of multiple instructions in a single processor cycle using a queue of instructions, also referred to as a pipeline.
This architecture also includes circuitry that searches within the pipeline for instructions capable of being executed at the same time. Widening the pipeline makes it possible to execute a greater number of instructions per cycle. However, there is no guarantee that any given sequence of instructions can take advantage of this capability. Instructions are not independent of one another but are inter-related. These inter-relationships prevent some instructions from being executed until other instructions have been executed, thus preventing the use of the full capabilities of the processor to execute multiple instructions simultaneously.
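The effect of these inter-relationships on a wider pipeline can be illustrated with a minimal sketch. It is a simplified model and not a description of any particular processor: instructions are hypothetical (destination, sources) pairs, every instruction takes one cycle, issue is strictly in program order, and only read-after-write dependencies are considered. The function name `cycles_in_order` is an assumption made for illustration.

```python
def cycles_in_order(program, width):
    """Count cycles when up to `width` instructions issue per cycle, in order.

    `program` is a list of (dest, srcs) pairs. Assumed model: unit latency,
    and an instruction cannot issue in the same cycle as an earlier
    instruction that writes one of its source registers.
    """
    cycles = 0
    i = 0
    while i < len(program):
        written = set()  # registers written by instructions issued this cycle
        issued = 0
        while i < len(program) and issued < width:
            dest, srcs = program[i]
            if any(s in written for s in srcs):
                break  # depends on a same-cycle result: wait for next cycle
            written.add(dest)
            issued += 1
            i += 1
        cycles += 1
    return cycles


# Four mutually independent instructions fill a 4-wide pipeline in one cycle.
independent = [("r1", ("r0",)), ("r2", ("r0",)), ("r3", ("r0",)), ("r4", ("r0",))]
# A chain in which each instruction needs the previous result gains nothing.
chain = [("r1", ("r0",)), ("r2", ("r1",)), ("r3", ("r2",)), ("r4", ("r3",))]
```

Under these assumptions, the independent sequence completes in one cycle at a width of four, while the dependent chain still needs four cycles regardless of the width.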
Very Long Instruction Word (VLIW) processors constitute another category of processors where each instruction allows execution of multiple operations. Each operation from an instruction corresponds to an internal processing unit. VLIW processors are simpler than superscalar processors in that the dispatching of operations to multiple execution units is accomplished at the instruction level. Because a single VLIW instruction can specify multiple operations, VLIW processors are capable of reducing the number of instructions required for a program. However, in order for a VLIW processor to sustain an average number of cycles per instruction comparable to the rate of a superscalar processor, the operations specified by a VLIW instruction must be independent from one another. Otherwise, the VLIW instruction is similar to a sequential multiple-operation CISC instruction and the number of cycles per instruction goes up accordingly. The instruction word of a VLIW processor is normally quite long, taking many bits to encode multiple operations.
VLIW processors rely on software to pack the collection of operations representing a program into instructions. To do this, software uses a technique called compaction. Densely compacting operations into an instruction improves performance and encoding efficiency. During compaction, null operations are used in instructions where other operations cannot be used. Compaction serves as a limited form of out-of-order issue because operations are placed into instructions in many different orders. To compact the instructions, software must be able to detect independent operations, and this can restrict the processor architecture, the application, or both.
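Compaction as described above can be sketched with a simple greedy packer. This is an illustrative assumption, not the method of any particular compiler: all slots are treated as interchangeable, each register is written at most once so that only read-after-write dependencies matter, and the name `compact` and the `"nop"` marker are hypothetical.

```python
def compact(ops, slots):
    """Greedily pack operations into instruction words of `slots` slots each.

    `ops` is a list of (dest, srcs) pairs in program order. An operation is
    placed in the earliest word that comes after the words producing its
    sources and that still has a free slot. Unused slots become "nop".
    Returns a list of words; each word lists operation indices or "nop".
    """
    words = []       # each word is a list of operation indices
    placed_in = {}   # destination register -> index of the producing word
    for idx, (dest, srcs) in enumerate(ops):
        # The operation must follow every word that produces one of its sources.
        earliest = max((placed_in[s] + 1 for s in srcs if s in placed_in),
                       default=0)
        w = earliest
        while w < len(words) and len(words[w]) >= slots:
            w += 1  # word is full, try the next one
        if w == len(words):
            words.append([])
        words[w].append(idx)
        placed_in[dest] = w
    # Pad partially filled words with null operations.
    return [word + ["nop"] * (slots - len(word)) for word in words]


ops = [("a", ()), ("b", ("a",)), ("c", ()), ("d", ("b", "c"))]
packed = compact(ops, slots=2)
```

In this example, operation 2 is hoisted into the first word alongside operation 0, ahead of operation 1, which shows the limited out-of-order placement, and the remaining words are padded with null operations because no independent operation is available to fill them.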
Both superscalar and VLIW processors make use of the concept referred to as instruction level parallelism (ILP). ILP architectures allow parallel computation of the lowest level machine operations such as memory loads, stores, integer additions and floating-point multiplications within a single instruction cycle. ILP architectures contain multiple functional units and/or pipelined functional units but have a single program counter and operate on a single instruction stream. For ILP architectures, effective hardware usage requires that the single instruction stream be ordered such that, whenever possible, multiple low-level operations can be in execution simultaneously. High performance microprocessors of both categories have focused on exploiting ILP and thus independent space representations of instruction groups. Pipelined and superscalar processors use hardware to check for independence of an instruction with all previous in-flight instructions prior to issuing an instruction for execution. These processors can execute instructions out-of-order in their search for ILP. A dependent instruction does not block the execution of subsequent independent instructions. In VLIW processors, on the other hand, the compiler is relied upon to identify groups of independent operations; the processor executes the operations of a group in parallel and executes different operation groups in program order.
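The hardware independence check and out-of-order issue described above can be approximated by the following sketch. It is a simplified model under stated assumptions: unit latency, an unbounded instruction window, each register written at most once (so only read-after-write dependencies exist), and the oldest ready instructions issuing first. The name `cycles_out_of_order` is hypothetical.

```python
def cycles_out_of_order(program, width):
    """Count cycles when up to `width` ready instructions issue per cycle.

    `program` is a list of (dest, srcs) pairs in program order. An
    instruction is ready once every earlier instruction that produces one
    of its sources has completed; sources with no earlier producer are
    treated as already available.
    """
    producer = {dest: i for i, (dest, _) in enumerate(program)}

    def is_ready(i, done):
        for s in program[i][1]:
            p = producer.get(s)
            if p is not None and p < i and p not in done:
                return False  # still waiting on an in-flight producer
        return True

    done = set()                        # indices of completed instructions
    waiting = list(range(len(program)))
    cycles = 0
    while waiting:
        # The oldest waiting instruction is always ready, so progress is made.
        issue = [i for i in waiting if is_ready(i, done)][:width]
        done.update(issue)
        waiting = [i for i in waiting if i not in done]
        cycles += 1
    return cycles


# A three-instruction dependent chain followed by two independent instructions.
program = [("r1", ("r0",)), ("r2", ("r1",)), ("r3", ("r2",)),
           ("r4", ("r0",)), ("r5", ("r0",))]
```

With a width of two, the two trailing independent instructions issue alongside the chain rather than waiting behind it, so this model finishes in three cycles, which is the length of the dependency chain.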
Trace processors are a third type of processor that makes use of short dynamic instruction sequences called traces. Trace processors improve upon superscalar processors by recording the instruction dependencies detected within a trace upon the first visit to the trace and reusing this information on subsequent visits rather than recomputing the dependencies. A trace can be dynamically rescheduled to optimize ILP within the trace. However, limited ILP is exploited within a trace and multiple traces need to execute in parallel to exploit more ILP. Traces are likely to have multiple dependency links between them, requiring inter-trace communication and forcing serialization between traces.
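The reuse of recorded dependency information can be illustrated, in software terms only, by a small cache keyed on a trace identifier. This sketch is an analogy and not a description of trace processor hardware; the class name `TraceDependencyCache`, the use of a start address as the key, and the (destination, sources) encoding are all assumptions made for illustration.

```python
class TraceDependencyCache:
    """Record intra-trace dependencies on first visit and reuse them later."""

    def __init__(self):
        self._cache = {}        # trace identifier -> list of (producer, consumer)
        self.computations = 0   # how many times dependencies were computed

    def dependencies(self, trace_id, trace):
        """Return read-after-write edges for `trace`, computing them once.

        `trace` is a list of (dest, srcs) pairs. An edge (p, c) means that
        instruction c reads a register last written by instruction p.
        """
        if trace_id not in self._cache:
            self.computations += 1
            last_writer = {}
            edges = []
            for i, (dest, srcs) in enumerate(trace):
                for s in srcs:
                    if s in last_writer:
                        edges.append((last_writer[s], i))
                last_writer[dest] = i
            self._cache[trace_id] = edges
        return self._cache[trace_id]


cache = TraceDependencyCache()
trace = [("r1", ("r0",)), ("r2", ("r1",)), ("r3", ("r1", "r2"))]
first = cache.dependencies(0x400, trace)   # first visit: dependencies computed
second = cache.dependencies(0x400, trace)  # later visit: recorded result reused
```

The second lookup returns the recorded edges without recomputing them, which corresponds to the saving a trace processor obtains on revisiting a trace.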
It is difficult to further scale most current superscalar techniques to get significantly more ILP. Small increases in ILP can require inordinate hardware complexity. There are still significant amounts of “far-flung ILP” to be harvested. Thus, there is a need to develop complexity-efficient microarchitecture implementations to harvest this far-flung ILP.