The present invention relates generally to a superscalar processor and more particularly to the decode and routing of internal instructions to an asymmetrical dispatch bus (that is, one in which not every instruction can be decoded, dispatched, and executed in each and every slot) in such a processor.
Superscalar processors employ aggressive techniques to exploit instruction-level parallelism. Wide dispatch and issue paths place an upper bound on peak instruction throughput. Large issue buffers are used to maintain a window of instructions necessary for detecting parallelism, and a large pool of physical registers provides destinations for all of the in-flight instructions issued from the window beyond the dispatch boundary. To enable concurrent execution of instructions, the execution engine is composed of many parallel functional units. The fetch engine speculates past multiple branches to supply a continuous instruction stream to the decode, dispatch, and execution pipelines, thereby maintaining a large window of potentially executable instructions.
The trend in superscalar design is to scale these techniques: wider dispatch/issue, larger windows, more physical registers, more functional units, and deeper speculation. To maintain this trend, it is important to balance all parts of the processor since any bottlenecks diminish the benefit of aggressive techniques.
Instruction fetch performance depends on a number of factors. Instruction cache hit rate and branch prediction accuracy have been long recognized as important problems in fetch performance and are well-researched areas.
Modern microprocessors routinely use a plurality of mechanisms to improve their ability to efficiently fetch past branch instructions. These prediction mechanisms allow a processor to fetch beyond a branch instruction before the outcome of the branch is known. For example, some mechanisms allow a processor to speculatively fetch beyond a branch before the branch's target address has been computed. These techniques use run-time history to speculatively predict which instructions should be fetched and eliminate "dead" cycles that might normally be wasted. Even with these techniques, current microprocessors are limited in the number of instructions they can fetch during a clock cycle. As superscalar processors become more aggressive and attempt to execute many more instructions per cycle, they must also be able to fetch many more instructions per cycle.
High performance superscalar processor organizations divide naturally into an instruction fetch mechanism and an instruction execution mechanism. The fetch and execution mechanisms are separated by instruction issue buffer(s), for example, queues, reservation stations, etc. Conceptually, the instruction fetch mechanism acts as a "producer" which fetches, decodes, and places instructions into a reorder buffer. The instruction execution engine "prepares" instructions for completions. The completion engine is the "consumer" which removes instructions from the buffer and executes them, subject to data dependence and resource constraints. Control dependencies (branches and jumps) provide a feedback mechanism between the producer and consumer.
As instruction fetch, decode, and dispatch pipelines become wider, it becomes important to optimize the translation from a complex instruction set carrying a large amount of implicit information to an explicit instruction set that does not require the use of architected registers. This is particularly true in situations where the internal instructions do not have a direct one-to-one relationship to the external instructions. Such a mapping is typically adopted to facilitate faster cycle times, simplify design, or reduce the execution and/or register resources required for that instruction's execution. Additionally, not all instructions may be executed in an early dispatch slot due to constraints on read/write ports into the register files, constraints on the amount of logic that can be used for functional units, and other cost/benefit tradeoffs. As dispatch widths become wider, it becomes prohibitively expensive in both area and timing to implement all functions for all slots; it is therefore necessary to direct decoded instructions to the proper dispatch slots. However, for aggressively decomposed internal instruction sets, a mechanism of this kind must already exist to allow for one-to-one, one-to-two, and one-to-many types of instruction decoding and expansion. Accordingly, a need exists for allowing instructions to be routed to the proper slots without constricting the operation of the processor. The present invention addresses such a need.
A method and system for aligning internal operations (IOPs) for dispatch are disclosed. The method and system comprise conditionally asserting a predecode based on the particular dispatch slot in which an instruction is going to be placed. The method and system further include using the information related to the predecode to expand the instruction into at least one dummy operation and an IOP whenever the instruction would not be supported in the particular dispatch slot.
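The alignment idea summarized above can be illustrated with a small software model. The following is only a sketch under stated assumptions, not the disclosed hardware: the four-slot capability table `SLOT_SUPPORT`, the operation classes ("alu", "load", "mul", "branch"), and the helper names `predecode` and `expand` are all hypothetical and chosen for illustration. The sketch also assumes that the dummy operations are placed ahead of the real IOP so that the IOP shifts to the first later slot that supports it.

```python
from dataclasses import dataclass

# Hypothetical asymmetric dispatch bus: each slot supports only some
# operation classes. The table contents are assumptions for illustration.
SLOT_SUPPORT = {
    0: {"alu", "load"},
    1: {"alu", "load"},
    2: {"alu", "load", "mul"},
    3: {"alu", "branch"},
}
NUM_SLOTS = len(SLOT_SUPPORT)


@dataclass(frozen=True)
class Iop:
    """One internal operation occupying one dispatch slot."""
    kind: str
    is_dummy: bool = False


DUMMY = Iop("dummy", is_dummy=True)


def predecode(kind: str, slot: int) -> bool:
    """Conditionally assert the predecode: True when `slot` cannot take `kind`."""
    return kind not in SLOT_SUPPORT[slot]


def expand(kind: str, slot: int):
    """Expand an instruction destined for `slot` into the ops to dispatch.

    If the predecode is not asserted, the instruction decodes to its IOP alone.
    Otherwise it is expanded into one or more dummy operations followed by the
    IOP, so the IOP lands in a later slot that supports it. Returns None when
    no remaining slot in this group qualifies (handling of that case is an
    assumption of this sketch and is left to the caller).
    """
    if not predecode(kind, slot):
        return [Iop(kind)]
    for target in range(slot + 1, NUM_SLOTS):
        if not predecode(kind, target):
            return [DUMMY] * (target - slot) + [Iop(kind)]
    return None
```

In this model, a multiply arriving at slot 0 is expanded into two dummy operations and the multiply IOP, which then occupies slot 2. The expansion reuses the same one-to-many decode path that a decomposed internal instruction set already needs, which is the design point the summary emphasizes.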