Modern information handling systems (IHSs) often employ processors that include multiple stages that together form a pipeline. For example, a pipelined processor may include a fetch unit, a decoder, an instruction queue, a number of execution units, and a completion or writeback unit. The fetch unit fetches instructions from a memory cache or system memory to provide an instruction stream. The decoder decodes the fetched instructions into opcodes and operands. An instruction queue or dispatch unit sends decoded instructions to appropriate execution units for execution. A completion or writeback unit writes the completed results back to an appropriate processor register or memory. While one stage of the pipelined processor performs a task on one instruction, another stage performs a different task on another instruction. For example, the fetch unit fetches a first instruction from an instruction cache. Next, while the decoder decodes the fetched first instruction, the fetch unit fetches another instruction from the instruction cache. Breaking instruction handling into separate tasks or stages in this manner may significantly increase processor performance.
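The overlap between stages described above may be illustrated with a minimal sketch. It assumes an idealized five-stage pipeline in which every stage takes exactly one clock cycle and no stalls occur. The stage names and the helper functions `pipeline_schedule`, `total_cycles_pipelined`, and `total_cycles_unpipelined` are hypothetical and chosen for illustration only; they do not describe any particular processor.

```python
# Idealized model: one cycle per stage, no stalls (an assumption for illustration).
STAGES = ["fetch", "decode", "dispatch", "execute", "writeback"]


def pipeline_schedule(num_instructions, stages=STAGES):
    """Return {cycle: [(instruction_index, stage_name), ...]}."""
    schedule = {}
    for i in range(num_instructions):
        for s, stage in enumerate(stages):
            cycle = i + s  # instruction i occupies stage s during cycle i + s
            schedule.setdefault(cycle, []).append((i, stage))
    return schedule


def total_cycles_pipelined(num_instructions, num_stages=len(STAGES)):
    """Cycles needed when the stages overlap across instructions."""
    return num_stages + num_instructions - 1


def total_cycles_unpipelined(num_instructions, num_stages=len(STAGES)):
    """Cycles needed when each instruction finishes before the next starts."""
    return num_stages * num_instructions


if __name__ == "__main__":
    for cycle, work in sorted(pipeline_schedule(3).items()):
        print(cycle, work)
```

In this model, during cycle 1 the decoder handles instruction 0 while the fetch unit handles instruction 1, which corresponds to the fetch and decode example given above.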
Some instructions take longer to execute than others. A single-cycle instruction typically takes one clock cycle to execute in an execution stage of a pipeline. In contrast, a multi-cycle instruction takes multiple clock cycles to execute in the execution stage of the pipeline. For this reason, a single-cycle instruction exhibits relatively low latency, while a multi-cycle instruction exhibits relatively high latency in comparison. When a processor dispatches a high-latency instruction, such as a multiply instruction (e.g., “mullw” or “mulld”), to an execution unit, other instructions or operations that depend on the result of the high-latency instruction may stall in the pipeline until the high-latency instruction completes.
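The stall described above may be sketched with a simplified in-order issue model. The model assumes that at most one instruction issues per cycle and that an instruction issues only once all of its source registers are available. The four-cycle latency assigned to “mullw” here is an assumed figure for illustration, not a value taken from any specific processor, and the `Instr` type and `issue_cycles` function are hypothetical names.

```python
from collections import namedtuple

# Hypothetical instruction record: destination register, source registers, latency in cycles.
Instr = namedtuple("Instr", ["name", "dest", "srcs", "latency"])


def issue_cycles(program):
    """Return [(name, issue_cycle), ...] for a simple in-order, single-issue model."""
    ready_at = {}   # register -> first cycle in which its value is available
    next_issue = 0  # earliest cycle in which the next instruction may issue
    issued = []
    for ins in program:
        # Registers never written by the program are treated as ready at cycle 0.
        operands_ready = max((ready_at.get(r, 0) for r in ins.srcs), default=0)
        cycle = max(next_issue, operands_ready)  # wait (stall) for operands
        ready_at[ins.dest] = cycle + ins.latency
        issued.append((ins.name, cycle))
        next_issue = cycle + 1
    return issued


if __name__ == "__main__":
    MUL_LATENCY = 4  # assumed multi-cycle latency, for illustration only
    dependent = [
        Instr("mullw", "r3", ("r1", "r2"), MUL_LATENCY),
        Instr("add", "r5", ("r3", "r4"), 1),  # consumes the multiply result
    ]
    independent = [
        Instr("mullw", "r3", ("r1", "r2"), MUL_LATENCY),
        Instr("add", "r5", ("r6", "r4"), 1),  # does not use r3
    ]
    print(issue_cycles(dependent))
    print(issue_cycles(independent))
```

Under these assumptions, the dependent add waits until the multiply result becomes available, whereas the independent add issues in the cycle immediately after the multiply.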
To increase performance, some processors fuse or merge certain instructions together to form new instructions in the processor's instruction set. For example, the PowerPC architecture employs a floating-point multiply-add instruction that fuses an add instruction with a floating-point multiply instruction. However, adding new instructions to an existing architecture consumes additional opcode space. Such new instructions may also force all implementations of the processor to support the structures necessary for executing a fused operation. This is undesirable for architectures that attempt to span a product range from embedded applications at one end to high-end servers at the other. Fusing instructions near the beginning of the pipeline may also complicate both the processor's control hierarchy and its logic structures. This approach may further require that an instruction queue in the pipeline handle more operands per instruction than would otherwise be required.
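The operand burden of fusion may be sketched as follows. The example assumes a simplified representation in which each operation names one destination register and a tuple of source registers. The `Op` type and the `fuse_multiply_add` function are hypothetical and for illustration only; the sketch ignores the question of whether the intermediate product is needed by any other instruction.

```python
from collections import namedtuple

# Hypothetical operation record: opcode, destination register, tuple of source registers.
Op = namedtuple("Op", ["opcode", "dest", "srcs"])


def fuse_multiply_add(mul, add):
    """Return a fused multiply-add Op if `add` consumes the result of `mul`, else None.

    Assumes the product in mul.dest is not needed elsewhere (not checked here).
    """
    if mul.opcode != "fmul" or add.opcode != "fadd":
        return None
    if mul.dest not in add.srcs:
        return None  # the add does not depend on the multiply
    addends = [r for r in add.srcs if r != mul.dest]
    if len(addends) != 1:
        return None  # e.g. the add uses the product for both of its sources
    # The fused operation carries both multiplicands plus the remaining addend.
    return Op("fmadd", add.dest, mul.srcs + (addends[0],))


if __name__ == "__main__":
    mul = Op("fmul", "f3", ("f1", "f2"))
    add = Op("fadd", "f5", ("f3", "f4"))
    fused = fuse_multiply_add(mul, add)
    print(fused, len(fused.srcs))
```

In this sketch each unfused operation carries two source operands, while the fused operation carries three, which is the additional operand that a queue entry would have to accommodate.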
What is needed is a processor apparatus and methodology that addresses the instruction handling problems above.