1. Field of the Invention
This invention relates generally to processor-based systems, and, more particularly, to handling lane crossing instructions for an execution pipeline in a processor-based system.
2. Description of the Related Art
Processors are typically designed using a pipeline architecture that divides the processing of each computer instruction into a series of independent steps. For example, a processor pipeline can be divided into an instruction fetch stage during which instructions are retrieved from memories or caches, an instruction decode stage in which the instructions are decoded, an execution stage in which the decoded instructions are executed, and a write-back stage in which the information generated during execution is written back into memory. Each stage is typically separated by a set of flip flops or registers for storing the output of the stage so that it can be used as input to the next stage during a subsequent clock cycle. Pipelining can improve the efficiency of processors significantly but it requires a high degree of coordination because each stage is typically operating on a different instruction during each clock cycle. Sequential instructions are therefore being processed concurrently. Stalls, branch delays, timing errors, and the like can all disrupt a pipelined architecture and reduce its efficiency.
Instructions that have been decoded (e.g., in the instruction decode stage) are typically stored in a bank of registers before being provided to the execution stage in the next cycle. Execution units within the execution stage can be divided or partitioned into different units. For example, in pipelined systems that handle 128-bit operands or instructions, the execution stage can be partitioned into a low execution unit that handles 64 of the instruction bits and a high execution unit that handles the other 64 bits in the instruction. The low execution unit typically handles the 64 least significant bits and the high execution unit typically handles the 64 most significant bits in the register. However, in some classes of instructions, the mapping of the register locations to the execution stage inputs may differ from this default mapping. For example, in some cases the two halves are swapped so that the 64 least significant bits are handled by the high execution unit and the 64 most significant bits are handled by the low execution unit. These instructions are referred to as “lane crossing” instructions. Other types of swapping and/or shuffling of the instruction bits can also be performed for different types of instructions. For example, sometimes two 32-bit chunks of data within a 64-bit portion of a source instruction are swapped. For another example, in a two-source instruction, 64 bits of data from one source can be swapped with 64 bits from the other source before proceeding to the execution units.
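The mappings described above can be illustrated with a minimal software sketch that treats a 128-bit operand as an integer. This is only a model for explanation, not a description of any hardware implementation. All function and constant names are hypothetical, and the choice to exchange the low halves in the two-source case is an assumption, since the text does not say which 64 bits are exchanged.

```python
MASK64 = (1 << 64) - 1  # hypothetical constant: selects a 64-bit half
MASK32 = (1 << 32) - 1  # hypothetical constant: selects a 32-bit chunk


def default_lanes(operand):
    """Default mapping: return (low_unit_input, high_unit_input)."""
    low = operand & MASK64            # 64 least significant bits
    high = (operand >> 64) & MASK64   # 64 most significant bits
    return low, high


def lane_crossed(operand):
    """Lane crossing: the low unit receives the most significant half."""
    low, high = default_lanes(operand)
    return high, low


def swap_32bit_chunks(half):
    """Swap the two 32-bit chunks inside one 64-bit portion."""
    lo32 = half & MASK32
    hi32 = (half >> 32) & MASK32
    return (lo32 << 32) | hi32


def swap_low_halves(src_a, src_b):
    """Two-source swap. Assumption: the low 64 bits are the ones exchanged."""
    a_low, a_high = default_lanes(src_a)
    b_low, b_high = default_lanes(src_b)
    return (a_high << 64) | b_low, (b_high << 64) | a_low
```

In this model, a lane crossing instruction differs from a default instruction only in which half of the operand each execution unit receives; the operand itself is unchanged.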
Additional logic is needed to detect and perform the lane crossing and/or swapping on an instruction-by-instruction basis. Routing the source data to the appropriate execution units through this logic therefore puts timing pressure on the pipeline. One possible solution is to insert an additional pipeline stage between the instruction decode stage and the execution stage. The additional pipeline stage is responsible for performing the appropriate lane crossing and/or swapping when needed for particular instructions. However, the additional stage adds one cycle of latency to all operations, which is detrimental to those operations that do not need the lane crossing stage. The majority of 128-bit instructions do not require the lane crossing stage, and so the majority of the additional latency introduced by the lane crossing stage is unnecessary.
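The latency argument above can be made concrete with a small sketch. It assumes, as stated, that the additional stage costs exactly one cycle for every instruction. The function name and the instruction mix used in the usage note are hypothetical and chosen only for illustration; they are not measured data.

```python
def unnecessary_latency_fraction(needs_crossing):
    """Fraction of added cycles spent on instructions that need no crossing.

    needs_crossing: one boolean per instruction, True if the instruction
    requires lane crossing and/or swapping.
    """
    if not needs_crossing:
        raise ValueError("instruction stream must not be empty")
    added_cycles = len(needs_crossing)    # one extra cycle per instruction
    useful_cycles = sum(needs_crossing)   # cycles that performed a crossing
    return (added_cycles - useful_cycles) / added_cycles
```

For a hypothetical stream in which one instruction in ten is a lane crossing instruction, `unnecessary_latency_fraction([True] + [False] * 9)` evaluates to 0.9, meaning nine of the ten added cycles serve no purpose.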