1. Field of the Invention
This invention is related to the field of processors and, more specifically, to fetch and dispatch of instructions in processors.
2. Description of the Related Art
Superscalar processors attempt to achieve high performance by processing multiple instructions in parallel. For example, superscalar processors typically include multiple parallel execution units, each configured to independently execute operations. In order to provide enough instructions to effectively make use of the parallel execution units, superscalar processors attempt to rapidly fetch and decode multiple instructions and transmit them to the instruction scheduling mechanism.
Since operand dependencies between instructions need to be respected, the program order of the fetched and decoded instructions must be discernible so that dependency checking can be performed. For example, processors that implement register renaming often perform the dependency checking as part of the register renaming operation.
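As an illustration of why the program order must be discernible, the following is a minimal sketch of dependency checking folded into register renaming. It is a simplified software model, not a description of any particular processor: the instruction representation (a destination register and a list of source registers), the function name rename_group, and the physical register names p0, p1, etc. are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical instruction form: (destination register, source registers).
Instr = Tuple[str, List[str]]


def rename_group(
    group: List[Instr], rename_map: Dict[str, str], next_phys: int
) -> Tuple[List[Instr], int]:
    """Rename a group of instructions, visiting them in program order.

    Each source is looked up in the rename map before the instruction's
    own destination is recorded, so a younger instruction sees the
    destinations of older instructions but not the reverse.
    """
    renamed: List[Instr] = []
    for dest, srcs in group:
        # Sources with no mapping keep their architectural name.
        phys_srcs = [rename_map.get(s, s) for s in srcs]
        phys_dest = f"p{next_phys}"
        next_phys += 1
        rename_map[dest] = phys_dest
        renamed.append((phys_dest, phys_srcs))
    return renamed, next_phys


# Two instructions where the second reads r1, written by the first.
in_order = [("r1", ["r2", "r3"]), ("r4", ["r1", "r2"])]
result_in_order, _ = rename_group(in_order, {}, 0)

# The same two instructions with the order swapped: no dependency is seen.
swapped = [("r4", ["r1", "r2"]), ("r1", ["r2", "r3"])]
result_swapped, _ = rename_group(swapped, {}, 0)
```

In this sketch, the same pair of instructions yields a dependency on p0 in one order and none in the other, which is why the renaming logic needs to know which instruction is older.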
The program order of instructions transmitted in different clock cycles is typically apparent: instructions transmitted in earlier clock cycles are older than instructions transmitted in later clock cycles. An older instruction is prior to a younger instruction in the program order. The program order can be speculative, if branch prediction is implemented to direct fetching, for example.
Among instructions that are transmitted concurrently (e.g. in the same clock cycle), the program order is less apparent. To ensure that program order can be discerned, many processors assign a static program order among the parallel decoders. The decoders and other hardware can be viewed as slots to which instructions can be transmitted. The first instruction in program order is transmitted to slot 0, the second instruction in program order is transmitted to slot 1, etc. Thus, the program order of the concurrently transmitted instructions is apparent from the slots to which the instructions were transmitted.
FIG. 1 is an example of such operation for three slots (three concurrently transmitted instructions). Of course, any number of slots can be implemented. Also shown in FIG. 1 is an exemplary sequence of instructions I0 to I10, where the speculative program order of the instructions flows from top to bottom in FIG. 1 (e.g. I0 is first, I1 is second, etc., according to the speculative program order). For various reasons, fewer than three instructions are issued in some clock cycles (e.g. not enough instructions available from fetching, implementation-dependent constraints, etc.).
As illustrated in FIG. 1, the first instruction in program order in each transmission cycle (labeled D0 to D4 in FIG. 1) is always issued to slot 0. The second instruction in program order, if any, is always issued to slot 1 and the third instruction in program order, if any, is always issued to slot 2. Thus, the program order of the concurrently transmitted instructions is slot 0, then slot 1, and then slot 2.
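The static slot assignment described above can be sketched as a small software model. The grouping of I0 to I10 into the transmission cycles D0 to D4 used below is an assumption chosen for illustration (the actual grouping is defined by FIG. 1), and the function names are hypothetical.

```python
from typing import List, Optional

NUM_SLOTS = 3  # three slots, as in the example; any number could be used


def assign_static_slots(
    group: List[str], num_slots: int = NUM_SLOTS
) -> List[Optional[str]]:
    """Place the k-th instruction in program order into slot k.

    Unused slots are filled with None when fewer than num_slots
    instructions are transmitted in the cycle.
    """
    if len(group) > num_slots:
        raise ValueError("more instructions than slots")
    return list(group) + [None] * (num_slots - len(group))


def program_order(cycles: List[List[Optional[str]]]) -> List[str]:
    """Recover program order: earlier cycles first, then slot 0, 1, 2."""
    order: List[str] = []
    for slots in cycles:
        order.extend(instr for instr in slots if instr is not None)
    return order


# Assumed grouping per transmission cycle D0..D4 (illustrative only).
groups = [
    ["I0", "I1", "I2"],
    ["I3", "I4"],
    ["I5", "I6", "I7"],
    ["I8"],
    ["I9", "I10"],
]
cycles = [assign_static_slots(g) for g in groups]
recovered = program_order(cycles)
```

Because slot k always holds the k-th instruction of its cycle, reading the slots in ascending order within each cycle is sufficient to reconstruct the program order in this model.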
Implementing the instruction transmission illustrated in FIG. 1 typically includes a relatively complex rotation mechanism to align the first instruction in program order to slot 0. The rotation mechanism is dependent on the number of instructions previously transmitted and the location of the first instruction in the fetched instructions. Additionally, the resources associated with slot 0 are generally more highly utilized than those of other slots. If the slots are symmetrical in terms of resources, the resources assigned to slot 0 dictate the achievable parallelism of the processor as a whole. On the other hand, if more resources are assigned to slot 0 than the other slots (and more resources are assigned to slot 1 than slot 2), the implementation is more complex due to the differences between slots. Other proposed mechanisms permit the first instruction to be transmitted to a slot other than slot 0, but the remaining concurrently transmitted instructions are transmitted to higher-numbered slots. Thus, complex rotations are still used in such implementations.
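To make the dependence of the alignment on both inputs concrete, the following is a simplified sketch, assuming the fetched instructions sit in a flat buffer. The function name align_to_slot0 and its parameters are hypothetical, and a hardware rotator would typically be a multiplexer network rather than the indexing shown here.

```python
from typing import List, Optional


def align_to_slot0(
    fetched: List[str], start: int, already_sent: int, num_slots: int = 3
) -> List[Optional[str]]:
    """Align the next instruction in program order to slot 0.

    start:        position of the first instruction in the fetched group
    already_sent: instructions previously transmitted from this group
    The shift amount depends on both values, so it changes from cycle
    to cycle.
    """
    shift = start + already_sent
    slots: List[Optional[str]] = []
    for k in range(num_slots):
        idx = shift + k
        # Slots past the end of the fetched group stay empty.
        slots.append(fetched[idx] if idx < len(fetched) else None)
    return slots


# Assumed fetch group; the first instruction in program order is at index 1.
fetched = ["A", "B", "C", "D", "E", "F"]
first_cycle = align_to_slot0(fetched, start=1, already_sent=0)
second_cycle = align_to_slot0(fetched, start=1, already_sent=3)
```

In this model, every slot needs a path from nearly every buffer position, which gives some sense of why the alignment logic grows with the fetch width and the number of slots.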