1. Field of the Invention
This invention is related to the field of microprocessors and, more particularly, to the alignment of instructions to issue positions within microprocessors.
2. Description of the Related Art
Superscalar microprocessors achieve high performance by executing multiple instructions per clock cycle and by choosing the shortest possible clock cycle consistent with the design. As used herein, the term "clock cycle" refers to an interval of time accorded to various stages of an instruction processing pipeline within the microprocessor. Storage devices (e.g. registers and arrays) capture their values according to the clock cycle. For example, a storage device may capture a value according to a rising or falling edge of a clock signal defining the clock cycle. The storage device then stores the value until the subsequent rising or falling edge of the clock signal, respectively. The term "instruction processing pipeline" is used herein to refer to the logic circuits employed to process instructions in a pipelined fashion. Although the pipeline may be divided into any number of stages at which portions of instruction processing are performed, instruction processing generally comprises fetching the instruction, decoding the instruction, executing the instruction, and storing the execution results in the destination identified by the instruction.
Microprocessor designers often design their products in accordance with the x86 microprocessor architecture in order to take advantage of its widespread acceptance in the computer industry. Because the x86 microprocessor architecture is pervasive, many computer programs are written in accordance with the architecture. X86 compatible microprocessors may execute these computer programs, thereby becoming more attractive to computer system designers who desire x86-capable computer systems. Such computer systems are often well received within the industry due to the wide range of available computer programs.
The x86 architecture specifies a variable length instruction set (i.e. an instruction set in which various instructions employ differing numbers of bytes to specify that instruction). For example, the 80386 and later versions of x86 microprocessors employ between 1 and 15 bytes to specify a particular instruction. Instructions have an opcode, which may be 1-2 bytes, and additional bytes may be added to specify addressing modes, operands, and additional details regarding the instruction to be executed.
Unfortunately, having variable length instructions creates numerous problems for dispatching multiple instructions per clock cycle. Because the instructions have differing numbers of bytes, an instruction may begin at any memory address. By contrast, fixed length instructions typically begin at a known location. For example, a 4 byte fixed length instruction set has instructions which begin at 4 byte boundaries within memory (i.e. the two least significant bits are zeros for the memory addresses at which instructions begin).
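The alignment property of fixed length instructions can be illustrated with a short sketch. The function names below (`is_aligned_start`, `fixed_length_starts`) are hypothetical and chosen only for illustration; the sketch assumes an instruction size that is a power of two, as in the 4 byte example above.

```python
def is_aligned_start(address: int, instr_bytes: int = 4) -> bool:
    """Return True if `address` can begin a fixed length instruction.

    Assumes instr_bytes is a power of two, so the check reduces to
    testing that the low-order address bits are all zero.
    """
    return (address & (instr_bytes - 1)) == 0


def fixed_length_starts(base: int, count: int, instr_bytes: int = 4) -> list[int]:
    """List the start addresses of `count` consecutive fixed length instructions."""
    return [base + i * instr_bytes for i in range(count)]
```

With fixed length instructions, every start address is known without examining any instruction bytes; no comparable shortcut exists for a variable length instruction set.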
In order to locate multiple variable length instructions during a clock cycle, instruction bytes fetched by the microprocessor may be serially scanned to determine instruction boundaries and thereby locate instructions which may be concurrently dispatched. Serial scanning involves a large amount of logic, and typically a large number of cascaded logic levels. For high frequency (i.e. short clock cycle time) microprocessors, large numbers of cascaded logic levels may be deleterious to the performance of the microprocessor.
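The serial dependency in such a scan can be sketched in software. The example below is a simplified illustration only, not a model of any actual microprocessor: it assumes a hypothetical toy encoding in which the first byte of each instruction holds that instruction's total length (1 to 15 bytes). Real x86 length determination is considerably more involved. The names `toy_length` and `serial_scan` are likewise hypothetical.

```python
def toy_length(code: bytes, offset: int) -> int:
    """Length of the instruction starting at `offset`.

    Hypothetical toy encoding (NOT x86): the first byte of each
    instruction holds its total length in bytes, from 1 to 15.
    """
    length = code[offset]
    if not 1 <= length <= 15:
        raise ValueError(f"invalid length byte {length} at offset {offset}")
    return length


def serial_scan(code: bytes, max_instructions: int) -> list[tuple[int, int]]:
    """Locate up to `max_instructions` instructions as (offset, length) pairs.

    The start of each instruction is known only after the length of the
    previous one has been determined, so the loop is inherently serial.
    """
    found = []
    offset = 0
    while offset < len(code) and len(found) < max_instructions:
        length = toy_length(code, offset)
        if offset + length > len(code):
            break  # instruction extends past the fetched bytes
        found.append((offset, length))
        offset += length
    return found
```

Each loop iteration depends on the result of the previous iteration, which corresponds to the cascaded logic levels that a hardware implementation would require.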
Some microprocessor designs employ predecoding to identify the beginning and end of instructions as the instructions are stored into an instruction cache within the microprocessor. Even with predecoding, locating and dispatching multiple instructions per clock cycle is a complex and often clock-cycle-limiting operation. This process is further complicated by the need to identify the type of each instruction, in order to select an appropriate functional unit to which the instruction should be dispatched. Often, different functional units are designed to execute different types of instructions (i.e. different subsets of the instruction set). Once multiple instructions are detected and their types determined, the instructions to be dispatched are chosen by selecting the first instruction of each type to be conveyed to the corresponding functional unit.
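The two steps described above, locating instructions from predecode information and then selecting by type, can be sketched as follows. This is a minimal illustration under stated assumptions: predecode data is modeled as one start bit and one end bit per instruction byte, and each functional unit is assumed to accept exactly one instruction type. The function names and the type labels ("alu", "load", "branch") are hypothetical.

```python
def locate_from_predecode(start_bits: list[int], end_bits: list[int]) -> list[tuple[int, int]]:
    """Return (first_byte, last_byte) index pairs for each complete instruction.

    Assumes one start bit and one end bit per instruction byte.
    """
    instructions = []
    start = None
    for i, (s, e) in enumerate(zip(start_bits, end_bits)):
        if s:
            start = i
        if e and start is not None:
            instructions.append((start, i))
            start = None
    return instructions


def select_first_of_each_type(instr_types: list[str], unit_types: list[str]) -> dict[int, int]:
    """Map functional unit index -> index of the instruction it receives.

    Each unit takes the first not-yet-selected instruction (in program
    order) whose type matches the type that unit executes.
    """
    chosen = {}
    used = set()
    for unit, unit_type in enumerate(unit_types):
        for idx, instr_type in enumerate(instr_types):
            if instr_type == unit_type and idx not in used:
                chosen[unit] = idx
                used.add(idx)
                break
    return chosen
```

Even in this simplified form, selection requires comparing every candidate instruction's type against every functional unit, which suggests why type-based selection adds to the complexity of the dispatch logic.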
A method for simplifying the instruction selection process is to provide symmetrical functional units (i.e. functional units configured to execute the same subset of instructions). In this manner, instruction type determinations are not needed, since each functional unit executes the same subset of instructions. Such a solution may then select the first-identified instruction (in program order) to be dispatched to the first functional unit, the second-identified instruction (in program order) to be dispatched to the second functional unit, etc. Unfortunately, a selection method as described may lead to an imbalance in the number of instructions dispatched to each functional unit. Due to various restrictions (e.g. dispatch rules to simplify the hardware within the microprocessor, misses in the cache, or the lack of a sufficient number of instructions within the instruction bytes available for scanning) the number of instructions identified for dispatch during a clock cycle may be less than a maximum number of concurrently dispatchable instructions. The maximum number of concurrently dispatchable instructions may be related to the number of functional units (e.g. one instruction per functional unit), or may be determined by the number of instructions which can be identified concurrently within a certain clock frequency goal, etc. If, for example, one instruction is identified, the instruction may be dispatched to the first functional unit while no instructions are dispatched to other functional units. The first functional unit's resources (e.g. reservation stations) may then tend to fill more quickly than other functional units' resources. Even though other functional units may have available resources, the lack of resources in the first functional unit may cause undesirable stalls in the dispatch of instructions.
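The imbalance described above can be made concrete with a small sketch. It assumes symmetrical functional units and a fixed assignment in which the first-identified instruction of each clock cycle goes to the first unit, the second to the second unit, and so on. The function name `dispatch_in_order` and the per-cycle instruction counts are hypothetical illustration values, not measurements.

```python
def dispatch_in_order(identified_per_cycle: list[int], num_units: int) -> list[int]:
    """Count instructions sent to each functional unit.

    identified_per_cycle[i] is the number of instructions identified
    for dispatch in clock cycle i. In every cycle, instruction k (in
    program order) is sent to functional unit k.
    """
    counts = [0] * num_units
    for identified in identified_per_cycle:
        for unit in range(min(identified, num_units)):
            counts[unit] += 1
    return counts


# Four cycles identifying 1, 2, 3, and 1 instructions respectively,
# dispatched to three symmetrical functional units.
print(dispatch_in_order([1, 2, 3, 1], 3))  # prints [4, 2, 1]
```

In this example the first functional unit receives four instructions while the third receives only one, so the first unit's reservation stations would tend to fill first even though the other units have spare capacity.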