1. Field of the Invention
This invention relates to microprocessors configured to execute variable-length instruction sets, and in particular, to instruction decoders configured to decode and execute multiple bytes of instruction data in parallel.
2. Description of the Relevant Art
The number of software applications written for the x86 instruction set is immense. As a result, despite the introduction of newer and more advanced instruction sets, microprocessor designers have continued to design microprocessors capable of executing the x86 instruction set.
The x86 instruction set is relatively complex and is characterized by a plurality of variable-length instructions. This is in stark contrast with many RISC (reduced instruction set computer) formats which are fixed-length. A generic format illustrative of the x86 instruction set is shown in FIG. 1. As the figure illustrates, an x86 instruction consists of from one to five optional prefix bytes 102, followed by an operation code (opcode) field 104, an optional addressing mode (Mod R/M) byte 106, an optional scale-index-base (SIB) byte 108, an optional displacement field 110, and an optional immediate data field 112.
The opcode field 104 defines the basic operation for a particular instruction. The default operation of a particular opcode may be modified by one or more of the optional prefix bytes 102. For example, one of prefix bytes 102 may be used to override the default segment used in memory addressing or to instruct the processor to repeat a string operation a number of times.
Two prefix bytes are of particular importance. A prefix byte of 66(hex) represents the OPSIZ prefix, which reverses the default operand size for an instruction. A prefix byte of 67(hex) represents the ADRSIZ prefix, which reverses the default address size for an instruction. The default operand and address size of an instruction is determined by a bit (i.e., the D-bit or default bit) in the segment descriptor. If the default bit is set, then the default address and operand size is 32 bits. A prefix of 66(hex) or 67(hex) will override a set default bit, thereby allowing the instruction following the prefix to use a 16-bit operand or address, respectively. Similarly, if the default bit is not set, then the default address and operand size is 16 bits. A prefix of 66(hex) or 67(hex) will override the cleared default bit, thereby allowing the instruction following the prefix to use a 32-bit operand or address, respectively. Thus, the length of an x86 instruction depends not only upon how many prefix bytes precede the instruction, but also upon the presence of prefixes 66(hex) and 67(hex) and the value of the default bit in the instruction's segment descriptor.
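The interaction between the default bit and the two size prefixes may be illustrated by the following minimal sketch. The function name `effective_sizes` and its arguments are hypothetical and are introduced here for illustration only; the sketch models only the selection of operand and address size described above, not a complete decoder.

```python
OPSIZ_PREFIX = 0x66   # operand-size override prefix
ADRSIZ_PREFIX = 0x67  # address-size override prefix


def effective_sizes(default_bit_set, prefixes):
    """Return (operand_bits, address_bits) for a single instruction.

    default_bit_set: value of the D-bit in the segment descriptor.
    prefixes: iterable of the prefix byte values preceding the opcode.
    (Both parameter names are illustrative assumptions.)
    """
    default_size = 32 if default_bit_set else 16
    reversed_size = 16 if default_bit_set else 32
    prefixes = set(prefixes)
    # Each prefix reverses the default for its own attribute only.
    operand_bits = reversed_size if OPSIZ_PREFIX in prefixes else default_size
    address_bits = reversed_size if ADRSIZ_PREFIX in prefixes else default_size
    return operand_bits, address_bits
```

For example, with the default bit set, `effective_sizes(True, [0x66])` yields a 16-bit operand size and a 32-bit address size, since only the OPSIZ prefix is present.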
The opcode field 104 follows prefix bytes 102, if present, and may be one or two bytes in length. The addressing mode (Mod R/M) byte 106 specifies the registers used as well as memory addressing modes. The scale-index-base (SIB) byte 108 is used only in 32-bit base-relative addressing using scale and index factors. A base field within SIB byte 108 specifies which register contains the base value for the address calculation, and an index field within SIB byte 108 specifies which register contains the index value. A scale field within SIB byte 108 specifies the power of two by which the index value will be multiplied before being added, along with any displacement, to the base value. The next instruction field is a displacement field 110, which is optional and may be from one to four bytes in length. Displacement field 110 contains a constant used in address calculations. The optional immediate field 112, which may also be from one to four bytes in length, contains a constant used as an instruction operand. The shortest x86 instructions are only one byte long, and comprise a single opcode byte. The 80286 sets a maximum length for an instruction at 10 bytes, while the 80386 and 80486 both allow instruction lengths of up to 15 bytes.
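The address calculation performed with the SIB byte may be sketched as follows. The bit positions used (scale in bits 7-6, index in bits 5-3, base in bits 2-0) follow the standard x86 encoding and are not stated in the text above. The function names and the register-file list `regs` are hypothetical, and the sketch is a simplification: it ignores the special encodings in which no index or no base register is used.

```python
def decode_sib(sib):
    """Split a SIB byte into its (scale, index, base) fields."""
    scale = (sib >> 6) & 0x3   # power of two applied to the index value
    index = (sib >> 3) & 0x7   # register number holding the index value
    base = sib & 0x7           # register number holding the base value
    return scale, index, base


def effective_address(sib, regs, displacement=0):
    """Compute base + index * 2**scale + displacement, wrapped to 32 bits.

    regs is an assumed list of eight 32-bit register values indexed by
    register number.
    """
    scale, index, base = decode_sib(sib)
    return (regs[base] + (regs[index] << scale) + displacement) & 0xFFFFFFFF
```

For instance, a SIB byte of 0x99 decodes to a scale of 2, index register 3, and base register 1, so the index value is multiplied by four before being added to the base value and any displacement.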
The complexity of the x86 instruction set poses many difficulties in implementing high performance x86-compatible microprocessors. In particular, the variable length of x86 instructions, the nature of the prefix bytes, and reliance upon the segment descriptor make scanning, aligning, and decoding instructions difficult. Scanning refers to reading a group of instruction bytes (either from an instruction cache within the microprocessor or from an external memory) and determining the boundaries of instructions contained therein. Alignment refers to the process of masking off the undesired instruction bytes and shifting the desired instruction so that the first bit of the desired instruction is in the desired position. Decoding instructions typically involves identifying each field within a particular instruction, e.g., the prefix, opcode and operand fields. Decoding typically takes place after the instruction has been fetched from the instruction cache, scanned, and aligned.
One method for determining the boundaries of instructions involves generating a number of predecode bits for each instruction byte read from main memory. The process of generating these predecode bits is referred to as “predecoding”. The predecode bits provide information about the instruction byte they are associated with. For example, an asserted predecode start bit indicates that the associated instruction byte is the first byte of an instruction. Similarly, an asserted predecode end bit indicates that the associated instruction byte is the last byte of an instruction. Once the predecode bits for a particular instruction byte are calculated, they are stored together with the instruction byte in an instruction cache. When a “fetch” is performed, i.e., a number of instruction bytes are read from the instruction cache, the associated start and end bits are also read. The start and end bits may then be used to generate valid masks for the individual instructions within the fetch. A valid mask is a series of bits in which each bit corresponds to a particular instruction byte. Valid mask bits associated with the first byte of an instruction, the last byte of the instruction, and all bytes in between the first and last bytes of the instruction are asserted. All other valid mask bits are not asserted.
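The generation of valid masks from start and end bits may be illustrated by the following minimal sketch. The function name `valid_masks` and the list-of-bits representation are assumptions made for illustration; an actual implementation would be combinational hardware rather than software. Instructions whose start or end falls outside the fetch are simply skipped in this sketch.

```python
def valid_masks(start_bits, end_bits):
    """Return one valid mask per complete instruction in a fetch.

    start_bits[i] / end_bits[i] are 1 when byte i is the first / last byte
    of an instruction. Each returned mask is a list with 1 for every byte
    from the start byte through the end byte, and 0 elsewhere.
    """
    width = len(start_bits)
    masks = []
    start = None  # position of the most recent unmatched start bit
    for i, (s, e) in enumerate(zip(start_bits, end_bits)):
        if s:
            start = i
        if e and start is not None:
            masks.append([1 if start <= j <= i else 0 for j in range(width)])
            start = None
    return masks


def apply_mask(fetch_bytes, mask):
    """Keep only the bytes selected by a valid mask."""
    return [b for b, m in zip(fetch_bytes, mask) if m]
```

A single-byte instruction has both its start bit and its end bit asserted on the same byte, in which case the resulting mask has exactly one asserted bit.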
Turning now to FIG. 2, an exemplary valid mask is shown. The figure illustrates a portion of a fetch block 120 and its associated start and end bits 122 and 124. Assuming a valid mask 126 for instruction B 128 is to be generated, start and end bits 122 and 124 would be used to generate the mask. Valid mask 126 could then be used to mask off all bytes within fetch block 120 that are not part of instruction B 128. Once the boundaries of an instruction have been determined, each instruction is typically aligned and sent to a decoder.
Although the predecoding technique described above has been largely successful, in some cases almost fifty percent of the available storage space within the instruction cache array is allocated for the predecode bits. This accordingly limits the amount of storage within the instruction cache for instruction bytes and/or increases the cost of the processor due to increased die size. In addition, the process of aligning each individual instruction for decoding may further increase the overall time to execution for instructions. For these reasons, a method and apparatus for rapidly decoding instructions without the use of extensive predecode information is needed.
The problems outlined above may in part be solved by a microprocessor capable of decoding a plurality of instructions in parallel. This may be accomplished through the use of multiple combination decoder/execution units configured to operate in parallel. Advantageously, wide parallel decoding of x86 instructions may improve instruction throughput while reducing or eliminating the need to devote large portions of the instruction cache to predecode information.
In one embodiment, a microprocessor configured to decode multiple instructions in parallel may include an instruction cache, a plurality of parallel decode units, and a bus coupling the decode units. The instruction cache is configured to receive and store instruction bytes from a main system memory. Each of the plurality of decode units is configured to receive at least one instruction byte from the instruction cache during a particular clock cycle. Using the bus coupling the decode units, the decode units are each configured to communicate with one another to identify the boundaries of instructions formed by the instruction bytes. The decode units are configured to detect and execute simple instructions formed by the instruction bytes. The decode units may also be configured to forward complex instructions to a set of dedicated functional units for execution. The decode units may also be configured to allocate an entry in a reorder buffer for each instruction that is decoded (regardless of whether the decoded instruction is simple or complex). In some embodiments a simple instruction may be an instruction that does not have a dependency upon any instructions that have not yet executed. Similarly, in some embodiments simple instructions may be further restricted to instructions that do not alter the microprocessor's state (e.g., control and/or status words).
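The handling of simple and complex instructions described above may be modeled, at a very coarse level, by the following sketch. All names (`Instr`, `dispatch`, `pending`) are hypothetical, and the model is a software illustration under assumed data structures, not the disclosed hardware. It applies both of the optional restrictions mentioned above: an instruction is treated as simple only if none of its sources awaits an unexecuted instruction and it does not alter the microprocessor's state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    dest: str                          # register written by the instruction
    sources: frozenset = frozenset()   # registers read by the instruction
    alters_state: bool = False         # e.g., writes control/status words


def dispatch(instrs, pending=()):
    """Classify decoded instructions in program order.

    pending: registers whose producing instructions have not yet executed.
    Returns (reorder_buffer, executed, forwarded) as lists of names.
    """
    pending = set(pending)
    reorder_buffer, executed, forwarded = [], [], []
    for ins in instrs:
        # Every decoded instruction receives a reorder buffer entry.
        reorder_buffer.append(ins.name)
        simple = not (ins.sources & pending) and not ins.alters_state
        if simple:
            executed.append(ins.name)    # executed by the decode unit itself
        else:
            forwarded.append(ins.name)   # sent on for later execution
            pending.add(ins.dest)        # its result is not yet available
    return reorder_buffer, executed, forwarded
```

In this model, an instruction that reads the result of a forwarded instruction is itself forwarded, because its source remains pending.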
A method for predecoding instructions having varying address and operand sizes is also contemplated. In one embodiment, the method includes reading a plurality of instruction bytes from an instruction cache and routing each instruction byte to one of a plurality of decoders. The decoders detect instruction boundaries and execute simple instructions. Complex instructions are forwarded to reservation stations for eventual execution by functional units.
A computer system capable of rapidly predecoding a large number of instruction bytes is also contemplated. The computer system may comprise a microprocessor as described above, a CPU bus coupled to the microprocessor, and a communications device (e.g., a modem) coupled to the microprocessor via the CPU bus. In one embodiment, the computer system may have multiple microprocessors coupled to each other via the CPU bus.