1. Field of the Invention
This invention relates to caching instructions in microprocessors, and more particularly to caching instructions using basic blocks.
2. Description of the Relevant Art
In their continuing effort to improve the performance of microprocessors, designers have increased operating frequencies while also increasing the number of instructions executed per clock cycle. As used herein, the term "clock cycle" refers to an interval of time during which each pipeline stage of a microprocessor performs its intended functions. At the end of each clock cycle, the resulting values are moved to the next pipeline stage. These higher frequencies and increases in concurrently executed instructions have caused designers to seek methods for simplifying the tasks performed during each pipeline stage. One way designers have achieved the desired simplification is to limit the number and variation of instructions the microprocessor must execute. Microprocessors designed in this manner are referred to as Reduced Instruction Set Computer (RISC) processors.
Despite the apparent advantages of RISC architectures, the widespread acceptance of the x86 family of microprocessors has forced manufacturers to continue to develop higher operating frequency, multiple-issue microprocessors capable of executing the more complex x86 instruction set. Designers have had reasonable success in increasing the performance of x86 compatible microprocessors by aggressively implementing features such as pipelining, out-of-order execution, branch prediction, and issuing multiple instructions for concurrent execution. Such "superscalar" microprocessors achieve relatively high performance characteristics while advantageously maintaining backwards compatibility with the vast amount of existing software developed for previous microprocessor generations such as the 8086, 80286, 80386, and 80486.
As previously noted, the x86 instruction set is relatively complex and is characterized by a plurality of variable length instructions. A generic format illustrative of the x86 instruction set is shown in FIG. 1. As the figure illustrates, an x86 instruction consists of from one to five optional prefix bytes 202, followed by an operation code (opcode) field 204, an optional addressing mode (Mod R/M) byte 206, an optional scale-index-base (SIB) byte 208, an optional displacement field 210, and an optional immediate data field 212.
The opcode field 204 defines the basic operation for a particular instruction. The default operation of a particular opcode may be modified by one or more prefix bytes. For example, a prefix byte may be used to change the address or operand size for an instruction, to override the default segment used in memory addressing, or to instruct the processor to repeat the operation a number of times. The opcode field 204 follows the prefix bytes 202, if any, and may be one or two bytes in length. The addressing mode (Mod R/M) byte 206 specifies the registers used as well as memory addressing modes. The scale-index-base (SIB) byte 208 is used only in 32-bit base-relative addressing using scale and index factors. A base field of the SIB byte specifies which register contains the base value for the address calculation, and an index field specifies which register contains the index value. A scale field specifies the power of two by which the index value will be multiplied before being added, along with any displacement, to the base value. The next instruction field is the optional displacement field 210, which may be from one to four bytes in length. The displacement field 210 contains a constant used in address calculations. The optional immediate field 212, which may also be from one to four bytes in length, contains a constant used as an instruction operand. The shortest x86 instructions are only one byte long and comprise a single opcode byte. The 80286 sets a maximum length for an instruction at 10 bytes, while the 80386 and 80486 both allow instruction lengths of up to 15 bytes.
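By way of illustration only, the address calculation described above for scale-index-base addressing may be sketched as follows. The function name and its parameters are hypothetical and are introduced solely for this example; the sketch assumes that the scale field is a two-bit value and that the resulting address is truncated to 32 bits.

```python
def sib_effective_address(base_value, index_value, scale, displacement=0):
    """Return base + index * 2**scale + displacement, truncated to 32 bits.

    base_value and index_value stand for the contents of the registers
    selected by the base and index fields of the SIB byte (hypothetical
    parameter names). scale is the power of two applied to the index.
    """
    if scale not in (0, 1, 2, 3):
        raise ValueError("scale must be 0, 1, 2, or 3")
    # The index is multiplied by 2**scale before being added,
    # along with any displacement, to the base value.
    address = base_value + (index_value << scale) + displacement
    return address & 0xFFFFFFFF  # assumed 32-bit address wrap-around


# Example: base 0x1000, index 4 scaled by 2**2, displacement 8
example_address = sib_effective_address(0x1000, 4, 2, 8)
```

In this example the index contributes 16 bytes and the displacement 8 bytes, so the calculated address is 0x1018.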
The complexity of the x86 instruction set poses many difficulties in implementing high performance x86-compatible superscalar microprocessors. One particular difficulty arising from the variable-length nature of the x86 instruction set is fetching instructions from an instruction cache. The term "fetching" refers to reading an instruction from a cache (or if it is not in the cache, then from main memory) and routing the instruction to the appropriate decode and/or functional unit within the microprocessor for decoding and execution. Caches are low-latency, high-bandwidth memories either on the same monolithic chip as the microprocessor or on a separate chip mounted in close proximity to the microprocessor. Caches are typically structured as an array of storage locations, wherein each storage location is configured to store a predetermined number of instruction bytes. For example, a typical instruction cache may store 32 kilobytes and may be configured with individual storage locations each capable of storing 32 bytes. Each storage location is typically referred to as a "cache line".
Caches may be configured in a number of different ways. For example, many caches are set-associative, meaning that a particular line of instruction bytes may be stored in a number of different locations within the array. In a set-associative structure, the cache is configured into two parts, a data array and a tag array. Both arrays are two-dimensional and are organized into rows and columns. The column is typically referred to as the "way." Thus a four-way set-associative cache would be configured with four columns. A set-associative cache is accessed by specifying a row in the data array and then examining the tags in the corresponding row of the tag array. For example, when a prefetch unit searches its instruction cache for instructions residing at a particular address, a number of bits from the address are used as an "index" into the cache. The index selects a particular row within the data array and a corresponding row within the tag array. The number of address bits required for the index is thus determined by the number of rows configured into the cache. The tag addresses within the selected row are examined to determine if any match the requested address. If a match is found, the access is said to be a "hit" and the cache provides the associated instruction bytes from the data array. If a match is not found, the access is said to be a "miss." When a miss is detected, the prefetch unit causes the requested instruction bytes to be transferred from the memory system into the data array. The address associated with the instruction bytes is then stored in the tag array.
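The set-associative access described above may be sketched, under stated assumptions, as follows. The sketch combines the example sizes already given (a 32 kilobyte cache with 32-byte lines, arranged as four ways), which yields 256 rows and therefore an 8-bit index. The class and function names are hypothetical, the tag array here holds only the upper address bits rather than the full address, and the round-robin replacement of ways is an assumption, since no replacement policy is specified.

```python
LINE_BYTES = 32                                   # bytes per cache line
WAYS = 4                                          # columns per row
CACHE_BYTES = 32 * 1024                           # total capacity
ROWS = CACHE_BYTES // (LINE_BYTES * WAYS)         # 256 rows
OFFSET_BITS = LINE_BYTES.bit_length() - 1         # 5 bits select a byte in a line
INDEX_BITS = ROWS.bit_length() - 1                # 8 bits select a row


def split_address(address):
    """Split an address into (tag, index, offset) fields."""
    offset = address & (LINE_BYTES - 1)
    index = (address >> OFFSET_BITS) & (ROWS - 1)
    tag = address >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


class SetAssociativeCache:
    """Hypothetical model with a tag array and a data array of equal shape."""

    def __init__(self):
        self.tags = [[None] * WAYS for _ in range(ROWS)]
        self.data = [[None] * WAYS for _ in range(ROWS)]
        self.next_way = [0] * ROWS  # assumed round-robin replacement

    def lookup(self, address):
        """Return the cached line on a hit, or None on a miss."""
        tag, index, _ = split_address(address)
        for way in range(WAYS):
            if self.tags[index][way] == tag:
                return self.data[index][way]
        return None

    def fill(self, address, line):
        """Store a line fetched from memory and record its tag."""
        tag, index, _ = split_address(address)
        way = self.next_way[index]
        self.tags[index][way] = tag
        self.data[index][way] = line
        self.next_way[index] = (way + 1) % WAYS


cache = SetAssociativeCache()
first_access = cache.lookup(0x12345)    # miss: nothing cached yet
cache.fill(0x12345, bytes(LINE_BYTES))  # line transferred from memory
second_access = cache.lookup(0x12345)   # hit: tag matches in the selected row
```

With these sizes, address 0x12345 splits into tag 9, index 26, and offset 5. A cache with more rows would require more index bits and correspondingly fewer tag bits.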
Instruction bytes are read from main memory and then stored in the instruction cache until they are needed. In some embodiments, microprocessors may "predecode" the instruction bytes before they are stored in the instruction cache. Predecoding typically involves identifying the boundaries between consecutive instructions and possibly identifying the opcode bytes within the instruction. This predecode information is typically stored with the instruction bytes in the instruction cache. When instructions are fetched from the instruction cache, the predecode information is used to speed the alignment and decoding of the instructions.
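As a minimal sketch of the boundary identification performed during predecoding, the following example generates one "start bit" per instruction byte. It is illustrative only: the function names are hypothetical, and the length decoder recognizes only a handful of single-byte-opcode x86 instructions, assuming a 32-bit operand size and no prefix bytes. An actual predecoder must handle the full instruction format described above.

```python
def toy_instruction_length(code, pos):
    """Return the length of the instruction starting at code[pos].

    Covers only a tiny assumed subset: NOP (0x90), RET (0xC3),
    JMP rel8 (0xEB), JMP rel32 (0xE9), and MOV r32, imm32 (0xB8-0xBF).
    """
    opcode = code[pos]
    if opcode in (0x90, 0xC3):
        return 1                      # opcode only
    if opcode == 0xEB:
        return 2                      # opcode + 8-bit displacement
    if opcode == 0xE9 or 0xB8 <= opcode <= 0xBF:
        return 5                      # opcode + 32-bit displacement/immediate
    raise ValueError(f"opcode {opcode:#04x} is outside the toy subset")


def predecode(code):
    """Return a list with 1 at each byte that begins an instruction."""
    start_bits = [0] * len(code)
    pos = 0
    while pos < len(code):
        start_bits[pos] = 1
        pos += toy_instruction_length(code, pos)
    return start_bits


# NOP; MOV EAX, 1; JMP rel8; RET
sample = bytes([0x90, 0xB8, 0x01, 0x00, 0x00, 0x00, 0xEB, 0xFE, 0xC3])
sample_start_bits = predecode(sample)
```

For the sample bytes, the start bits are 1, 1, 0, 0, 0, 0, 1, 0, 1. Storing such bits alongside the instruction bytes allows later alignment logic to locate instruction boundaries without scanning the bytes serially.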
After a requested instruction address is output to main memory, a predetermined number of sequential instruction bytes beginning at the requested address are read from main memory, predecoded, and then conveyed to the instruction cache for storage. The instruction bytes are stored into storage locations ("cache lines") according to their address, typically without regard to what types of instructions are contained within the sequence of instruction bytes.
One drawback, however, of traditional caches is that they suffer from inefficiencies because branch instructions and branch targets do not naturally occur at cache line boundaries. This may deleteriously affect performance because taken branch instructions residing in the middle of a cache line may cause the end portion of the cache line to be discarded when it is fetched. Furthermore, branch targets that are not located at the start of a cache line may similarly cause the beginning portion of the cache line to be discarded. For example, upon receiving a fetch address, the typical instruction cache reads the entire corresponding cache line, and then selection logic (either internal or external to the instruction cache) selects the desired instructions and discards instruction bytes before the target address and/or after a branch instruction.
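The inefficiency described above may be quantified with a short sketch. The function below is hypothetical and assumes the 32-byte cache line of the earlier example; it counts how many bytes of a fetched line are actually used when the fetch address falls in the middle of the line and a taken branch ends before the end of the line.

```python
LINE_BYTES = 32  # assumed cache line size, matching the earlier example


def used_and_discarded(fetch_address, branch_end_address=None):
    """Return (used, discarded) byte counts for one fetched cache line.

    fetch_address is the target address at which fetching begins.
    branch_end_address is the address of the last byte of a taken branch
    in the same line, or None if no taken branch occurs in the line.
    """
    line_base = fetch_address & ~(LINE_BYTES - 1)
    first = fetch_address - line_base          # bytes before this are discarded
    if branch_end_address is None:
        last = LINE_BYTES - 1                  # line is used through its end
    else:
        last = branch_end_address - line_base  # bytes after this are discarded
        if not first <= last < LINE_BYTES:
            raise ValueError("branch must end in the same line, at or after the fetch address")
    used = last - first + 1
    return used, LINE_BYTES - used


# Fetch begins at offset 8 and a taken branch ends at offset 19 of the line.
used, discarded = used_and_discarded(0x1008, 0x1013)
```

In this example only 12 of the 32 bytes read from the cache are used, and the remaining 20 bytes are discarded by the selection logic.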
Beyond the discarding of fetched instruction bytes, a further performance penalty results from the alignment required before the instruction bytes can be properly decoded. While the cache-related problems highlighted above may occur in both RISC and x86 instruction sets, the problems are typically aggravated by the variable-length nature of x86 instructions.
Thus, a method and apparatus for more easily accessing instruction bytes stored in a cache is desired. In addition, a method that would improve the cache performance of both RISC microprocessors and x86 compatible microprocessors would be particularly desirable.