1. Field of the Invention
This invention relates to caching instructions in microprocessors, and more particularly to caching instructions using basic blocks.
2. Description of the Relevant Art
In their continuing effort to improve the performance of microprocessors, designers have increased operating frequencies while also increasing the number of instructions executed per clock cycle. As used herein, the term “clock cycle” refers to an interval of time during which each pipeline stage of a microprocessor performs its intended functions. At the end of each clock cycle, the resulting values are moved to the next pipeline stage. These higher frequencies and increases in concurrently executed instructions have caused designers to seek methods for simplifying the tasks performed during each pipeline stage. One way designers have achieved the desired simplification is to limit the number and variation of instructions the microprocessor executes. These microprocessors are referred to as Reduced Instruction Set Computer (RISC) processors.
Despite the apparent advantages of RISC architectures, the widespread acceptance of the x86 family of microprocessors has forced manufacturers to continue to develop higher operating frequency, multiple-issue microprocessors capable of executing the more complex x86 instruction set. Designers have had reasonable success in increasing the performance of x86 compatible microprocessors by aggressively implementing features such as pipelining, out-of-order execution, branch prediction, and issuing multiple instructions for concurrent execution. Such “superscalar” microprocessors achieve relatively high performance characteristics while advantageously maintaining backwards compatibility with the vast amount of existing software developed for previous microprocessor generations such as the 8086, 80286, 80386, and 80486.
As previously noted, the x86 instruction set is relatively complex and is characterized by a plurality of variable length instructions. A generic format illustrative of the x86 instruction set is shown in FIG. 1. As the figure illustrates, an x86 instruction consists of from one to five optional prefix bytes 202, followed by an operation code (opcode) field 204, an optional addressing mode (Mod R/M) byte 206, an optional scale-index-base (SIB) byte 208, an optional displacement field 210, and an optional immediate data field 212.
The opcode field 204 defines the basic operation for a particular instruction. The default operation of a particular opcode may be modified by one or more prefix bytes. For example, a prefix byte may be used to change the address or operand size for an instruction, to override the default segment used in memory addressing, or to instruct the processor to repeat the operation a number of times. The opcode field 204 follows the prefix bytes 202, if any, and may be one or two bytes in length. The addressing mode (Mod R/M) byte 206 specifies the registers used as well as memory addressing modes. The scale-index-base (SIB) byte 208 is used only in 32-bit base-relative addressing using scale and index factors. A base field of the SIB byte specifies which register contains the base value for the address calculation, and an index field specifies which register contains the index value. A scale field specifies the power of two by which the index value will be multiplied before being added, along with any displacement, to the base value. The next instruction field is the optional displacement field 210, which may be from one to four bytes in length. The displacement field 210 contains a constant used in address calculations. The optional immediate field 212, which may also be from one to four bytes in length, contains a constant used as an instruction operand. The shortest x86 instructions are only one byte long and comprise a single opcode byte. The 80286 sets a maximum length for an instruction at 10 bytes, while the 80386 and 80486 both allow instruction lengths of up to 15 bytes.
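The address calculation described for the SIB byte can be illustrated with a short sketch. The function name `effective_address` and the register values below are hypothetical and chosen only for illustration; the sketch assumes the 32-bit addressing mode mentioned above, so the result is wrapped to 32 bits.

```python
def effective_address(base: int, index: int, scale: int, displacement: int = 0) -> int:
    """Compute base + index * 2**scale + displacement, wrapped to 32 bits.

    `scale` is the value of the scale field, i.e. the power of two by
    which the index value is multiplied. The field is two bits wide,
    so only 0..3 are accepted here.
    """
    if not 0 <= scale <= 3:
        raise ValueError("scale field must be in the range 0..3")
    return (base + (index << scale) + displacement) & 0xFFFFFFFF


# Hypothetical register contents, for illustration only.
addr = effective_address(base=0x1000, index=3, scale=2, displacement=8)
print(hex(addr))  # 0x1000 + 3*4 + 8
```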
The complexity of the x86 instruction set poses many difficulties in implementing high performance x86-compatible superscalar microprocessors. One particular difficulty arising from the variable-length nature of the x86 instruction set is fetching instructions from an instruction cache. The term “fetching” refers to reading an instruction from a cache (or if it is not in the cache, then from main memory) and routing the instruction to the appropriate decode and/or functional unit within the microprocessor for decoding and execution. Caches are low-latency, high-bandwidth memories either on the same monolithic chip as the microprocessor or on a separate chip mounted in close proximity to the microprocessor. Caches are typically structured as an array of storage locations, wherein each storage location is configured to store a predetermined number of instruction bytes. For example, a typical instruction cache may store 32 kilobytes and may be configured with individual storage locations each capable of storing 32 bytes. Each storage location is typically referred to as a “cache line”.
Caches may be configured in a number of different ways. For example, many caches are set-associative, meaning that a particular line of instruction bytes may be stored in a number of different locations within the array. In a set-associative structure, the cache is configured into two parts, a data array and a tag array. Both arrays are two-dimensional and are organized into rows and columns. The column is typically referred to as the “way.” Thus a four-way set-associative cache would be configured with four columns. A set-associative cache is accessed by specifying a row in the data array and then examining the tags in the corresponding row of the tag array. For example, when a prefetch unit searches its instruction cache for instructions residing at a particular address, a number of bits from the address are used as an “index” into the cache. The index selects a particular row within the data array and a corresponding row within the tag array. The number of address bits used for the index is thus determined by the number of rows configured into the cache. The tag addresses within the selected row are examined to determine if any match the requested address. If a match is found, the access is said to be a “hit” and the instruction cache provides the associated instruction bytes from the data array. If a match is not found, the access is said to be a “miss.” When a miss is detected, the prefetch unit causes the requested instruction bytes to be transferred from the memory system into the data array. The address associated with the instruction bytes is then stored in the tag array.
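The index, tag comparison, hit, and miss steps described above can be sketched as follows. This is a minimal software model and not a description of any particular hardware. The geometry (32-byte lines, four ways, 256 rows for a 32-kilobyte cache) is an assumption derived from the example figures in the text, and the names `split_address` and `SetAssociativeCache`, as well as the naive way selection on a fill, are hypothetical.

```python
LINE_BYTES = 32   # bytes per cache line (example figure from the text)
NUM_WAYS = 4      # four-way set-associative (example from the text)
NUM_ROWS = 256    # assumed: 32 KB / 32 bytes per line / 4 ways

OFFSET_BITS = LINE_BYTES.bit_length() - 1   # 5 bits select a byte in the line
INDEX_BITS = NUM_ROWS.bit_length() - 1      # 8 bits select a row


def split_address(addr):
    """Split an address into (tag, index, offset)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_ROWS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


class SetAssociativeCache:
    def __init__(self):
        # tags[row][way] holds a tag or None; data[row][way] holds a line.
        self.tags = [[None] * NUM_WAYS for _ in range(NUM_ROWS)]
        self.data = [[None] * NUM_WAYS for _ in range(NUM_ROWS)]

    def lookup(self, addr):
        """Return the cached line on a hit, or None on a miss."""
        tag, index, _ = split_address(addr)
        for way in range(NUM_WAYS):
            if self.tags[index][way] == tag:
                return self.data[index][way]
        return None

    def fill(self, addr, line):
        """Store a line after a miss and record its tag."""
        tag, index, _ = split_address(addr)
        row = self.tags[index]
        # Placeholder policy: use a free way if any, otherwise way 0.
        way = row.index(None) if None in row else 0
        self.tags[index][way] = tag
        self.data[index][way] = line
```

In this model a miss returns `None`, after which the caller would read the line from memory and call `fill`. A real cache would also need a replacement policy, which the sketch deliberately leaves out.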
Instruction bytes are read from main memory and then stored in the instruction cache until they are needed. In some embodiments, microprocessors may “predecode” the instruction bytes before they are stored in the instruction cache. Predecoding typically involves identifying the boundaries between consecutive instructions and possibly identifying the opcode bytes within the instruction. This predecode information is typically stored with the instruction bytes in the instruction cache. When instructions are fetched from the instruction cache, the predecode information is used to speed the alignment and decoding of the instructions.
After a requested instruction address is output to main memory, a predetermined number of sequential instruction bytes beginning at the requested address are read from main memory, predecoded, and then conveyed to the instruction cache for storage. The instruction bytes are stored into storage locations (“cache lines”) according to their address, typically without regard to what types of instructions are contained within the sequence of instruction bytes.
One drawback, however, of traditional caches is that they suffer from inefficiencies because branch instructions and branch targets do not naturally occur at cache line boundaries. This may deleteriously affect performance because taken branch instructions residing in the middle of a cache line may cause the end portion of the cache line to be discarded when it is fetched. Furthermore, branch targets that are not located at the start of a cache line may similarly cause the beginning portion of the cache line to be discarded. For example, upon receiving a fetch address, the typical instruction cache reads the entire corresponding cache line, and then selection logic (either internal or external to the instruction cache) selects the desired instructions and discards instruction bytes before the target address and/or after a branch instruction.
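The cost of this discarding can be made concrete with a small calculation. The sketch below assumes the 32-byte line size used as an example earlier; the function name `useful_bytes` and the offsets in the example are hypothetical.

```python
LINE_BYTES = 32  # example cache line size from the text


def useful_bytes(target_offset, branch_end_offset=None):
    """Return (kept, discarded) byte counts for one fetched cache line.

    target_offset: offset within the line of the first wanted byte
        (the branch target, or 0 if the fetch starts at the line start).
    branch_end_offset: offset one past the last byte of a taken branch
        in the line, or None if the line holds no taken branch.
    """
    end = LINE_BYTES if branch_end_offset is None else branch_end_offset
    if not 0 <= target_offset < end <= LINE_BYTES:
        raise ValueError("offsets must satisfy 0 <= target < end <= line size")
    kept = end - target_offset
    return kept, LINE_BYTES - kept


# A branch target at offset 10 and a taken branch ending at offset 22
# leave only 12 of the 32 fetched bytes in use.
print(useful_bytes(10, 22))
```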
In addition to discarding fetched instruction bytes, an additional performance penalty results from the alignment that must be performed before the instruction bytes can be properly decoded. While the cache-related problems highlighted above may occur in both RISC and x86 instruction sets, the problems are typically aggravated by the variable-length nature of x86 instructions.
Thus, a method and apparatus for more easily accessing instruction bytes stored in a cache is desired. In addition, a method that would improve the cache performance of both RISC microprocessors and x86 compatible microprocessors would be particularly desirable.
The problems outlined above may in part be solved by a cache memory configured to access stored instructions according to basic blocks. Instruction streams have natural divisions that are determined by branches. These divisions are referred to herein as “basic blocks”, with the start of a basic block being the target of a branch, and the end being another (taken) branch instruction. Thus, a method for caching instructions in a block oriented manner rather than the conventional power-of-2 memory blocks is contemplated.
In one embodiment, the method comprises receiving instruction bytes corresponding to a fetch address and decoding the instruction bytes into instructions. Next, basic blocks of instructions are formed by grouping the instructions into basic blocks ending with branch instructions. The basic blocks may be padded with NULL instructions if the basic blocks have fewer than a predetermined number of instructions. Conversely, the basic blocks may be divided into two or more basic blocks if the basic blocks have more than the predetermined number of instructions. Once formed, the basic blocks are stored into a basic block cache. Pointers corresponding to the basic blocks are stored into a basic block sequence buffer. The pointers are stored with branch prediction information to form predicted sequences of basic blocks which are output by the sequence buffer when it receives a corresponding fetch address. Multiple basic block pointers may be output and fetched from the basic block cache in a particular clock cycle.
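The grouping, padding, and dividing steps of this embodiment can be sketched in software as follows. This is an illustrative model only. The block size of four, the `Instr` type, the `NULL` placeholder string, and the function name `form_basic_blocks` are all assumptions made for the example. The sketch also simplifies by ending a block at every branch instruction and by padding any trailing instructions that have not yet reached a branch.

```python
from dataclasses import dataclass

BLOCK_SIZE = 4   # assumed "predetermined number" of instructions per block
NULL = "NULL"    # placeholder standing in for a NULL (padding) instruction


@dataclass
class Instr:
    name: str
    is_branch: bool = False


def form_basic_blocks(instrs, block_size=BLOCK_SIZE):
    """Group decoded instructions into fixed-size basic blocks.

    A block is closed when a branch is reached or when it is full.
    Blocks shorter than block_size are padded with NULL entries, and
    runs longer than block_size are divided into several blocks.
    """
    blocks, current = [], []
    for ins in instrs:
        current.append(ins.name)
        if ins.is_branch or len(current) == block_size:
            current += [NULL] * (block_size - len(current))
            blocks.append(current)
            current = []
    if current:  # trailing instructions with no closing branch
        current += [NULL] * (block_size - len(current))
        blocks.append(current)
    return blocks


# Hypothetical decoded instruction stream.
stream = [
    Instr("add"), Instr("cmp"), Instr("jne", is_branch=True),
    Instr("mov1"), Instr("mov2"), Instr("mov3"), Instr("mov4"),
    Instr("sub"), Instr("jmp", is_branch=True),
]
for block in form_basic_blocks(stream):
    print(block)
```

With this stream the first block is padded with one NULL entry, the run of four moves fills a block on its own, and the remaining two instructions form a third block padded with two NULL entries.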
A microprocessor configured to cache basic blocks of instructions is also contemplated. In one embodiment, the microprocessor comprises a basic block cache and a basic block sequence buffer. The basic block cache is configured to store basic blocks, wherein each basic block may comprise a number of instructions and may end with a branch instruction. The basic block sequence buffer comprises a plurality of storage locations, each configured to store a block sequence entry. The block sequence entry has an address tag and one or more basic block pointers. The address tag corresponds to the fetch address of a particular basic block, and the pointers point to basic blocks that follow that particular basic block in a predicted order. Each block sequence entry may contain multiple basic block pointers and branch prediction information to select the basic block that is predicted to follow the block corresponding to the address tag.
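One way to picture a block sequence entry is as a small record keyed by its address tag. The sketch below is a simplified software model under stated assumptions: the class names `BlockSequenceEntry` and `SequenceBuffer` are hypothetical, basic block pointers are modeled as integer indices into a basic block cache, and the branch prediction information is reduced to a single index that selects one of the stored pointers.

```python
from dataclasses import dataclass


@dataclass
class BlockSequenceEntry:
    address_tag: int    # fetch address of the leading basic block
    pointers: list      # candidate successor blocks (indices into the block cache)
    predicted: int = 0  # simplified prediction info: which pointer to follow


class SequenceBuffer:
    def __init__(self):
        # A dict keyed by address tag stands in for the storage locations.
        self.entries = {}

    def store(self, entry):
        self.entries[entry.address_tag] = entry

    def lookup(self, fetch_address):
        """Return the predicted successor pointer, or None if no entry matches."""
        entry = self.entries.get(fetch_address)
        if entry is None:
            return None
        return entry.pointers[entry.predicted]


# Hypothetical contents: the block fetched at 0x400 is followed by block 1
# or block 2, and the prediction currently selects block 2.
buf = SequenceBuffer()
buf.store(BlockSequenceEntry(address_tag=0x400, pointers=[1, 2], predicted=1))
print(buf.lookup(0x400))
```

The returned pointer would then be used to read the predicted basic block from the basic block cache. Updating `predicted` when a branch is mispredicted is outside the scope of this sketch.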
A computer system configured to utilize a basic block oriented cache is also disclosed. In one embodiment, the system comprises a microprocessor having a basic block cache and a basic block sequence buffer. The basic block cache and basic block sequence buffer may be configured as described above. A CPU bus may be coupled to the microprocessor, and a modem may be coupled to the CPU bus via a bus bridge.