1. Field of the Invention
This invention relates to microprocessors and, more particularly, to an apparatus for performing store memory accesses in a microprocessor.
2. Description of the Relevant Art
Superscalar microprocessors achieve high performance by simultaneously executing multiple instructions during a clock cycle and by specifying the shortest possible clock cycle consistent with the design. As used herein, the term "clock cycle" refers to an interval of time during which the pipeline stages of a microprocessor perform their intended functions. Storage devices (e.g. registers or arrays) capture their values in response to a clock signal defining the clock cycle. For example, storage devices may capture a value in response to a rising or falling edge of the clock signal.
Since superscalar microprocessors execute multiple instructions per clock cycle and the clock cycle is short, a high bandwidth memory system (i.e. a memory system that can provide a large number of bytes in a short period of time) is required to provide instructions and data to the superscalar microprocessor. However, superscalar microprocessors are ordinarily configured into computer systems with a large main memory composed of dynamic random access memory (DRAM) cells. DRAM cells are characterized by access times which are significantly longer than the clock cycle of modern superscalar microprocessors. Also, DRAM cells typically provide a relatively narrow output bus to convey the stored bytes to the superscalar microprocessor. Therefore, a main memory composed of DRAM cells provides a relatively small number of bytes in a relatively long period of time, and does not form a high bandwidth memory system.
Because superscalar microprocessors are typically not configured into a computer system with a memory system having sufficient bandwidth to continuously provide instructions and data, superscalar microprocessors are often configured with caches. Caches are storage devices containing multiple blocks of storage locations, configured on the same silicon substrate as the microprocessor or coupled nearby. The blocks of storage locations are used to hold previously fetched instruction or data bytes. Each block of storage locations stores a set of contiguous bytes, and is referred to as a cache line. Typically, cache lines are transferred to and from the main memory as a unit. Bytes can be transferred from the cache to the destination (a register or an instruction processing pipeline) quickly; commonly one or two clock cycles are required as opposed to a large number of clock cycles to transfer bytes from a DRAM main memory.
Caches may be organized into an "associative" structure (also referred to as "set associative"). In a set associative structure, the cache lines are accessed as a two-dimensional array having rows and columns. When a cache is searched for bytes residing at an address, a number of bits from the address are used as an "index" into the cache. The index selects a particular row within the two-dimensional array, and therefore the number of address bits required for the index is determined by the number of rows configured into the cache. The act of selecting a row via an index is referred to as "indexing". The addresses associated with bytes stored in the multiple cache lines of a row are examined to determine if any of the addresses match the requested address. If a match is found, the access is said to be a "hit", and the cache provides the associated bytes. If a match is not found, the access is said to be a "miss". When a miss is detected, the bytes are transferred from the memory system into the cache. The addresses associated with bytes stored in the cache are also stored. These stored addresses are referred to as "tags" or "tag addresses".
The cache lines within a row form the columns of the row. Columns may also be referred to as "ways". The column is selected by examining the tags from a row and finding a match between one of the tags and the requested address. A cache designed with one column per row is referred to as a "direct-mapped cache". In a direct-mapped cache, the tag must be examined to determine if an access is a hit, but the tag examination is not required to select which bytes are transferred to the outputs of the cache.
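The indexing and tag comparison described above may be illustrated by a minimal software sketch. The sketch is not part of any described apparatus. The class name, the geometry (four rows, two ways, 16-byte cache lines), and the first-in, first-out replacement of a full row are assumptions chosen only for illustration; the description above does not specify a replacement policy.

```python
class SetAssociativeCache:
    """Toy model of a set associative cache (illustrative assumptions only)."""

    def __init__(self, num_rows=4, num_ways=2, line_bytes=16):
        # Assumed geometry; any positive values work with the arithmetic below.
        self.num_rows = num_rows
        self.num_ways = num_ways
        self.line_bytes = line_bytes
        # Each row holds up to num_ways (tag, cache line) pairs, one per way.
        self.rows = [[] for _ in range(num_rows)]

    def split(self, address):
        """Split an address into (tag, index, offset within the cache line)."""
        offset = address % self.line_bytes
        line_number = address // self.line_bytes
        index = line_number % self.num_rows   # selects the row ("indexing")
        tag = line_number // self.num_rows    # remaining upper address bits
        return tag, index, offset

    def read(self, address, memory):
        """Return ("hit" or "miss", byte value) for a one-byte read."""
        tag, index, offset = self.split(address)
        row = self.rows[index]
        # Compare the stored tag of every way in the selected row.
        for stored_tag, line in row:
            if stored_tag == tag:
                return "hit", line[offset]
        # Miss: transfer the whole cache line from memory as a unit.
        base = address - offset
        line = bytes(memory[base:base + self.line_bytes])
        if len(row) == self.num_ways:
            row.pop(0)  # assumed FIFO replacement; not specified by the text
        row.append((tag, line))
        return "miss", line[offset]
```

Constructing the model with `num_ways=1` gives the direct-mapped case: the index alone selects the bytes to be output, and the tag comparison only determines whether the access is a hit.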
In addition to employing caches, superscalar microprocessors often employ speculative execution to enhance performance. An instruction may be speculatively executed if the instruction is executed prior to determination that the instruction is actually to be executed within the current instruction stream. Other instructions which precede the instruction in program order may cause the instruction not to be actually executed (e.g. a mispredicted branch instruction or an instruction which causes an exception). If an instruction is speculatively executed and later determined not to be within the current instruction stream, the results of executing the instruction are discarded. Unfortunately, store memory accesses are typically not performed speculatively. As used herein, a "memory access" refers to a transfer of data between one or more main memory storage locations and the microprocessor. A transfer from memory to the microprocessor (a "read") is performed in response to a load memory access. A transfer from the microprocessor to memory (a "write") is performed in response to a store memory access. Memory accesses may be a portion of executing an instruction, or may be the entire instruction. A memory access may be completed internal to the microprocessor if the memory access hits in the data cache therein. As used herein, "program order" refers to the sequential order of instructions specified by a computer program.
While speculative load memory accesses are often performed, several difficulties typically prevent implementation of speculative store memory accesses. As opposed to registers which are private to the microprocessor, memory may be shared with other microprocessors or devices. Although the locations being updated may be stored in the data cache, the data cache is required to maintain coherency with main memory. In other words, an update performed to the data cache is recognized by other devices which subsequently access the updated memory location. Other devices must not detect the speculative store memory access, which may later be canceled from the instruction processing pipeline due to incorrect speculative execution. However, once the store becomes non-speculative, external devices must detect the corresponding update. Additionally, speculative loads subsequent to the speculative store within the microprocessor must detect the updated value even while the store is speculative.
Instead of speculatively performing store memory accesses, many superscalar microprocessors place the store memory accesses in a buffer. When the store memory accesses become non-speculative, they are performed. Load memory accesses which access memory locations updated by a prior store memory access may be stalled until the store memory access completes, or may receive forwarded data from the store memory access within the buffer. Even when forwarding is implemented, the load memory access is stalled for cases in which the load memory access is not completely overlapped by the store memory access (i.e. the load memory access also reads bytes which are not updated by the store memory access). Buffer locations occupied by stores and loads which depend upon those stores are not available to subsequent memory accesses until the store is performed. Performance of the microprocessor is thereby decreased due to the inability to perform speculative store memory accesses. An apparatus allowing speculative performance of store memory accesses while ensuring correct operation is desired.
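The conventional buffering scheme described above may be illustrated by a minimal software sketch. The sketch models only the conventional store buffer, not the desired apparatus. The class and method names, the byte-granular overlap check, and the youngest-first search order are assumptions made for illustration.

```python
class StoreBuffer:
    """Toy model of a store buffer with forwarding (illustrative assumptions only)."""

    def __init__(self):
        # Buffered stores in program order, oldest first: (address, data bytes).
        self.entries = []

    def add_store(self, address, data):
        """Buffer a speculative store instead of updating memory."""
        self.entries.append((address, bytes(data)))

    def load(self, address, size):
        """Classify a load as ("forward", data), ("stall", None) or ("cache", None)."""
        wanted = set(range(address, address + size))
        # Search the youngest store first so the most recent data is seen.
        for st_addr, st_data in reversed(self.entries):
            covered = set(range(st_addr, st_addr + len(st_data)))
            if wanted <= covered:
                # Load is completely overlapped by the store: forward the bytes.
                start = address - st_addr
                return "forward", st_data[start:start + size]
            if wanted & covered:
                # Load also reads bytes the store does not update: stall.
                return "stall", None
        # No buffered store touches these bytes; the load may access the cache.
        return "cache", None

    def retire_oldest(self, memory):
        """Perform the oldest store once it has become non-speculative."""
        address, data = self.entries.pop(0)
        memory[address:address + len(data)] = data
```

In this sketch, a four-byte store buffered at address 0x100 forwards data to a two-byte load at 0x101, while a four-byte load at 0x102 is stalled until `retire_oldest` performs the store, which illustrates the performance limitation noted above.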