1. Field of the Invention
This invention is related to the field of superscalar microprocessors and, more particularly, to data caches capable of performing store accesses in a single clock cycle within high frequency superscalar microprocessors.
2. Description of the Relevant Art
Superscalar microprocessors achieve high performance by simultaneously executing multiple instructions in a clock cycle and by specifying the shortest possible clock cycle consistent with the design. As used herein, the term "clock cycle" refers to an interval of time during which the pipeline stages of a microprocessor perform their intended functions. At the end of a clock cycle, the resulting values are moved to the next pipeline stage.
Since superscalar microprocessors execute multiple instructions per clock cycle and the clock cycle is short, a high bandwidth memory system is required to provide instructions and data to the superscalar microprocessor (i.e. a memory system that can provide a large number of bytes in a short period of time). Without a high bandwidth memory system, the microprocessor would spend a large number of clock cycles waiting for instructions or data to be provided, then would execute the received instructions and/or the instructions dependent upon the received data in a relatively small number of clock cycles. Overall performance would be degraded by the large number of idle clock cycles. However, superscalar microprocessors are ordinarily configured into computer systems with a large main memory composed of dynamic random access memory (DRAM) cells. DRAM cells are characterized by access times which are significantly longer than the clock cycle of modern superscalar microprocessors. Also, DRAM cells typically provide a relatively narrow output bus to convey the stored bytes to the superscalar microprocessor. Therefore, DRAM cells provide a memory system that provides a relatively small number of bytes in a relatively long period of time, and do not form a high bandwidth memory system.
Because superscalar microprocessors are typically not configured into a computer system with a memory system having sufficient bandwidth to continuously provide instructions and data, superscalar microprocessors are often configured with caches. Caches are storage devices containing multiple blocks of storage locations, configured on the same silicon substrate as the microprocessor or coupled nearby. The blocks of storage locations are used to hold previously fetched instruction or data bytes. The bytes can be transferred from the cache to the destination (a register or an instruction processing pipeline) quickly; commonly one or two clock cycles are required as opposed to a large number of clock cycles to transfer bytes from a DRAM main memory.
Caches may be organized into an "associative" structure (also referred to as "set associative"). In an associative structure, the blocks of storage locations are accessed as a two-dimensional array having rows and columns. When a cache is searched for bytes residing at an address, a number of bits from the address are used as an "index" into the cache. The index selects a particular row within the two-dimensional array, and therefore the number of address bits required for the index is determined by the number of rows configured into the cache. The act of selecting a row via an index is referred to as "indexing". The addresses associated with bytes stored in the multiple blocks of a row are examined to determine if any of the addresses stored in the row match the requested address. If a match is found, the access is said to be a "hit", and the cache provides the associated bytes. If a match is not found, the access is said to be a "miss". When a miss is detected, the bytes are transferred from the memory system into the cache. The addresses associated with bytes stored in the cache are also stored. These stored addresses are referred to as "tags" or "tag addresses".
The blocks of memory configured into a row form the columns of the row. Each block of memory is referred to as a "way"; multiple ways comprise a row. The way is selected by providing a way value to the cache. The way value is determined by examining the tags for a row and finding a match between one of the tags and the requested address. A cache designed with one way per row is referred to as a "direct-mapped" cache. In a direct-mapped cache, the tag must be examined to determine if an access is a hit, but the tag examination is not required to select which bytes are transferred to the outputs of the cache. Since only an index is required to select bytes from a direct-mapped cache, the direct-mapped cache is a "linear array" requiring only a single value to select a storage location within it.
Both direct-mapped and associative caches are employed in high frequency (i.e. short clock cycle) superscalar microprocessors. In high frequency applications, set associative caches either become a clock cycle limiter or require multiple clock cycles to complete a cache access because the comparison of tags to the request address and the subsequent selection of data bytes to convey to the output requires more time than the desired clock cycle time allows. The term cache bandwidth is used to describe the number of cache requests which may be made per clock cycle. Cache bandwidth decrease due to multiple clock cycle accesses is particularly detrimental to superscalar microprocessors attempting to execute multiple instructions (which may require cache access) per clock cycle. Longer clock cycles are also detrimental to the performance of a microprocessor, since fewer instructions may be executed per second.
Direct-mapped caches, which compare the selected tag to the request address in parallel with conveying data bytes to the output, operate in less time than the associative cache. Unfortunately, direct-mapped caches are associated with lower hit rates (i.e. the percentage of accesses that are hits) than associative caches with a similar storage capacity. Furthermore, direct-mapped caches are more susceptible to "thrashing". Thrashing is a phenomenon that occurs when the pattern of address requests presented to the cache contains several dissimilar addresses with the same index. Dissimilar addresses are addresses that are stored in the cache with different tags. As an illustrative example, addresses A and B may access the cache alternately and repeatedly. Address A and address B have the same index, and access a direct-mapped cache. First, address A accesses the cache and misses. The indexed cache storage location is filled with bytes associated with address A. Next, address B accesses the cache and misses. The indexed cache storage location discards the bytes associated with address A and is filled with bytes associated with address B. Address A accesses the cache again, and misses. The cache storage location discards the bytes associated with address B and is filled with bytes associated with address A. An associative cache would be able to store bytes associated with both address A and address B simultaneously.
An additional problem associated with both direct-mapped and associative data caches in high frequency applications is the execution of store instructions. A store instruction causes bytes to be stored at a memory location. Unlike load instructions, in which data might be speculatively forwarded to the target (a register, for example) and validated as the correct data in a later clock cycle, store instructions typically cannot speculatively store their operand data into the cache and validate that the access was a cache hit in a later clock cycle. If the speculation is incorrect, data within the cache will have been destroyed. Instead, when a store access to a cache is made, the cache is searched to determine if data associated with the store address is contained within the cache. If the store address is a hit, then the data bytes may be stored into the cache in the storage location currently associated with the store address. If the store address is a miss, data bytes associated with the store address must first be transferred into the cache from main memory. Unfortunately, searching for a hit first and then storing the data bytes has often required two clock cycles of cache access for each store instruction executed in high frequency microprocessors. During the first clock cycle of a store access, the address of the store instruction is determined to hit or miss in the cache. During the second clock cycle, the data associated with the store instruction is stored into the cache. It would be advantageous to perform store accesses to the cache in one clock cycle. Additionally, such a storage device having the access time of a direct-mapped cache with the hit rate and insensitivity to thrashing of an associative cache is desired.