1. Field of the Invention
The present invention relates generally to the design of a cache memory and, more particularly, to a system and method of scoreboarding individual cache line segments in order to reduce the penalty associated with a cache store-miss.
2. Discussion of Related Art
Most modern computer systems include some type of memory hierarchy. The memory hierarchy normally consists of many levels but is typically managed between only two adjacent levels at any one time. The upper level (the one closer to the processor) is smaller and faster than the lower level. The minimum unit of information that can be either present or not present in the two-level hierarchy is called a block (or a line when discussing caches).
Success or failure of an access to the upper level is designated as a hit or a miss, respectively. A hit is a successful memory access to the upper level, while a miss means that the desired data is not found in that level. Misses that are associated with store (write) instructions are referred to as store-misses and misses that are associated with load (read) instructions are referred to as load-misses.
Since performance is the major reason for having a memory hierarchy, the speed of hits and misses is important. Hit time is the time it takes to access the upper level of the memory hierarchy, which includes the time required to determine whether the access is a hit or a miss. Miss penalty is the time required to replace a block in the upper level with a corresponding block from the lower level, plus the time required to deliver this block to the requesting device (normally the processor).
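The relationship between hit time and miss penalty can be illustrated with the standard textbook expression for average access time (hit time plus miss rate times miss penalty). The following is a minimal sketch only; the function name and the figures used are hypothetical and are not taken from any particular system.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average time per access to the upper level.

    Every access pays the hit time; the fraction of accesses that
    miss additionally pays the miss penalty.
    """
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_time + miss_rate * miss_penalty


# Hypothetical figures: 1-cycle hit time, 5% miss rate, 20-cycle miss penalty.
print(average_access_time(1, 0.05, 20))
```

Under these assumed figures, halving the miss penalty reduces the average access time without any change to the hit time, which is why reducing the miss penalty is a worthwhile goal in its own right.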
Cache memories are high-speed memories that are placed between microprocessors and main memories. They store copies of main memory that are currently in use in order to speed microprocessor access to requested data and instructions. Caches appear today in every class of computer and in some computers more than once. In order to achieve the speed necessary to aid in microprocessor performance, cache memories are typically built using fast static random access memory circuits (SRAMs). Cache systems typically include a data cache (D-cache) and an instruction cache (I-cache).
Cache memories provide rapid access to frequently used instructions and data through load and store instructions. Cache memories communicate with main memory and other caches through "miss transactions." A miss transaction occurs when an instruction generates a cache miss, i.e., when the processor attempts to access data that is not present in the cache memory.
The main advantage of using a cache is that a larger, relatively slow main memory can be made to emulate the high speeds of a cache. When properly implemented, a cache memory can typically have an access time which is three to twenty times faster than that of main memory, thus reducing the overall memory access time. Caches also reduce the number of accesses to main memory. This is especially important in systems with multiple processors which all compete for access to a common memory.
Cache memories are important in increasing computer performance by reducing total memory latency. A cache memory typically consists of a directory (or tag) and a data memory. Whenever the CPU is required to read or write data, it first accesses the tag to determine whether the data is present in the cache. If the tags match, the requested word is present in the cache and a cache hit occurs. If the tags do not match, then the data word is not present in the cache. This is called a cache miss. On a cache hit, the cache data memory allows a read operation to be completed more quickly than a slower main memory access. The hit rate is the percentage of accesses to the cache that are hits, and is affected by the size and organization of the cache, the cache algorithm used, and the program which is running. An effective cache system should maintain data in a way that maximizes the hit rate.
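The tag comparison and hit rate described above can be sketched with a toy direct-mapped cache. This is an illustrative model only: the class name, the choice of a direct-mapped organization, and the line count and line size are all assumptions made for the example, and only the tag directory is modeled (no data memory).

```python
class DirectMappedCache:
    """Toy direct-mapped cache: a tag directory plus hit/miss counters."""

    def __init__(self, num_lines=4, line_size=16):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines  # directory: one tag per cache line
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Return True on a hit; on a miss, install the new tag."""
        block = address // self.line_size
        index = block % self.num_lines   # which cache line to check
        tag = block // self.num_lines    # identifies the block held there
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.misses += 1
        self.tags[index] = tag           # the fetched line replaces the old one
        return False

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = DirectMappedCache()
# Addresses 0, 4 and 8 share one line; address 64 maps to the same
# cache line with a different tag and therefore displaces it.
for addr in [0, 4, 8, 64, 0, 4]:
    cache.access(addr)
print(cache.hit_rate())  # 0.5
```

In this access sequence three of the six accesses hit, and the conflict between addresses 0 and 64 shows how the organization of the cache, and not only its size, affects the hit rate.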
The servicing of a cache miss is conventionally handled by making room in the data cache for the new data to be input, fetching the data from main memory, and then storing the data in the data cache. Storing the requested data into the data cache is also referred to as a "miss copy-in."
Caches can frequently be categorized according to the store (write) policies which they employ. There are two basic store options employed by caches:
(1) Write-through (or store-through)--Information is written to both the line in the cache and to the block in the lower-level memory.
(2) Write-back (also called copy-back or store-in)--Information is written only to the line in the cache. The modified cache line is written to main memory only when the cache line is replaced.
Write-back cache blocks may be classified as either clean or dirty, depending on whether the information in the cache differs from that in lower-level memory. To help maintain the integrity of the data in the lower-level memory and to reduce the frequency of writing clean blocks back to lower-level memory, a feature called the dirty bit is commonly used. The dirty bit is a status bit which indicates whether or not the block was modified while in the cache. If it was not modified, then the block does not need to be written back to the lower-level memory, since the lower-level memory has the same information as the cache.
Both write-back and write-through policies have their advantages. With write-back, writes occur at the speed of the cache memory, and multiple writes within a line require only one write to the lower-level memory. Since not every write goes to lower-level memory, write-back uses less memory bandwidth, making write-back attractive in multiprocessor environments. With write-through, read misses do not result in writes to the lower level, and write-through is easier to implement than write-back. Write-through also has the advantage that main memory always has a current copy of the data. This is important in multiprocessor environments. Hence, multiprocessors want write-back in order to reduce the memory traffic per processor and write-through to keep the cache and memory consistent.
There are two conventional options that can be taken on a store-miss:
(1) Write-allocate (also called fetch on write)--The block is loaded from main memory into the cache, followed by the write-hit actions outlined above; or
(2) No write-allocate (also called write around)--The block is modified in the lower level and not loaded into the cache.
While either store-miss policy could be used with write-through or write-back, generally write-back caches use write-allocate (hoping that subsequent writes to that block will be captured by the cache) and write-through caches often use no write-allocate (since subsequent writes to that block will still have to go to memory).
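The difference in memory traffic between the two common pairings can be sketched as follows. This is a simplified model, assuming a cache that holds a single line and counting only block-level reads and writes to the lower-level memory; the class and attribute names are hypothetical.

```python
class Line:
    def __init__(self, tag):
        self.tag = tag
        self.dirty = False  # dirty bit: set when the line is modified


class ToyCache:
    """One-line cache that counts traffic to lower-level memory."""

    def __init__(self, write_back, write_allocate):
        self.write_back = write_back
        self.write_allocate = write_allocate
        self.line = None
        self.memory_reads = 0
        self.memory_writes = 0

    def _evict(self):
        # Only a dirty line needs to be written back on replacement.
        if self.line is not None and self.line.dirty:
            self.memory_writes += 1
        self.line = None

    def store(self, block):
        hit = self.line is not None and self.line.tag == block
        if not hit:
            if not self.write_allocate:
                self.memory_writes += 1   # write around the cache
                return
            self._evict()
            self.memory_reads += 1        # fetch on write
            self.line = Line(block)
        # Write-hit actions:
        if self.write_back:
            self.line.dirty = True        # memory is updated on replacement
        else:
            self.memory_writes += 1       # write-through updates memory now


stores = [7, 7, 7, 7, 9]  # four stores to block 7, then one to block 9

wb = ToyCache(write_back=True, write_allocate=True)
wt = ToyCache(write_back=False, write_allocate=False)
for block in stores:
    wb.store(block)
    wt.store(block)

print(wb.memory_reads, wb.memory_writes)  # 2 1
print(wt.memory_reads, wt.memory_writes)  # 0 5
```

In this sequence the write-back, write-allocate cache captures the repeated stores to block 7 and writes memory once, on replacement, while the write-through, no write-allocate cache sends every store to memory.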
If a program running on a processor requires data to continue its processing stream and that data is not yet available, a condition known as a "stall" will occur. A stall is a period of time during which a processor is idle while some peripheral subsystem (e.g., the main or cache memory) is busy acquiring the critical data that caused the stall. In general, load misses and store misses may force a processor to stall.
When a processor encounters a store miss, a line in cache is selected to be displaced (overwritten) by the line in main memory that is referenced by the store miss. The processor then enters a stall state and a store miss transaction is initiated while the line is copied from main memory. When the store miss completes its task, the CPU is able to post the required store data and continue processing.
It is possible to defer the stall in the above scenario if a dedicated local register is provided to which the processor can temporarily post a missed store. The missed line from memory can then be combined with the data from the local register at a later time to preserve cache consistency (i.e., ensure that the cache has the most recent data). Using this scheme, the processor can defer the stall for a store miss until a subsequent load requests data from the missed line or a subsequent attempt to store to the same line is made. If the store miss completes before either of those two events occurs, then the stall is avoided altogether. This functionality is often referred to as "stall-on-use" or "hit-under-miss."
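The deferred stall described above can be sketched with a small cycle-counting model. The sketch rests on several assumptions that are not part of the description: a single local register, a fixed miss latency, one cycle per access, and a simplification whereby any further miss while the register is occupied also waits for the outstanding miss to complete. All names are hypothetical.

```python
class StallOnUseCache:
    """Toy model: one local register holds a single missed store."""

    def __init__(self, memory, line_size=4, miss_latency=10):
        self.memory = memory          # dict: line number -> list of words
        self.line_size = line_size
        self.miss_latency = miss_latency
        self.lines = {}               # cached lines: line number -> words
        self.pending = None           # (line, offset, value, ready_cycle)
        self.cycle = 0
        self.stall_cycles = 0

    def _complete_pending(self):
        """Copy the missed line in, then merge the posted store over it."""
        line, offset, value, _ = self.pending
        words = list(self.memory[line])
        words[offset] = value
        self.lines[line] = words
        self.pending = None

    def _retire_if_ready(self):
        if self.pending and self.cycle >= self.pending[3]:
            self._complete_pending()

    def _stall_until_ready(self):
        ready = self.pending[3]
        if ready > self.cycle:
            self.stall_cycles += ready - self.cycle
            self.cycle = ready
        self._complete_pending()

    def _must_wait(self, line):
        # Stall on use of the missed line; as a simplification, also wait
        # before starting any other miss while the register is occupied.
        return self.pending and (self.pending[0] == line
                                 or line not in self.lines)

    def work(self, cycles):
        """Model unrelated instructions that keep the processor busy."""
        self.cycle += cycles
        self._retire_if_ready()

    def store(self, address, value):
        line, offset = divmod(address, self.line_size)
        self._retire_if_ready()
        if self._must_wait(line):
            self._stall_until_ready()
        if line in self.lines:
            self.lines[line][offset] = value      # store hit
        else:
            # Store miss: post to the local register and keep going.
            self.pending = (line, offset, value,
                            self.cycle + self.miss_latency)
        self.cycle += 1

    def load(self, address):
        line, offset = divmod(address, self.line_size)
        self._retire_if_ready()
        if self._must_wait(line):
            self._stall_until_ready()
        if line not in self.lines:
            # Ordinary load miss: the processor stalls for the full latency.
            self.stall_cycles += self.miss_latency
            self.cycle += self.miss_latency
            self.lines[line] = list(self.memory[line])
        self.cycle += 1
        return self.lines[line][offset]


# Case 1: enough unrelated work follows the store miss, so no stall occurs.
a = StallOnUseCache({0: [1, 2, 3, 4]})
a.store(1, 99)
a.work(20)
print(a.load(1), a.stall_cycles)  # 99 0

# Case 2: a load touches the missed line immediately, forcing the stall.
b = StallOnUseCache({0: [1, 2, 3, 4]})
b.store(1, 99)
print(b.load(2), b.stall_cycles)  # 3 9
```

In both cases the posted store data is merged over the line copied in from memory, so a later load of the stored word returns the most recent value.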
For a more in-depth discussion of cache memory design and operation, see Hennessy et al., Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers (1990), which is incorporated by reference in its entirety herein. Portions of Hennessy et al. have been reproduced above for the convenience of the reader.
As outlined above, conventional cache systems do not allow store operations to a missing cache line to execute until the missing line returns from memory. Thus, what is needed is a mechanism that improves cache performance by allowing stores which miss the cache to complete in advance of the miss copy-in from memory.