1. Field
The present disclosure relates to computer processors (also commonly referred to as CPUs).
2. State of the Art
A computer processor (and the program which it executes) needs places to put data for later reference. A computer processor design will typically have many such places, each with its own trade-off of capacity, speed of access, and cost. Usually these are arranged in a hierarchical manner referred to as the memory system of the processor, with small, fast, costly places used for short-lived small data and large, slow, cheap places used for what does not fit in the small, fast, costly places. The memory system typically includes the following components arranged in order of decreasing speed of access:
register file or other form of fast operand storage;
one or more levels of cache memory (each level of the cache memory can be integrated with the processor (on-chip cache) or separate from the processor (off-chip cache));
main memory (or physical memory), which is typically implemented by DRAM memory and/or NVRAM memory and/or ROM memory;
controller card memory; and
on-line mass storage (typically implemented by one or more hard disk drives).
In many computer processors, the main memory of the memory system can take several hundred machine cycles to access. The cache memory, which is much smaller and more expensive than the main memory but faster to access, is used to keep copies of data that resides in the main memory. If a reference finds the desired data in the cache (a cache hit), it can access the data in a few machine cycles, instead of the several hundred cycles required when the data is not in the cache (a cache miss). Because a program typically has nothing else to do while waiting to access data in memory, using a cache and making sure that desired data is copied into the cache can provide significant improvements in performance.
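The benefit of a high hit rate can be estimated with a simple weighted average. The following is a minimal sketch rather than a model of any particular processor: the function name `average_access_cycles` and the default latencies of 3 and 300 cycles are assumptions chosen only to stand in for "a few" and "several hundred" machine cycles.

```python
def average_access_cycles(hit_rate, hit_cycles=3, miss_cycles=300):
    """Expected cycles per access for a single level of cache.

    hit_cycles and miss_cycles are assumed, illustrative latencies.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # Weight each latency by how often that outcome occurs.
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles
```

Under these assumed latencies, even a small fraction of misses dominates the average, which is why keeping the desired data in the cache matters.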
The cache granularity (the cache line) is chosen to optimize the transfer of data between external memory and cache memory. Typical cache line sizes are 32 or 64 bytes, significantly larger than the granularity of program access to data, which is commonly one to eight bytes.
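The relationship between a program's byte address and its containing cache line can be illustrated as follows. This is a sketch under assumptions: the 64-byte line size is one of the typical sizes mentioned above, and the name `split_address` is hypothetical.

```python
LINE_SIZE = 64  # bytes per cache line (assumed; 32 is equally typical)

def split_address(address, line_size=LINE_SIZE):
    """Return (line_base, offset) for a byte address.

    line_base is the address of the first byte of the containing line;
    offset is the position of the addressed byte within that line.
    """
    offset = address % line_size
    return address - offset, offset
```

For example, `split_address(0x1234)` yields the line base `0x1200` and the offset `0x34`, so an eight-byte access at that address touches only a small part of the 64-byte line.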
The mismatch of granularity is not usually significant for loads of data. If the desired data is not found in cache, then the whole containing line is brought in from external memory and the load is satisfied from the relevant portion of the line. A subsequent load may reference a different part of the line and be satisfied rapidly from cache without another access to external memory. Similarly, a store to a location that is already resident in cache may be performed quickly by updating the cache line, without sending the new data values to the external memory.
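The load and store-hit behavior described above can be sketched with a toy model. Everything here is an illustrative assumption: the class name `ToyCache`, the 64-byte line, the use of a `bytearray` as external memory, and the simplifications that lines are never evicted and that accesses never cross a line boundary.

```python
LINE_SIZE = 64  # bytes per cache line (assumed)

class ToyCache:
    """Toy cache: whole-line fill on load miss, in-cache update on store hit."""

    def __init__(self, memory, line_size=LINE_SIZE):
        self.memory = memory        # bytearray standing in for external memory
        self.line_size = line_size
        self.lines = {}             # line base address -> bytearray copy of the line
        self.memory_reads = 0       # number of line fills from external memory

    def _split(self, address):
        offset = address % self.line_size
        return address - offset, offset

    def load(self, address, size):
        base, offset = self._split(address)
        assert offset + size <= self.line_size, "access must not cross a line"
        if base not in self.lines:
            # Load miss: bring in the whole containing line.
            self.lines[base] = bytearray(self.memory[base:base + self.line_size])
            self.memory_reads += 1
        # Satisfy the load from the relevant portion of the line.
        return bytes(self.lines[base][offset:offset + size])

    def store(self, address, data):
        base, offset = self._split(address)
        assert offset + len(data) <= self.line_size, "access must not cross a line"
        if base not in self.lines:
            raise KeyError("store miss is outside the scope of this sketch")
        # Store hit: update the resident line only; external memory is untouched.
        self.lines[base][offset:offset + len(data)] = data
```

In this model, a second load from a different part of an already resident line leaves `memory_reads` unchanged, and a store hit changes the cached copy without writing to `memory`.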
However, stores to lines that are not cache resident (write misses) present a problem. If a write miss allocates a new line in cache and updates it with the stored value, then the granularity disparity means that there will be unwritten bytes in the line. Such remaining unwritten bytes of the line have undefined values, and a subsequent load from the undefined portion would not return a correct value to the CPU core.
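The problem can be made concrete by tracking which bytes of a newly allocated line have actually been written. The sketch below is illustrative only: the per-byte `valid` list is a bookkeeping device introduced here to expose the undefined bytes, the function names are hypothetical, and the 64-byte line size is assumed.

```python
LINE_SIZE = 64  # bytes per cache line (assumed)

def allocate_on_store_miss(offset, data, line_size=LINE_SIZE):
    """Allocate a fresh line on a store miss and write only the stored bytes.

    Returns (line, valid), where valid[i] is True only for bytes that the
    store actually wrote. The zero fill of the other bytes is a placeholder;
    in hardware their contents would be arbitrary.
    """
    assert offset + len(data) <= line_size, "store must not cross a line"
    line = bytearray(line_size)
    valid = [False] * line_size
    line[offset:offset + len(data)] = data
    for i in range(offset, offset + len(data)):
        valid[i] = True
    return line, valid

def load_is_defined(valid, offset, size):
    """True only if every byte the load would read has been written."""
    return all(valid[offset:offset + size])
```

After a four-byte store at offset 16, only 4 of the 64 bytes are defined; a load of those same four bytes is safe, while a load that overlaps any other byte of the line would read undefined data.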
There are two well-known methods to avoid this write-miss problem. In the write-through method, all stores that do not hit in cache are sent to external memory without allocating a cache line, and cache lines are only allocated by a load. In the write-back method, store misses cause the target line to be read from external memory in the same way as a load, whereupon it can be updated with the stored value as if there had been no miss.
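The two methods can be contrasted in a small model that counts traffic to external memory. This is a sketch under stated assumptions, not an implementation of any real cache: the class name `PolicyCache`, the counters, and the 64-byte line are hypothetical, lines are never evicted, and accesses are assumed not to cross a line boundary. The policy strings follow the names used above; the behaviors shown are also commonly called no-write-allocate and write-allocate, respectively.

```python
LINE_SIZE = 64  # bytes per cache line (assumed)

class PolicyCache:
    """Toy cache contrasting the two store-miss methods described above."""

    def __init__(self, memory, policy, line_size=LINE_SIZE):
        if policy not in ("write-through", "write-back"):
            raise ValueError("unknown policy: " + policy)
        self.memory = memory        # bytearray standing in for external memory
        self.policy = policy
        self.line_size = line_size
        self.lines = {}             # line base address -> bytearray copy
        self.memory_reads = 0       # line fills from external memory
        self.memory_writes = 0      # stores sent to external memory

    def _split(self, address):
        offset = address % self.line_size
        return address - offset, offset

    def _fill(self, base):
        self.lines[base] = bytearray(self.memory[base:base + self.line_size])
        self.memory_reads += 1

    def load(self, address, size):
        base, offset = self._split(address)
        if base not in self.lines:
            self._fill(base)        # in both methods, a load allocates the line
        return bytes(self.lines[base][offset:offset + size])

    def store(self, address, data):
        base, offset = self._split(address)
        if base not in self.lines:
            if self.policy == "write-through":
                # Store miss: send the data to external memory, allocate nothing.
                self.memory[address:address + len(data)] = data
                self.memory_writes += 1
                return
            # Store miss under write-back: read the whole line first, as for a load.
            self._fill(base)
        # Store hit (or line just filled): update the cached line only.
        self.lines[base][offset:offset + len(data)] = data
```

In this model, two one-byte stores to the same non-resident line cost two external memory writes under the write-through method, and one line read with no external writes under the write-back method.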
Each of these two methods can cause the program to incur significant costs. In the write-through method, multiple write misses to the same line increase traffic to external memory as each is written through. The extra traffic may be avoided by use of buffers that combine multiple stores to the same line, but then these buffers must be checked in the same way as is needed for the write-back method, with the same power and complexity costs. In the write-back method, the store value must be buffered until the desired line is read from external memory, and the buffer must be checked by subsequent loads and stores to provide semantically consistent behavior in the case of overlapping access; the buffering and checking is expensive in power