Many data processing systems include both a system memory (also referred to as main memory) and a cache memory (or simply a cache). A cache memory is a relatively high-speed memory that stores a copy of information that is also stored in one or more portions of the system memory. The cache memory can be integrated within a processor of the data processing system (on-chip or on-die) or can be separate from the processor. There are generally multiple levels of cache memory, with each level closer to the processor being faster and smaller. Commonly the largest level of cache is called Level 3 cache, the next largest is Level 2 cache, and the smallest is Level 1 cache. The Level 1 cache is generally on-die with the CPU, is very small (e.g., commonly only 16 to 32 kilobytes in size), and cannot be bypassed. The Level 2 cache is generally also on-die, is larger (commonly 256 kilobytes), and also cannot be bypassed. The Level 3 cache is the largest cache, commonly anywhere from 512 kilobytes to as much as 8 megabytes. For the remainder of this application, references to bypassing cache memory and avoiding cache pollution refer to the highest level of cache, typically the Level 3 cache.
Most applications passively benefit from the cache memory of a data processing system, which speeds up their performance. However, cache memory is relatively expensive and typically small in size. Furthermore, cache memory is only of benefit when it is used to store data that is accessed multiple times. If data that will only be accessed once is loaded into cache memory, no benefit is gained, because the initial load of data into cache, which happens automatically on first access of the data, is no faster than an access to main memory. Only the second and subsequent accesses of the data benefit from the higher speed of the cache memory. Due to the relatively small size of cache memory, the cache memory will already be full very shortly after start up of a data processing system (usually within milliseconds of initial boot up). From that point on, every load of data into cache memory requires the eviction of some other piece of data from cache memory.
Because shortly after startup the cache memory of a system is already full and all subsequent loads of data into cache evict something else, the term “cache pollution” was coined to signify those times when a useful piece of data that benefits from being in the cache memory is evicted by a piece of data that will only be used once and will not benefit from being in cache memory.
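The effect described above can be illustrated with a toy model. The following is a minimal sketch, assuming a fully associative cache with least-recently-used eviction; real caches are set associative and use other replacement policies, and the names `LRUCache` and `hot_hits_after_streaming` are hypothetical, chosen only for this illustration. It counts how often a small, frequently reused working set still hits in cache when single-use streaming data is, or is not, allowed into the cache.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative cache with least-recently-used eviction (an assumption)."""

    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.lines = OrderedDict()  # keys are cache line identifiers

    def access(self, line, bypass=False):
        """Return True on a hit. On a miss, fill the line unless bypass is set."""
        if line in self.lines:
            self.lines.move_to_end(line)  # mark as most recently used
            return True
        if not bypass:
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)  # evict the least recently used line
            self.lines[line] = True
        return False


def hot_hits_after_streaming(bypass_streaming, rounds=4, cache_lines=8):
    """Count hits on a reused working set interleaved with single-use data."""
    cache = LRUCache(cache_lines)
    hot = [("hot", i) for i in range(cache_lines)]
    for line in hot:  # warm the cache with the reused data
        cache.access(line)
    hits = 0
    for r in range(rounds):
        for i in range(cache_lines):  # data that is touched exactly once
            cache.access(("stream", r, i), bypass=bypass_streaming)
        for line in hot:  # the data that actually benefits from caching
            hits += cache.access(line)
    return hits
```

In this model, `hot_hits_after_streaming(False)` returns 0, because every round of single-use data evicts the entire working set, while `hot_hits_after_streaming(True)` returns 32, because the working set is never displaced.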
Since an application suffers a significant performance penalty if its commonly accessed data is the victim of cache pollution, a significant performance benefit can be realized by instructing the data processing system not to load data that will only be accessed a single time into the cache memory, thereby avoiding the cache pollution. The technique used to do this is commonly called a prefetch operation. The prefetch instruction causes the processing unit of the data processing system to attempt to retrieve the data without polluting the cache memory.
It is important to note that if data was loaded using a prefetch cycle, cache pollution was successfully avoided, and the data must then be written back to main memory, the write must also use a non-cache-polluting technique. A normal write of the data will send the data through all levels of the cache memory, thereby undoing the benefit gained by avoiding the cache memory on the data load cycle. The technique commonly used is to substitute a non-cache-polluting processor instruction for the typical instruction, for example MOVNTPS for MOVAPS, where NT stands for Non Temporal, Intel-specific nomenclature for data that is known not to be used again and should not be allowed to pollute the cache.
The use of the prefetch operation is not without possible pitfalls. The prefetch operation is a non-blocking CPU operation, meaning that if it has not completed by the time the data being prefetched is needed by the CPU, the CPU will not consider the condition an error and will continue operating. However, if the data is not present when the CPU needs it, the normal data retrieval mechanism in the CPU is immediately triggered and the data will be loaded into all levels of cache memory despite the prefetch operation. This negates the effect of the prefetch operation entirely.
In addition, even though the prefetch operation is non-blocking and will not stall the CPU, it is not free in terms of CPU cycles, memory bus cycles, or other CPU resources. This means that any time a prefetch operation is attempted but the data is not retrieved before the CPU is ready for it, the benefit of the prefetch is lost while the resources of the prefetch are still consumed. As a result, large numbers of failed prefetch operations actually have a negative impact on overall system performance compared to not attempting any prefetch operations at all and simply allowing transient data to pollute all levels of the cache memory. For this reason it is important that the prefetch operation be issued sufficiently early that it can complete before the CPU is ready for the data being prefetched.
Similarly, if the data is prefetched successfully but not used very soon, it can end up being evicted from the level of cache it was prefetched to by the time the CPU is ready for it. This is also not an error condition, and it also triggers the CPU's normal memory load cycle. However, this condition is even worse than a prefetch that has not yet completed: the prefetch completed, its result was subsequently thrown away, and the normal CPU memory load cycle was then performed, so the memory being prefetched was read from main memory into cache twice, effectively doubling the load on the bus between main memory and the CPU.
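The three possible outcomes of a prefetch described above (too late, too early, or useful) can be summarized in a small timing model. This is a sketch under simplifying assumptions: all times are in the same arbitrary unit, and `residency` is a hypothetical parameter standing for how long a prefetched line survives before being evicted, which on real hardware depends on the other memory traffic and is not a fixed number.

```python
def prefetch_outcome(issued_at, needed_at, fetch_latency, residency):
    """Classify a prefetch by when its data arrives relative to when it is needed.

    issued_at:     time the prefetch was issued
    needed_at:     time the CPU actually accesses the data
    fetch_latency: time for the prefetched data to arrive
    residency:     assumed time the data survives after arrival (hypothetical)
    """
    arrives_at = issued_at + fetch_latency
    if arrives_at > needed_at:
        # Issued too late: the normal load path is triggered anyway.
        return "late"
    if needed_at - arrives_at > residency:
        # Issued too early: the data arrived, was evicted, and must be read again.
        return "evicted"
    return "useful"
```

For example, with a fetch latency of 150 and a residency of 200, a prefetch issued at time 0 is "late" if the data is needed at time 100, "useful" if it is needed at time 200, and "evicted" if it is needed at time 1000.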
Since it is important that the prefetch operation be performed at the optimal time relative to when the CPU will need the data being prefetched, applications that use the prefetch technique are optimized to find the best point in time to prefetch data. However, the optimal interval varies with several factors, including the memory load at the time the prefetch operation is performed, the ratio of main memory speed to CPU speed, and the memory controller in use. Most (if not all) applications simply hard code the prefetch interval that was found to be optimal on their specific test platform(s) under static loading conditions. Failure to account for varying system conditions and to adapt to those conditions at run time can cause even well-tuned prefetch operations to fail more often than they succeed.
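To make the dependence on run-time conditions concrete, the sketch below uses a common rule of thumb rather than any method stated in this application: prefetch far enough ahead that the memory latency is covered by the time spent processing the intervening loop iterations. The function name `prefetch_distance`, its parameters, and the cap of 64 iterations are assumptions made for illustration.

```python
import math


def prefetch_distance(mem_latency_ns, ns_per_iteration, max_distance=64):
    """Estimate how many loop iterations ahead a prefetch should be issued.

    mem_latency_ns:   measured (or estimated) main memory latency
    ns_per_iteration: measured time to process one loop iteration
    max_distance:     assumed upper bound, so data is not fetched so early
                      that it is evicted before use
    """
    if mem_latency_ns < 0 or ns_per_iteration <= 0:
        raise ValueError("latency must be >= 0 and iteration time must be > 0")
    distance = math.ceil(mem_latency_ns / ns_per_iteration)
    return max(1, min(distance, max_distance))
```

With 25 ns per iteration, a latency of 100 ns gives a distance of 4 iterations, while a latency of 250 ns on a heavily loaded memory bus gives 10. A value hard coded for the first case would therefore issue its prefetches too late in the second.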
The use of cache-bypassing writes is also not without pitfalls. While a prefetch operation is not CPU blocking, a cache-bypassing write is. CPUs commonly have a queue that holds cache-bypassing writes. If that queue is full when a cache-bypassing write is issued, the CPU will stall until a spot frees up in the queue. How fast that queue can drain entries depends on the same factors that affect prefetch latency intervals. If a write is instead issued using the normal instruction, it will go into cache memory immediately and will not stall the CPU. In addition, the cache memory is able to flush contents back to main memory when there is available memory bandwidth, thereby making use of memory bandwidth that would otherwise have gone unused. This increases overall memory efficiency.
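The stall behavior of a full write queue can be sketched with a simple cycle-counting model. This is an illustration only: it assumes a fixed-depth queue that drains one entry at a time at a constant cost, which is a simplification of real hardware, and the function name `stall_cycles` and its parameters are hypothetical.

```python
from collections import deque


def stall_cycles(issue_cycles, queue_depth, drain_cycles):
    """Count cycles a CPU stalls while issuing cache-bypassing writes.

    issue_cycles: ascending cycle numbers at which the program would issue
                  each write if it were never stalled
    queue_depth:  number of entries in the (assumed) write queue
    drain_cycles: cycles needed to drain one entry to main memory;
                  entries are assumed to drain one at a time, in order
    """
    pending = deque()  # completion cycle of each queued write, oldest first
    stalled = 0        # total stall so far; it delays every later write
    last_done = 0      # cycle at which the most recently queued write completes
    for t in issue_cycles:
        now = t + stalled
        while pending and pending[0] <= now:
            pending.popleft()  # entries that have already drained
        if len(pending) == queue_depth:
            wait = pending[0] - now  # stall until the oldest entry drains
            stalled += wait
            now += wait
            pending.popleft()
        last_done = max(now, last_done) + drain_cycles
        pending.append(last_done)
    return stalled
```

In this model, issuing one write per cycle for eight cycles into a four-entry queue that needs 10 cycles per entry stalls the CPU for 33 cycles in total, while the same writes cause no stall if the queue has eight entries or if each entry drains in a single cycle.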
Existing systems use one of two methods for dealing with software RAID XOR operations: they either use no cache bypass methods at all, or they use both cache-bypassing loads and cache-bypassing stores. When they do use cache-bypassing operations, they rely on statically tuned prefetch operations rather than values optimized at run time.
Further, when data is loaded into cache memory, its location in cache memory is directly dependent upon its location in main memory, as there is a mapping function in the cache memory controller that maps from main memory addresses to cache memory sets. Cache memory sets are referred to as N-way, where N is a number between a minimum of 2 and, currently, a high of 24. 4-way, 8-way, and 16-way set associative caches are the most commonly used cache memories currently. An 8-way set associative cache memory will have 8 slots in a given set to hold data in cache memory. Since every memory address maps directly to a set, any two addresses that map to the exact same set are called aliases of each other. The cache can never hold more than N aliases at a time. If memory addresses A and B are aliases of each other, and A is already loaded into cache memory, then when the CPU loads address B into cache it maps to the same cache set as address A and its value may replace A in cache memory, thereby evicting A from cache. Applications commonly attempt to avoid allocating memory at alias offsets, especially if they anticipate ever wanting to use both pieces of memory at the same or similar points in time. No software XOR implementations today make use of cache alias effects to reduce the impact of their cache pollution.
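The mapping from addresses to sets can be illustrated as follows. This is a minimal sketch assuming the simplest possible mapping function, in which the set is selected by the address bits just above the cache line offset. Actual cache controllers may hash additional address bits, and the 64-byte line size and the function names are assumptions for illustration.

```python
def num_sets(cache_bytes, line_size, ways):
    """Number of sets in a set associative cache of the given geometry."""
    return cache_bytes // (line_size * ways)


def cache_set_index(address, line_size=64, sets=8192):
    """Set selected by an address under a simple modulo mapping (an assumption)."""
    return (address // line_size) % sets


def are_aliases(a, b, line_size=64, sets=8192):
    """True if two addresses lie in different cache lines that map to the same set."""
    different_lines = (a // line_size) != (b // line_size)
    same_set = cache_set_index(a, line_size, sets) == cache_set_index(b, line_size, sets)
    return different_lines and same_set


def alias_stride(line_size=64, sets=8192):
    """Smallest address distance at which two lines compete for the same set."""
    return line_size * sets
```

For an 8 megabyte, 16-way cache with 64-byte lines, `num_sets(8 * 1024 * 1024, 64, 16)` is 8192 and `alias_stride()` is 524288 bytes (512 kilobytes), so under this mapping addresses 0 and 524288 are aliases while addresses 0 and 64 are not.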