1. Technical Field
The present invention relates generally to data processing systems and specifically to memory access operations within data processing systems. Still more particularly, the present invention relates to the reduction of cache pollution in a data processing system.
2. Description of the Related Art
A conventional symmetric multiprocessor (SMP) computer system, such as a server computer system, includes multiple processing units all coupled to a system interconnect, which typically comprises one or more address, data and control buses. Coupled to the system interconnect is a system memory, which represents the lowest level of volatile memory in the multiprocessor computer system and which generally is accessible for read and write access by all processing units. In order to reduce access latency to instructions and data residing in the system memory, each processing unit is typically further supported by a respective multi-level cache hierarchy, the lower level(s) of which may be shared by one or more processor cores.
Cache memories are commonly utilized to temporarily buffer memory blocks that a processor is likely to access, speeding up processing by reducing the latency that would otherwise be incurred in loading the needed data and instructions from system memory. In some multiprocessor (MP) systems, the cache hierarchy includes at least two levels. The level one (L1), or upper-level, cache is usually a private cache associated with a particular processor core and cannot be accessed by other cores in an MP system. Typically, in response to a memory access instruction such as a load or store instruction, the processor core first accesses the upper-level cache. If the requested memory block is not found in the upper-level cache, or if the memory access request cannot be serviced in the upper-level cache (e.g., because the L1 cache is a store-through cache), the processor core then accesses lower-level caches (e.g., level two (L2) or level three (L3) caches) to service the memory access to the requested memory block. The lowest level cache (e.g., L2 or L3) is often shared among several processor cores.
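The lookup order described above can be sketched as a toy two-level model in C. This is a minimal illustration under stated assumptions, not a description of any particular processor: both levels are hypothetically direct-mapped and track only tags, the line size and capacities (`LINE_BYTES`, `L1_LINES`, `L2_LINES`) are arbitrary, and the L1 is modeled as a store-through cache without write-allocate, so every store is serviced at the L2. All type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical geometry: 128-byte lines, direct-mapped levels. */
#define LINE_BYTES 128u
#define L1_LINES   64u
#define L2_LINES   512u

typedef struct {
    uint64_t tag[L2_LINES];   /* block numbers; sized for the larger level */
    bool     valid[L2_LINES];
    unsigned lines;           /* number of lines this level actually uses */
} cache_level;

typedef enum { HIT_L1 = 1, HIT_L2 = 2, FROM_MEMORY = 3 } service_point;

static void cache_init(cache_level *c, unsigned lines) {
    assert(lines > 0 && lines <= L2_LINES);
    memset(c, 0, sizeof *c);
    c->lines = lines;
}

/* True if the block containing addr is present in this level. */
static bool cache_probe(const cache_level *c, uint64_t addr) {
    uint64_t block = addr / LINE_BYTES;
    unsigned i = (unsigned)(block % c->lines);
    return c->valid[i] && c->tag[i] == block;
}

/* Installs the block containing addr, replacing whatever occupied its slot. */
static void cache_fill(cache_level *c, uint64_t addr) {
    uint64_t block = addr / LINE_BYTES;
    unsigned i = (unsigned)(block % c->lines);
    c->valid[i] = true;
    c->tag[i] = block;
}

/* Load: try the upper-level (L1) cache first, then L2, then memory,
   filling each level that missed. */
static service_point core_load(cache_level *l1, cache_level *l2, uint64_t addr) {
    if (cache_probe(l1, addr))
        return HIT_L1;
    service_point sp = cache_probe(l2, addr) ? HIT_L2 : FROM_MEMORY;
    if (sp == FROM_MEMORY)
        cache_fill(l2, addr);
    cache_fill(l1, addr);
    return sp;
}

/* Store: the store-through L1 cannot service the store, so it is performed
   at L2, which (by assumption) allocates on a store miss. An existing L1 copy
   would be updated in place, but this tag-only model has nothing to update. */
static service_point core_store(cache_level *l2, uint64_t addr) {
    service_point sp = cache_probe(l2, addr) ? HIT_L2 : FROM_MEMORY;
    if (sp == FROM_MEMORY)
        cache_fill(l2, addr);
    return sp;
}
```

In this model, a second load of address 0x1000 hits in the L1; after a conflicting load to 0x3000 displaces it from the 64-line L1, the same load is serviced by the L2 instead.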
Because multiple copies of an individual memory block may be distributed throughout the computer system, a coherent view of the contents of memory is maintained through the implementation of a coherency protocol. The coherency protocol, for example, the well-known Modified, Exclusive, Shared, Invalid (MESI) protocol, entails maintaining state information associated with each cached copy of the memory block and communicating at least some memory access requests between processing units to make the memory access requests visible to other processing units.
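A simplified MESI next-state function for a single cached copy of a memory block may help make the protocol concrete. This sketch assumes a textbook snooping MESI variant with only four events (a local read or write, and a snooped read or read-with-intent-to-modify from another processing unit) and ignores transient states, write-back timing, and interconnect details; the enum, struct, and function names are hypothetical. The `bus_request` flag shows why only some requests need to be communicated: a local write to a line held in the Exclusive state upgrades it to Modified without any request being made visible to other processing units.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { MESI_I, MESI_S, MESI_E, MESI_M } mesi_state;

/* Hypothetical event set: local accesses and snooped requests. */
typedef enum { LOCAL_READ, LOCAL_WRITE, SNOOP_READ, SNOOP_RFO } mesi_event;

typedef struct {
    mesi_state next;
    bool bus_request;  /* must a request be made visible to other units? */
    bool supply_data;  /* must this cache source or write back modified data? */
} mesi_result;

/* other_sharers is consulted only for a local read miss: it tells whether
   another cache already holds the block. */
static mesi_result mesi_transition(mesi_state cur, mesi_event ev, bool other_sharers) {
    mesi_result r = { cur, false, false };
    switch (ev) {
    case LOCAL_READ:
        if (cur == MESI_I) {             /* read miss: fetch the block */
            r.next = other_sharers ? MESI_S : MESI_E;
            r.bus_request = true;
        }
        break;
    case LOCAL_WRITE:
        /* I needs a read-with-intent-to-modify; S must invalidate other copies;
           E and M already have exclusive ownership and upgrade silently. */
        if (cur == MESI_I || cur == MESI_S)
            r.bus_request = true;
        r.next = MESI_M;
        break;
    case SNOOP_READ:
        if (cur == MESI_M)
            r.supply_data = true;        /* dirty data must be provided */
        if (cur != MESI_I)
            r.next = MESI_S;
        break;
    case SNOOP_RFO:
        if (cur == MESI_M)
            r.supply_data = true;
        r.next = MESI_I;                 /* another unit is taking ownership */
        break;
    default:
        assert(!"unknown MESI event");
        break;
    }
    return r;
}
```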
When executing in such conventional computer systems, streaming applications commonly write contiguous data words into large arrays without frequent reuse of the store data, leading to "pollution" of the cache hierarchy as the array data of the streaming application displaces other data from the caches. For example, a streaming application may execute code that performs the following function:

    for (i = 0; i < N; i++)
        C[i] = A[i] + B[i];

where N is a large integer. Such code writes to a large number (i.e., N words) of contiguous memory locations for array C, generally causing a substantial amount of data that may soon be accessed again to be cast out or deallocated in favor of the array data, which is unlikely to be accessed again soon. Even in cases in which the memory allocated to array C is not contiguous, all bytes in nearly all memory blocks belonging to array C are overwritten, displacing potentially useful data that may subsequently need to be reloaded into the cache.
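The pollution effect can be illustrated with a small simulation. This is a sketch under hypothetical assumptions, not a model of any real hierarchy: a single 256-line fully associative LRU cache with 128-byte lines and write-allocate stores, 8-byte `double` elements, and only the store stream to array C modeled (the loads of A and B are ignored). A 64-line working set is cached first, the store stream runs, and the function counts how many working-set lines then miss. All names, addresses, and sizes are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES  128u
#define CACHE_LINES 256u   /* hypothetical capacity: 32 KiB */

typedef struct {
    uint64_t block[CACHE_LINES];
    uint64_t last_use[CACHE_LINES];
    unsigned used;
    uint64_t clock;
} lru_cache;

/* Accesses one address (loads and stores both allocate).
   Returns 1 on a hit, 0 on a miss that evicted the least recently used line. */
static int lru_access(lru_cache *c, uint64_t addr) {
    uint64_t blk = addr / LINE_BYTES;
    c->clock++;
    for (unsigned i = 0; i < c->used; i++) {
        if (c->block[i] == blk) {
            c->last_use[i] = c->clock;
            return 1;
        }
    }
    unsigned victim;
    if (c->used < CACHE_LINES) {
        victim = c->used++;
    } else {
        victim = 0;
        for (unsigned i = 1; i < CACHE_LINES; i++)
            if (c->last_use[i] < c->last_use[victim])
                victim = i;
    }
    c->block[victim] = blk;
    c->last_use[victim] = c->clock;
    return 0;
}

/* Warms the cache with hot_lines blocks at hot_base, streams n double-sized
   stores to an array at c_base (the store side of C[i] = A[i] + B[i]),
   then returns how many of the hot lines miss on re-access. */
static unsigned hot_misses_after_stream(uint64_t hot_base, unsigned hot_lines,
                                        uint64_t c_base, size_t n) {
    assert(hot_lines <= CACHE_LINES);  /* the working set must fit initially */
    lru_cache c = {0};
    for (unsigned i = 0; i < hot_lines; i++)
        lru_access(&c, hot_base + (uint64_t)i * LINE_BYTES);
    for (size_t i = 0; i < n; i++)
        lru_access(&c, c_base + i * sizeof(double));
    unsigned misses = 0;
    for (unsigned i = 0; i < hot_lines; i++)
        misses += !lru_access(&c, hot_base + (uint64_t)i * LINE_BYTES);
    return misses;
}
```

Under these assumptions, a short stream of 1,024 elements (64 lines) leaves the working set intact, whereas a stream of 65,536 elements evicts all 64 working-set lines, even though none of the streamed store data is reused.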