1. Technical Field
The present invention relates in general to data processing systems, and more particularly, to an improved multi-processor data processing system. Still more particularly, the present invention relates to improved cache operation within multi-processor data processing systems.
2. Description of the Related Art
A conventional multi-processor data processing system (referred to hereinafter as an MP) typically includes a system memory, input/output (I/O) devices, multiple processing elements that each include a processor and one or more levels of high-speed cache memory, and a system interconnect coupling the processing elements to each other and to the system memory and I/O devices. Though most multiprocessor systems utilize a unified system memory and a common bus-type interconnect, generalized interconnects with non-uniform memory access (NUMA) system memory configurations are common in high-performance systems. The processors all utilize common instruction sets and communication protocols, have similar hardware architectures, and are generally provided with similar memory hierarchies.
Caches are commonly utilized to temporarily store values that might be accessed by a processor in order to speed up processing by reducing access latency as compared to loading needed values from memory. Each cache includes a cache array and a cache directory. An associated cache controller manages the transfer of data and instructions between the processor core or system memory and the cache. Typically, the cache directory also contains a series of bits utilized to track the coherency states of the data in the cache.
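The relationship between the cache array and the cache directory can be illustrated with a minimal sketch. The class names, the slot-indexed layout, and the "valid"/"invalid" state strings are assumptions chosen for illustration only and are not taken from any particular cache design.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryEntry:
    tag: int    # address tag identifying which line occupies the slot
    state: str  # coherency state bits, simplified here to "valid" / "invalid"


@dataclass
class Cache:
    array: dict = field(default_factory=dict)      # slot -> cached data
    directory: dict = field(default_factory=dict)  # slot -> DirectoryEntry

    def fill(self, slot, tag, data, state="valid"):
        """Store data in the array and record its tag and state in the directory."""
        self.array[slot] = data
        self.directory[slot] = DirectoryEntry(tag, state)

    def read(self, slot, tag):
        """Return the cached data on a hit, or None on a miss."""
        entry = self.directory.get(slot)
        if entry is None or entry.tag != tag or entry.state == "invalid":
            return None
        return self.array[slot]
```

In this sketch the directory alone decides whether an access hits; the array is consulted only after the tag and state have been checked.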
With multiple caches within the memory hierarchy, a coherency structure (e.g., a page frame table) is required for valid execution results in the MP. This coherency structure provides a single view of the contents of the memory to all of the processors and other memory access devices (e.g., I/O devices). A coherent memory hierarchy is maintained through the utilization of a coherency protocol, such as the MESI protocol. In the MESI protocol, an indication of a coherency state is stored in association with each coherency granule (e.g., a cache line or sector) of one or more levels of cache memories. Each coherency granule can have one of the four MESI states, which is indicated by bits in the cache directory.
The MESI protocol allows a cache line of data to be tagged with one of four states: “M” (modified), “E” (exclusive), “S” (shared), or “I” (invalid). The Modified state indicates that a coherency granule is valid only in the cache storing the modified coherency granule and that the value of the modified coherency granule has not been written to system memory. When a coherency granule is indicated as Exclusive, only that cache holds the data, of all the caches at that level of the memory hierarchy. However, the data in the Exclusive state is consistent with system memory. If a coherency granule is marked as Shared in a cache directory, the coherency granule is resident in the associated cache and possibly in at least one other, and all of the copies of the coherency granule are consistent with system memory. Finally, the Invalid state indicates that the data and address tag associated with a coherency granule are both invalid.
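The properties of the four states described above can be summarized in a short sketch. The enum and the helper function names are hypothetical and serve only to restate the text in executable form.

```python
from enum import Enum


class MESI(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def is_valid(state):
    """A granule is usable in this cache unless it is Invalid."""
    return state is not MESI.INVALID


def is_consistent_with_memory(state):
    """Exclusive and Shared granules match system memory; Modified does not."""
    return state in (MESI.EXCLUSIVE, MESI.SHARED)


def may_reside_in_other_caches(state):
    """Only a Shared granule may also be held by another cache at the same level."""
    return state is MESI.SHARED
```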
The state to which each coherency granule (e.g., cache line or sector) is set is dependent upon both a previous state of the data within the cache line and the type of memory access request received from a requesting device (e.g., a processor). Accordingly, maintaining memory coherency in the MP requires that the processors communicate messages across the system bus indicating their intention to read or write to memory locations. For example, when a processor desires to write data to a memory location, the processor must first inform all other processing elements of its intention to write data to the memory location and receive permission from all other processing elements to carry out the write operation. The permission messages received by the requesting processor indicate that all other cached copies of the contents of the memory location have been invalidated, thereby guaranteeing that the other processors will not access their stale local data.
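The write example above can be sketched as follows. This is a simplified model under stated assumptions: each cache directory is represented as a dictionary mapping an address to a one-letter MESI state, the function name request_write is hypothetical, and only states are tracked. A real system would also have to preserve the data of any modified copy before invalidating it, which this sketch omits.

```python
def request_write(requester, directories, addr):
    """Invalidate all other cached copies of addr, then mark the requester Modified.

    directories maps a cache identifier to that cache's {address: state} directory.
    Returns the identifiers of the caches whose copies were invalidated.
    """
    invalidated = []
    for cache_id, directory in directories.items():
        if cache_id == requester:
            continue
        # Absent entries are treated as Invalid (assumption of this sketch).
        if directory.get(addr, "I") != "I":
            directory[addr] = "I"
            invalidated.append(cache_id)
    # Every other copy is now invalid, so the write may proceed.
    directories[requester][addr] = "M"
    return invalidated
```

For instance, if two caches hold an address in the Shared state and one of them requests a write, the other copy becomes Invalid and the requester's copy becomes Modified.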
In some MP systems, the cache hierarchy includes multiple levels. The level one (L1) cache is usually a private cache associated with a particular processor core in the MP system. The processor core first looks for data in the L1 cache. If the requested data block is not in the L1 cache, the processor core then accesses the level two (L2) cache. This process continues until all levels in the cache hierarchy have been referenced before system memory is accessed. Some of the cache levels (e.g., the level three or L3 cache) may be shared by multiple caches of other levels of the hierarchy (e.g., an L3 cache may be shared by multiple L2 caches). Generally, as the size of a cache increases, its speed decreases accordingly. Therefore, it is advantageous for system performance to keep data in L1 and L2 caches whenever possible.
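The level-by-level search described above can be sketched as a simple loop. The function name lookup and the representation of each cache level as a dictionary are assumptions made for illustration; coherency states and timing are ignored.

```python
def lookup(addr, levels, memory):
    """Search each cache level in order and fall back to system memory.

    levels is an ordered list of (name, contents) pairs, nearest level first.
    Returns a (source, data) pair naming where the data was found.
    """
    for name, contents in levels:
        if addr in contents:
            return name, contents[addr]
    # Every level missed, so the request is satisfied from system memory.
    return "memory", memory[addr]
```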
Some modern MP systems employ a victim cache approach in the upper levels of the memory hierarchy. Implementing the L3 cache memory as a victim cache enables better utilization of cache capacity. Furthermore, since the L3 cache memory does not need to store all the contents of the L2 cache memories, an L3 victim cache defines a separate data path between the processor core and system memory. This configuration can better accommodate the bandwidth requirements of multi-core processors and increased numbers of hardware-managed program threads.
Typically, when a congruence class storage location is needed in one of the L1 or L2 caches, the lines of data to be replaced are “evicted,” that is, written to another level of cache for storage. However, in an MP system with a hierarchy having multiple separate caches, there may be several copies of the same data residing in the memory hierarchy at the same time. The policy of evicting lines to provide more space in the L1 and L2 caches may therefore result in unnecessary writes to other levels of cache, which consume additional bus and cache bandwidth. In such cases, invalidating the data in the needed storage location may be more efficient.
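The difference between evicting and invalidating a line can be illustrated with a minimal sketch. The function name free_slot and the dictionary model of the L2 and L3 caches are hypothetical, and the sketch assumes the line being replaced is unmodified, so that dropping it loses no data. It is not intended to represent any particular replacement policy.

```python
def free_slot(l2, l3, addr, invalidate_if_duplicate=True):
    """Free the L2 storage location holding addr.

    If a copy already resides in L3, the L2 line is simply dropped
    (invalidated) and no write to L3 occurs. Otherwise the line is
    evicted, meaning it is written to L3 before being removed from L2.
    Assumes the L2 line is unmodified.
    """
    data = l2.pop(addr)
    if invalidate_if_duplicate and addr in l3:
        return "invalidated"  # no L3 write needed
    l3[addr] = data           # eviction costs one write to L3
    return "evicted"
```

With invalidate_if_duplicate set to False, every replacement writes to L3 even when an identical copy is already present there, which corresponds to the unnecessary writes mentioned above.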
However, invalidating rather than evicting does not always result in system performance gains. Therefore, there is a need for a system and method that dynamically detects conditions in a data processing system and then uses that information to tailor memory hierarchy eviction and invalidation practices to maximize system performance.