1. Field of the Invention
This invention relates to high performance computing network systems, and more particularly, to maintaining efficient cache coherency across multi-processor nodes.
2. Description of the Relevant Art
In modern microprocessors, one or more processor cores, or processors, may be included in the microprocessor, wherein each processor is capable of executing instructions in a superscalar pipeline. The microprocessor may be coupled to one or more levels of a cache hierarchy in order to reduce the latency of the microprocessor's requests for data in memory for read or write operations. Generally, a cache may store one or more blocks, each of which is a copy of data stored at a corresponding address in the system memory. As used herein, a “block” is a set of bytes stored in contiguous memory locations, which are treated as a unit for coherency purposes. In some embodiments, a block may also be the unit of allocation and deallocation in a cache. The number of bytes in a block may be varied according to design choice, and may be of any size. As an example, 32 byte and 64 byte blocks are often used.
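As a minimal illustration of the block definition above, the following Python sketch splits a byte address into the base address of its block and an offset within that block. The 64 byte default and the function names are assumptions chosen for illustration, and the masking only works when the block size is a power of two.

```python
BLOCK_SIZE = 64  # bytes; illustrative choice, must be a power of two


def block_base(addr: int, block_size: int = BLOCK_SIZE) -> int:
    """Return the address of the first byte of the block containing addr."""
    return addr & ~(block_size - 1)


def block_offset(addr: int, block_size: int = BLOCK_SIZE) -> int:
    """Return the position of addr within its block."""
    return addr & (block_size - 1)


def same_block(a: int, b: int, block_size: int = BLOCK_SIZE) -> bool:
    """Two addresses are treated as a unit for coherency if they share a block."""
    return block_base(a, block_size) == block_base(b, block_size)
```

With 64 byte blocks, addresses 0x1200 through 0x123F fall in the same block, so a write to any of them affects the coherency state of all of them.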
In order to increase computing performance, a computing system may increase parallel processing by comprising subsystems such as processing nodes, each node including one or more microprocessors. Each microprocessor within a processing node, or node, may have its own cache hierarchy. Also, each node may have a higher level of cache hierarchy shared among multiple microprocessors. For example, in one embodiment, a node may comprise two microprocessors, each with a corresponding level one (L1) cache. The node may have an L2 cache shared by the two microprocessors. A memory controller or other interface may couple each node to other nodes in the computing system, to a higher level of cache hierarchy, such as an L3 cache, shared among the multiple nodes, and to dynamic random-access memory (DRAM), dual in-line memory modules (DIMMs), a hard disk, or otherwise. In alternative embodiments, different variations of components and coupling of the components may be used.
Since a given block may be stored in one or more caches, and further since one of the cached copies may be modified with respect to the copy in the memory system, computing systems often maintain coherency between the caches and the memory system. Coherency is maintained if an update to a block is reflected by other cache copies of the block according to a predefined coherency protocol. Various specific coherency protocols are well known.
Many coherency protocols include the use of messages, or probes, passed from a coherency point, such as a memory controller, to communicate between various caches within the computing system. A coherency point may transmit probes in response to a command from a component (e.g. a processor) to read or write a block. Probes may be used to determine if the caches have a copy of a block and optionally to indicate the state into which the cache should place the block. Each probe receiver responds to the probe, and once all probe responses are received the command may proceed to completion.
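The probe handshake described above can be sketched as a small bookkeeping structure at the coherency point. This is a simplified model, not a description of any particular memory controller: the class name `PendingCommand` and its methods are hypothetical, and real hardware would also track the data returned with each response.

```python
from dataclasses import dataclass, field


@dataclass
class PendingCommand:
    """Tracks one read or write command at a hypothetical coherency point."""
    block: int
    waiting_on: set = field(default_factory=set)  # receivers yet to respond

    def send_probes(self, receivers):
        """Record every receiver that has been sent a probe for this block."""
        self.waiting_on = set(receivers)

    def receive_response(self, receiver):
        """Mark one probe receiver as having responded."""
        self.waiting_on.discard(receiver)

    def may_complete(self):
        """The command may proceed only once all probe responses are in."""
        return not self.waiting_on
```

The point of the sketch is the last method: a single slow receiver holds up the whole command, which is why the number of probes sent matters for latency as well as for bandwidth.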
Computer systems generally employ either a broadcast cache coherency protocol or a directory based cache coherency protocol. In a system employing a broadcast protocol, probes are broadcast to all processors (or cache subsystems). When a subsystem having a shared copy of data observes a probe resulting from a command for exclusive access to the block, its copy is typically invalidated. Likewise, when a subsystem that currently owns a block of data observes a probe corresponding to that block, the owning subsystem typically responds by providing the data to the requester and invalidating its copy, if necessary.
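A minimal sketch of the broadcast behavior, assuming a simplified three-state model, is given below. The state names, the `caches` dictionary layout, and the function `broadcast_exclusive` are illustrative assumptions; `OWNED` stands in for any state in which a cache owns the block.

```python
from enum import Enum


class State(Enum):
    INVALID = "I"
    SHARED = "S"
    OWNED = "O"  # stands in for any state in which the cache owns the block


def broadcast_exclusive(caches, requester, block):
    """Probe every cache except the requester for exclusive access to block.

    caches maps a cache name to a dict of {block: State}.
    Returns (number of probes sent, name of the cache that supplied data or None).
    """
    probes = 0
    supplier = None
    for name, lines in caches.items():
        if name == requester:
            continue
        probes += 1  # a probe is sent whether or not this cache holds a copy
        state = lines.get(block, State.INVALID)
        if state is State.OWNED:
            supplier = name  # the owner provides the data to the requester
        if state is not State.INVALID:
            lines[block] = State.INVALID  # shared and owned copies are invalidated
    caches[requester][block] = State.OWNED
    return probes, supplier
```

In a four-cache system the function always sends three probes, even when only one or two of the other caches hold the block.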
In contrast, systems employing directory based protocols maintain a directory containing information indicating the existence of cached copies of data. Rather than unconditionally broadcasting probes, the directory information is used to determine particular subsystems (that may contain cached copies of the data) to which probes need to be conveyed in order to cause specific coherency actions. For example, the directory may contain information indicating that various subsystems contain shared copies of a block of data. In response to a command for exclusive access to that block, invalidation probes may be conveyed to the sharing subsystems. The directory may also contain information indicating subsystems that currently own particular blocks of data. Accordingly, responses to commands may additionally include probes that cause an owning subsystem to convey data to a requesting subsystem. Numerous variations of directory based cache coherency protocols are well known.
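The filtering role of the directory can be sketched as follows. The sketch assumes a full directory that records one owner and a set of sharers per block; the dictionary layout, the action strings, and the function name `probes_for_exclusive` are hypothetical and do not correspond to any specific protocol.

```python
def probes_for_exclusive(directory, requester, block):
    """Return the list of (target, action) probes for an exclusive-access command.

    directory maps block -> {"owner": name or None, "sharers": set of names}.
    A block with no entry is assumed to be uncached everywhere.
    """
    entry = directory.get(block, {"owner": None, "sharers": set()})
    probes = []
    owner = entry["owner"]
    if owner is not None and owner != requester:
        # The owning subsystem conveys the data and gives up its copy.
        probes.append((owner, "forward_data_and_invalidate"))
    for sharer in sorted(entry["sharers"] - {requester, owner}):
        # Sharing subsystems only need to drop their copies.
        probes.append((sharer, "invalidate"))
    # Record the requester as the new sole owner of the block.
    directory[block] = {"owner": requester, "sharers": set()}
    return probes
```

Subsystems that the directory does not list for the block receive no probe at all, which is the source of the traffic savings relative to broadcast.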
Since probes must be broadcast to all other processors in systems that employ broadcast cache coherency protocols, the bandwidth associated with the network that interconnects the processors can quickly become a limiting factor in performance, particularly for systems that employ large numbers of processors or when a large number of probes are transmitted during a short period. In addition to a possible bandwidth issue, latency of memory accesses may increase due to probes. For example, when a processor performs a memory request that misses in the processor's cache hierarchy, the required data may be retrieved from DRAM and returned to the memory controller prior to the completion of all the probes. Because the request cannot complete until every probe response has been received, the data waits at the memory controller and the latency of the memory access increases.
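Both effects can be captured in a first-order model. The sketch below assumes that the DRAM access and the probes are issued in parallel and that the request completes only when both have finished; the function names and any numbers passed to them are hypothetical.

```python
def broadcast_probes_per_miss(num_processors: int) -> int:
    """Under broadcast, every miss probes all other processors."""
    return max(num_processors - 1, 0)


def total_probe_traffic(num_processors: int, misses: int) -> int:
    """Probe messages generated by a given number of misses."""
    return misses * broadcast_probes_per_miss(num_processors)


def miss_latency(dram_ns: float, probe_round_trips_ns: list) -> float:
    """Latency of a miss when DRAM and the probes proceed in parallel.

    The request finishes when the slower of the two paths finishes, so a
    probe that outlasts the DRAM access sets the latency.
    """
    slowest_probe = max(probe_round_trips_ns, default=0.0)
    return max(dram_ns, slowest_probe)
```

Under these assumptions, a DRAM access of 60 ns combined with probe round trips of 40, 95, and 70 ns yields a miss latency of 95 ns, and removing the probes brings it back to 60 ns.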
Directory based protocols reduce the number of probes contributing to network traffic by conditionally sending probes, rather than unconditionally sending them. Therefore, systems employing directory based protocols may attain overall higher performance due to lessened network traffic and reduced latencies of memory requests. However, while directory based systems may allow for more efficient cache coherency protocols, additional hardware is often required.
The directory based protocol often includes a directory cache that may be implemented on an Application Specific Integrated Circuit (ASIC) or other semi-custom chip separate from the processor. When the directory cache is implemented on a separate chip, the overall cost of the system may increase, as well as board requirements, power consumption, and cooling requirements. On the other hand, incorporation of a directory cache on the same chip as the processor core may be undesirable, particularly for commodity microprocessors intended for use in either single processor or multiple processor systems. When used in a single processor system, the directory cache would go unused, thus wasting valuable die area and adding cost due to decreased yield.
A third alternative stores directory entries in designated locations of a cache memory subsystem, such as an L2 cache, associated with a processor core. For example, a designated way of the cache memory subsystem may be allocated for storing directory entries, while the remaining ways of the cache are used to store normal processor data. In one particular implementation, directory entries are stored within the cache memory subsystem to provide indications of lines (or blocks) that may be cached in modified, exclusive, or owned coherency states. The absence of a directory entry for a particular block may imply that the block is cached in either shared or invalid states. Further details may be found in P. Conway, Computer System with Integrated Directory and Processor Cache, U.S. Pat. No. 6,868,485, 2005.
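One way to read the scheme above is as a sparse lookup in which a missing entry still carries information. The following sketch is an assumption-laden illustration, not the method of the cited patent: the function `lookup`, the tuple layout, and in particular the policy chosen for absent entries and for writes to owned blocks are hypothetical.

```python
OWNERSHIP_STATES = {"M", "E", "O"}  # states that receive a directory entry


def lookup(directory_way, block, command):
    """Decide which probes a command needs from a sparse directory.

    directory_way maps block -> (state, owner) only for blocks cached in
    modified, exclusive, or owned states. command is "read" or "write".
    Returns ("directed", owner), ("broadcast", None), or ("none", None).
    """
    entry = directory_way.get(block)
    if entry is not None:
        state, owner = entry
        if state not in OWNERSHIP_STATES:
            raise ValueError("directory holds only M, E, or O entries")
        if state == "O" and command == "write":
            # Assumed policy: an owned block may also have sharers, and the
            # sparse directory does not say where they are.
            return ("broadcast", None)
        return ("directed", owner)  # a single probe to the owning cache
    if command == "read":
        # Absence implies shared or invalid, so the memory copy is current.
        return ("none", None)
    # Absence on a write: possible sharers are unknown and must be invalidated.
    return ("broadcast", None)
```

Under this assumed policy, reads to blocks without an entry generate no probes, while writes to such blocks fall back to broadcast.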
However, the third alternative is not able to provide a high coverage ratio without occupying a significant portion of a frequently used cache. If a significant portion is used for the directory, then fewer lines are available for data within the cache. Therefore, more cache misses, such as capacity and conflict misses, may occur. In order to reduce the amount of cache space used for the directory, lines in certain states may be left without a directory entry. However, the absence of a directory entry for a block may cause probes to be sent and increase network traffic.
In view of the above, efficient methods and mechanisms for a cache coherency protocol across multi-processor nodes are desired.