This application is related to, and hereby incorporates by reference, the following U.S. patent applications:
Multiprocessor Cache Coherence System And Method in Which Processor Nodes And Input/output Nodes Are Equal Participants, Ser. No. 09/878,984, filed Jun. 11, 2001;
Scalable Multiprocessor System And Cache Coherence Method, Ser. No. 09/878,982, filed Jun. 11, 2001;
System and Method for Daisy Chaining Cache Invalidation Requests in a Shared-memory Multiprocessor System, Ser. No. 09/878,985, filed Jun. 11, 2001;
Cache Coherence Protocol Engine And Method For Processing Memory Transaction in Distinct Address Subsets During Interleaved Time Periods in a Multiprocessor System, Ser. No. 09/878,983, filed Jun. 11, 2001;
System And Method For Generating Cache Coherence Directory Entries And Error Correction Codes in a Multiprocessor System, Ser. No. 09/972,477, filed Oct. 5, 2001, which claims priority on U.S. provisional patent application 60/238,330, filed Oct. 5, 2000, which is also hereby incorporated by reference in its entirety.
The present invention relates generally to the design of cache coherence protocol directories, and particularly to the minimization of directory information required in the context of logically independent input/output nodes.
When multiple processors with separate caches share a common memory, it is necessary to keep the caches in a state of coherence by ensuring that cached copies of shared memory lines of information are invalidated when changed by another processor. This is done in one of two ways: through a directory-based system or a snooping system. In a directory-based system, sharing information is placed in a directory that maintains the coherence between caches. The directory acts as a filter through which a processor must ask permission to load an entry from a primary memory to a cache. In a snooping (i.e., snoop-based) system, each cache monitors (i.e., snoops) a bus for requests for memory lines of information broadcast on the bus, and responds if it is able to satisfy the request.
Additionally, the common bus-based design for most small-scale multiprocessor systems is not used for larger-scale multiprocessors because current buses do not accommodate the bandwidth requirements of the high-performance processors typically included in larger-scale multiprocessor systems. Large-scale multiprocessor systems, therefore, use a more scalable interconnect that provides point-to-point connections between processors.
However, the more scalable interconnect does not include broadcast capabilities. Large-scale multiprocessors cannot, therefore, use a snoop-based cache-coherence protocol.
Instead, large-scale multiprocessors typically use a directory-based cache-coherence protocol. As indicated above, a directory is a cache-coherence protocol data structure that maintains information about which processors are caching one or more memory lines of information in the system. This information is used by the cache-coherence protocol to invalidate cached copies of a memory line of information when the contents of the memory line of information are modified (i.e., subject to a request for exclusive ownership). A common directory implementation is to use a full bit vector, wherein each bit indicates whether a corresponding processor is caching a copy of an associated memory line of information.
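The full bit vector described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular system; the class name, method names, and the eight-node example are assumptions introduced here.

```python
from typing import List


class FullBitVectorEntry:
    """Directory entry for one memory line: bit i set means node i caches the line."""

    def __init__(self, num_nodes: int) -> None:
        self.num_nodes = num_nodes
        self.bits = 0  # one bit per node, all clear initially

    def add_sharer(self, node: int) -> None:
        self.bits |= 1 << node

    def sharers(self) -> List[int]:
        return [n for n in range(self.num_nodes) if (self.bits >> n) & 1]

    def request_exclusive(self, requester: int) -> List[int]:
        """Return the nodes whose copies must be invalidated, then record
        the requester as the only node holding the line."""
        to_invalidate = [n for n in self.sharers() if n != requester]
        self.bits = 1 << requester
        return to_invalidate


# Hypothetical example: nodes 1, 3 and 5 share a line, then node 3 asks to modify it.
entry = FullBitVectorEntry(num_nodes=8)
for node in (1, 3, 5):
    entry.add_sharer(node)
invalidated = entry.request_exclusive(requester=3)
```

Because each bit maps to exactly one node, the invalidation list contains only nodes that actually hold a copy.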
Furthermore, large-scale multiprocessor systems typically include input/output (I/O) devices that are connected to one or more processor nodes, which manage any connected I/O devices and process requests from other processor nodes directed to any connected I/O devices.
There are two alternatives with respect to how data maintained by an I/O device is accessed by other processor nodes. In some large-scale multiprocessor systems, no distinction is made between a processor included in a processor node and an I/O device connected to the processor node. In these systems, the processor node determines whether a particular request is routed to an included processor or a connected I/O device.
In other large-scale multiprocessor systems, requests indicate whether the request is directed to a processor or an I/O device. In these systems, a directory must include information that distinguishes between processors and I/O devices.
In still other large-scale multiprocessor systems, I/O devices are connected “directly” to the network that interconnects the processor nodes of the multiprocessor system (“interconnection network”) through I/O nodes. The I/O devices connected to the I/O nodes are, therefore, accessed efficiently by all processor nodes. More specifically, the ability to access an I/O device is not limited by the ability of a processor node to process requests directed to a connected I/O device and requests directed to an included processor. These I/O nodes typically include caches to reduce the need to transfer data to and from other processor and I/O nodes and, therefore, participate in the cache-coherence protocol.
In balanced, large-scale multiprocessor systems, the number of I/O nodes is equal to, or nearly equal to, the number of processor nodes. Requiring directories to include information to distinguish between I/O and processor nodes requires, therefore, a potentially large increase in the size of the directories. This is particularly true for full bit vectors, in which each bit is never associated with more than one node. In such systems, the directories include perfect sharing information (i.e., each node sharing a memory line of information is identifiable). For example, if the number of I/O nodes equals the number of processor nodes and an extra bit is required for each of the I/O nodes, the size of the directory roughly doubles.
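The doubling noted above follows from simple counting. A minimal sketch, assuming one directory bit per tracked node; the function name and the 32-node figures are illustrative only.

```python
def full_vector_bits_per_entry(processor_nodes: int, io_nodes: int) -> int:
    """Bits per directory entry when every tracked node gets its own bit."""
    return processor_nodes + io_nodes


# Hypothetical balanced system: 32 processor nodes and 32 I/O nodes.
processors_only = full_vector_bits_per_entry(32, 0)
balanced = full_vector_bits_per_entry(32, 32)
growth = balanced / processors_only  # factor by which each entry grows
```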
But the addition of I/O nodes is also an issue for systems that support coarse-vector directory formats. In such systems, the issue is not additional directory bits, but rather the coarseness of the directory entries. As described more fully below, a single bit in a directory using the coarse-vector format may be associated with one or more nodes. Increasing the number of nodes but not the number of bits results in an increase in the number of nodes associated with each such bit. As a result, a greater number of invalidation acknowledgments are required when an exclusive request is received, even though only one of the nodes associated with a given bit actually shares the corresponding memory line of information.
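The effect of coarseness on invalidation traffic can be illustrated with a small sketch. It assumes, purely for illustration, that nodes are assigned to bits in contiguous groups by node number; the actual mapping used by a given coarse-vector format may differ, and the function names are hypothetical.

```python
import math
from typing import Iterable, List


def nodes_per_bit(num_nodes: int, num_bits: int) -> int:
    """Number of nodes that share a single directory bit."""
    return math.ceil(num_nodes / num_bits)


def invalidation_targets(sharers: Iterable[int], num_nodes: int, num_bits: int) -> List[int]:
    """Nodes that must be sent an invalidation on an exclusive request.

    Every node mapped to a set bit is a target, whether or not it
    actually holds a copy of the line.
    """
    group = nodes_per_bit(num_nodes, num_bits)
    set_bits = sorted({node // group for node in sharers})
    targets: List[int] = []
    for bit in set_bits:
        first = bit * group
        targets.extend(range(first, min(first + group, num_nodes)))
    return targets


# Hypothetical figures: a 32-bit vector, with node 5 as the only real sharer.
before_io = invalidation_targets([5], num_nodes=32, num_bits=32)
after_io = invalidation_targets([5], num_nodes=64, num_bits=32)
```

With 32 nodes each bit names exactly one node, so only node 5 is invalidated. Adding 32 I/O nodes without adding bits puts two nodes behind each bit, so nodes 4 and 5 must both be invalidated and must both acknowledge.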
Thus, connecting I/O devices “directly” to the interconnection network of a large-scale multiprocessor system through I/O nodes presents problems for directory structures regardless of the particular directory format used.
Another important observation is the distinction between the way in which a processor node and an I/O node access memory lines of information. I/O nodes (i.e., I/O devices) do not typically access the same data over and over, as is the case with processor nodes. Instead, I/O nodes tend to access data sequentially and use caches to exploit the spatial locality in their accesses. In other words, caches improve the performance of I/O nodes by ensuring that there is only one miss per memory line of information as the I/O nodes sequentially access data. Once an I/O node has accessed all the data in a particular memory line of information, the I/O node will typically not access the same memory line of information in the near term. The present invention exploits this aspect of I/O nodes to conserve resources allocated to manage the sharing of memory lines of information by I/O nodes without substantially impacting the performance of the I/O nodes.
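The one-miss-per-line behavior of sequential access can be checked with a short sketch. It models, as a simplifying assumption, a cache that holds only the most recently fetched line; the line size and access pattern are hypothetical.

```python
from typing import Iterable, Optional


def count_misses(addresses: Iterable[int], line_size: int) -> int:
    """Count misses for a cache that keeps only the most recently fetched line."""
    current_line: Optional[int] = None
    misses = 0
    for address in addresses:
        line = address // line_size
        if line != current_line:
            misses += 1          # line not cached: fetch it
            current_line = line  # the previous line is dropped
    return misses


# 256 bytes read sequentially in 8-byte units, with 64-byte memory lines.
sequential_misses = count_misses(range(0, 256, 8), line_size=64)
```

The 32 sequential accesses touch four memory lines and incur four misses. Once the stream moves past a line that line is never needed again, which is why an I/O node gives up little by holding a line only briefly.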
A system of scalable shared-memory multiprocessors includes processor nodes and I/O nodes. The I/O nodes connect I/O devices directly to an interconnection network of a system of scalable shared-memory multiprocessors. Each node of the system includes an interface to a local memory subsystem, a memory cache and a protocol engine. The local memory subsystem stores memory lines of information and a directory. Each entry in the directory stores sharing information concerning a memory line of information stored in the local memory subsystem. The protocol engine of each node includes a memory transaction array for storing an entry related to a memory transaction concerning a memory line of information, and logic for processing the memory transaction, including advancing the memory transaction when predefined criteria are satisfied and storing a state of the memory transaction in the memory transaction array. The protocol engine included in each I/O node is configured to limit to a predefined period of time any sharing of a memory line of information from the memory subsystem of any other node. The protocol engine included in the home node of the memory line is configured to identify only nodes other than I/O nodes that are sharing the memory line of information. In one embodiment, I/O nodes that share the memory line of information are not identified in the directory entry of the memory line, and instead are represented by a count field, which indicates how many I/O nodes share the memory line of information.
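The count-field embodiment and the time-limited sharing by I/O nodes can be sketched together. This is a simplified model under stated assumptions, not the described protocol engine: the class names, the integer time units, and in particular the explicit release call by which the home node's count is decremented are all hypothetical.

```python
from typing import Dict, List, Set


class DirectoryEntry:
    """Home-node sharing state for one memory line (illustrative layout)."""

    def __init__(self) -> None:
        self.processor_sharers: Set[int] = set()  # non-I/O nodes, identified individually
        self.io_sharer_count = 0                  # I/O nodes, counted but not identified

    def add_sharer(self, node_id: int, is_io_node: bool) -> None:
        if is_io_node:
            self.io_sharer_count += 1
        else:
            self.processor_sharers.add(node_id)

    def release_io_sharer(self) -> None:
        # Assumption: invoked once for each I/O node whose sharing period has ended.
        if self.io_sharer_count == 0:
            raise ValueError("no I/O sharers recorded")
        self.io_sharer_count -= 1


class IoNodeCache:
    """I/O-node cache that holds each shared line for a bounded time."""

    def __init__(self, max_share_time: int) -> None:
        self.max_share_time = max_share_time
        self.expiry: Dict[int, int] = {}  # line address -> time at which sharing must end

    def load_shared(self, address: int, now: int) -> None:
        self.expiry[address] = now + self.max_share_time

    def expire(self, now: int) -> List[int]:
        """Drop and return every line whose sharing period has elapsed."""
        expired = sorted(a for a, t in self.expiry.items() if t <= now)
        for address in expired:
            del self.expiry[address]
        return expired
```

In this sketch the directory entry never stores the identity of an I/O node, so adding I/O nodes does not widen the entry or coarsen its sharer vector.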