1. Technical Field of the Invention
This invention relates to digital parallel processing systems, wherein a plurality of nodes communicate via messages over an interconnection network and share the entire memory of the system. In particular, this invention deals with distributing the shared memory amongst all the system nodes, such that each node implements a portion of the entire memory. More specifically, the invention relates to a tightly coupled system including local caches at each node, and a method for maintaining cache coherency efficiently across a network using distributed directories, invalidation, read requests, and write-thru updates.
2. Background Art
As more and more processor performance is demanded for computing and server systems, shared memory processors (SMPs) are becoming an important option for providing better performance. SMPs comprise a plurality of processors that share a common memory pool with a part or most of the memory pool being remote from each processor. There are basically two types of multiprocessing systems: tightly coupled and loosely coupled. In a tightly coupled multiprocessor, the shared memory is used by all processors and the entire system is managed by a single operating system. In a loosely coupled multiprocessor, there is no shared memory and each processor has an exclusive memory, which can be loaded from the network if desired.
For either tightly or loosely coupled systems, the accessing of memory from a remote node or location is essential. Accessing remote memory is much slower than accessing local memory and requires performance enhancement techniques to make the remote access feasible. The first performance technique uses local caches (usually several levels of cache) at each processor. Cache memories are well known in the art for being a high performance local memory and alleviating traffic problems at the shared memory or network. A cache memory comprises a data array for caching a data line retrieved from the shared memory, where a cache data line is the basic unit of transfer between the shared memory and the cache. Since the cache size is limited, the cache also includes a directory for mapping the cache line from shared memory to a location within the cache data array. The cache contains either instructions or data, which sustain the processor's need over a period of time before a refill of the cache lines is required. If the data line is found in the cache, then a cache "hit" is said to have occurred. Otherwise, a cache "miss" is detected and refill of a cache line is required, where the refill replaces a cache line that has been least recently used. When a multi-processing system is comprised of distributed shared memory, the refill can come from the local shared memory or remote shared memory resident in a different node on the network. Conventionally, caches have been classified as either "write-back" or "write-thru". For a write-thru cache, changed data is immediately stored to shared memory, so that the most recent data is always resident in the shared memory. For a write-back cache, changed data is held in the cache and only written back to shared memory when it is requested by another node or replaced in the cache.
The execution of programs and the fetching of variables from shared memory at a remote node takes many processor cycle times (15 cycles at best and usually a lot more). The larger the system, the larger the distance to the remote memory, the more chance of conflict in the interconnection scheme, and the more time wasted when fetching from remote memory.
A second performance enhancement technique becoming popular is multi-threading, as disclosed by Nikhil et al in U.S. Pat. No. 5,499,349 "Pipelined Processor using Tokens to Indicate the Next Instruction for Each Multiple Thread of Execution" and N. P. Holt in U.S. Pat. No. 5,530,816 "Data Processing System for Handling Multiple Independent Data-driven Instruction Streams". The multi-threading technique uses the time when the processor becomes stalled because it must fetch data from remote memory, and switches the processor to work on a different task (or thread).
Traditionally, cache coherency is controlled by using a multi-drop bus to interconnect the plurality of processors and the remote memory, as disclosed by Wilson, Jr. et al in U.S. Pat. No. 4,755,930, "Hierarchical Cache Memory System and Method". Using a multi-drop bus, cache updating is a rather simple operation. Since the bus drives all processors simultaneously, each processor can "snoop" the bus for store operations to remote memory. Anytime a variable is stored to remote memory, each processor "snoops" the store operation by capturing the address of remote memory being written. It then searches its local caches to determine whether a copy of that variable is present. If it is, the variable is replaced or invalidated. If it is not, no action is taken.
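By way of illustration only, the snooping behavior described above may be sketched as follows. The names `Bus` and `SnoopingCache` are hypothetical, and the sketch assumes that a snooped copy is invalidated rather than replaced.

```python
class SnoopingCache:
    """Local cache that watches every store driven on the shared bus."""

    def __init__(self):
        self.lines = {}  # address -> cached value

    def snoop_store(self, addr):
        # If a copy of the stored variable is present, invalidate it;
        # if it is not present, no action is taken.
        self.lines.pop(addr, None)


class Bus:
    """Multi-drop bus: every attached cache sees every store operation."""

    def __init__(self, memory):
        self.memory = memory  # hypothetical stand-in for remote memory
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def store(self, source, addr, value):
        self.memory[addr] = value
        source.lines[addr] = value
        for cache in self.caches:
            if cache is not source:
                cache.snoop_store(addr)
```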
Cache coherency is not so easy over networks. This is because a network cannot be snooped. A network establishes multiple connections at any time; however, each connection is between two of the plurality of nodes. Therefore, except for the two nodes involved in the transfer of data, the other nodes do not see the data and cannot snoop it. It is possible to construct a network that operates only in broadcast mode, so that every processor sees every data transfer in the system. J. Sandberg teaches this approach in U.S. Pat. No. 5,592,625, "Apparatus for Providing Shared Virtual Memory Among Interconnected Computer Nodes with Minimal Processor Involvement". Sandberg uses only writes over the network to broadcast any change in data to all nodes, causing all nodes to update the changed variable to its new value. Sandberg does not invalidate or read data over the network, as his solution assumes that each node has a full copy of all memory and there is never a need to perform a remote read over the network. Sandberg's write operation over the network to update the variables at all nodes negates the need for invalidation because he opts to replace instead of invalidate. This defeats the major advantage of a network over a bus; i.e., the capability to perform many transfers in parallel is lost since only one broadcast is allowed in the network at a time. Thus, Sandberg's approach reduces the network to having the performance of a serial bus and restricts it to performing only serial transfers--one transfer at a time. This effectively negates the parallel nature of the system and makes it of less value.
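By way of illustration only, the loss of parallelism under broadcast-only operation may be sketched by counting the time slots needed to complete a set of transfers. The function names are hypothetical, and the sketch assumes an idealized network in which two transfers conflict only when they share an endpoint node.

```python
def broadcast_slots(transfers):
    """Each broadcast occupies the whole network, so transfers run one at a time."""
    return len(transfers)


def point_to_point_slots(transfers):
    """Greedily pack transfers with disjoint endpoints into the same time slot."""
    slots = []  # each entry is the set of nodes busy during that slot
    for src, dst in transfers:
        for busy in slots:
            if src not in busy and dst not in busy:
                busy.update((src, dst))
                break
        else:
            # No existing slot can carry this transfer, so open a new one.
            slots.append({src, dst})
    return len(slots)
```

Under these assumptions, four transfers between four disjoint pairs of nodes need four slots in broadcast mode but only one slot when the network carries them in parallel.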
A further problem with SMP systems is that they experience performance degradation when being scaled to systems having many nodes. Thus, state-of-the-art SMP systems typically use only a small number of nodes. This typical approach is taught by U.S. Pat. No. 5,537,574, "Sysplex Shared Data Coherency Method" by Elko et al, and allows shared memory to be distributed across several nodes with each node implementing a local cache. Cache coherency is maintained by a centralized global cache and directory, which controls the read and store of data and instructions across all of the distributed and shared memory. No network is used; instead, each node has a unique tail to the centralized global cache and directory, which controls the transfer of all global data and tracks the cache coherency of the data. This method works well for small systems but becomes unwieldy for middle or large scale parallel processors, as a centralized function causes serialization and defeats the parallel nature of SMP systems.
A similar system having a centralized global cache and directory is disclosed in U.S. Pat. No. 5,537,569, "Multiprocessor System Utilizing a Directory Memory and Including Grouped Processing Elements Each Having Cache" by Y. Masubuchi. Masubuchi teaches a networked system where a centralized global cache and directory is attached to one node of the network. On the surface, Masubuchi seems to have a more general solution than that taught by Elko in U.S. Pat. No. 5,537,574, because Masubuchi includes a network for scalability. However, the same limitations of a centralized directory apply and defeat the parallel nature of SMP systems based upon Masubuchi.
The caching of remote or global variables, along with their cache coherency, is of utmost importance to high performance multi-processor systems. Since snoopy protocols broadcasting write only messages or using one central directory are not tenable solutions for scalability to a larger number of nodes, there is a trend to use directory-based protocols for the latest SMP systems. The directory is associated with the shared memory and contains information as to which nodes have copies of each cache line. A typical directory is disclosed by M. Dubois et al, "Effects of Cache Coherency in Multiprocessors", IEEE Transactions on Computers, vol.C-31, no. 11, November, 1982. Typically, the lines of data in the cache are managed by the cache directory, which invalidates and casts out data lines which have been modified. All copies of the data line are invalidated throughout the system by an invalidation operation, except the currently changed copy is not invalidated.
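By way of illustration only, a directory that records which nodes hold copies of each cache line, and that determines which copies to invalidate on a store, may be sketched as follows. The class name `Directory` and its methods are hypothetical and are not taken from the cited reference.

```python
class Directory:
    """Tracks, per cache line, which nodes currently hold a copy."""

    def __init__(self):
        self.sharers = {}  # line address -> set of node ids holding a copy

    def record_read(self, node, addr):
        self.sharers.setdefault(addr, set()).add(node)

    def record_store(self, writer, addr):
        """Return the nodes whose copies must be invalidated by this store."""
        current = self.sharers.get(addr, set())
        stale = current - {writer}
        # The currently changed copy is not invalidated.
        self.sharers[addr] = current & {writer}
        return sorted(stale)
```

Because the directory names the specific nodes holding stale copies, invalidations in this sketch can be sent as individual messages to those nodes only.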
In related art, loosely coupled computer systems have been disclosed for transferring large blocks or records of data from disk drives to be stored and instructions executed at any node of the system. In U.S. Pat. No. 5,611,049, "System for Accessing Distributed Data Cache Channel at Each Network Node to Pass Requests and Data" by W. M. Pitts, Pitts teaches a special function node called a Network Distributed Cache (NDC) site on the network which is responsible for accessing and caching large blocks of data from the disk drives, designating each block as a data channel, forwarding the data to requesting nodes, and maintaining coherency if more than one node is using the data. The system is taught for local area networks, wherein nodes share large blocks of data, and the shared memory is the storage provided by the NDC. This is a good approach for local area networks and loosely coupled computer systems, but would cause unacceptably long delays between distributed shared memory nodes of tightly coupled parallel processing nodes.
Baylor et al in U.S. Pat. No. 5,313,609, "Optimum Write-back Strategy for Directory-Based Cache Coherence Protocols" teaches a system of tightly coupled processors. Baylor solves the problem of a single shared, centralized memory being a bottleneck, when all processors collide while accessing the single shared memory unit. Baylor disperses and partitions the shared memory into multiple (n) shared memory units, each uniquely addressable and having its own port to/from the network. This spreads the traffic over n shared memory modules, and greatly improves performance. Baylor organizes the system by placing all the processing nodes on one side of the network and all the shared memory units on the other side of the network, which is a normal view of a shared memory system having multiple processors and multiple shared memory units. However, this organization is not designed for the computers in the field today, which combine processors and memory at the same node of the network. To provide cache coherency, Baylor uses write-back caches and distributed "global directories", which are a plurality of directories, one associated with each shared memory unit. Each global directory tracks the status of each cache line in its associated shared memory unit. When a processor requests the cache line, the global directory polls the processors having copies of the requested cache line for changes. The processors write-back to the global directory any modifications to the cache line, and then the global directory returns the updated cache line to the requesting processor. Only shared memory and the requesting node are provided the modified copy of the cache line. Other nodes must periodically request a copy if they wish to stay coherent. The method has the disadvantage of requiring a long access time to shared memory because cache coherency is provided in series with the request for shared memory data.
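By way of illustration only, the in-series read flow attributed to Baylor above may be sketched as follows. The function name, its parameters, and the dictionary representations are assumptions made for this sketch and are not taken from the Baylor reference.

```python
def baylor_style_read(memory, holders, dirty_lines, requester, addr):
    """Poll every holder, collect write-backs, then return the updated line.

    memory      -- address -> value for one shared memory unit
    holders     -- address -> set of processor ids holding a copy
    dirty_lines -- processor id -> {address: modified value} (write-back caches)
    Returns (value, number of processors polled before the data was returned).
    """
    polled = 0
    for node in sorted(holders.get(addr, ())):
        polled += 1  # each poll is completed before the data is returned
        modified = dirty_lines[node].pop(addr, None)
        if modified is not None:
            memory[addr] = modified  # write-back to the shared memory unit
    holders.setdefault(addr, set()).add(requester)
    return memory[addr], polled
```

The returned poll count makes the stated disadvantage concrete in this sketch: the requester waits for every holder to be polled before the data is returned.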
A. Gupta et al in U.S. Pat. No. 5,535,116, "Flat Cache-Only Multiprocessor Architecture" teaches a different directory based cache coherency system with distributed directories, which is the prior art that is most similar to the present invention. However, Gupta's invention is targeted towards Attraction Memory (AM) located at each node, instead of shared memory. Gupta defines AM as large secondary or tertiary caches storing multiple pages of data which replace main memory at each node and provide a Cache-Only Multiprocessor. A page is defined as being up to 4K bytes of sequential data or instructions. A page of data is not assigned to any specific node, but can be located in the secondary or tertiary cache at any node which has read that page from disk storage. This complicates the directories and the copying of data to various nodes. Each processing node is assigned as a "home" node to a set of physical addresses to track with its portion of the distributed directory. Since each cache data line does not usually reside at the home node having the directory which is tracking it, Gupta requires four network messages to access any cache line from a requesting node. The requesting node sends the read request over the network to the home node first. The home node accesses its directory to find the "master" node; i.e., the node which has the master copy of the requested data. The home node then sends the read request across the network a second time to the master node. The master node returns a copy of the requested data over the network to the requesting node. The requesting node then sends an acknowledgement message to the home node to verify that it has received the requested data, and the home node records in its directory that the requesting node has a copy of the data line. The present invention differs in that it is more efficient, having statically assigned shared memory at each node and requiring only two network messages to access any cache line.
A read request goes to the node implementing the shared memory location, the data is accessed and returned while the directory is updated in parallel.
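By way of illustration only, the four-message read sequence described for Gupta and the two-message read sequence described for the present invention may be compared as follows. The function names and the mapping of contiguous address ranges to nodes (`lines_per_node`) are assumptions made for this sketch; the description above states only that shared memory is statically assigned to each node.

```python
def gupta_style_read(requester, home, master):
    """Network messages (source, destination, kind) for one remote read."""
    return [
        (requester, home, "read request"),
        (home, master, "forwarded read request"),
        (master, requester, "data"),
        (requester, home, "acknowledgement"),
    ]


def static_home_read(requester, addr, lines_per_node, directory):
    """Read with statically assigned memory; directory maps address -> node set."""
    home = addr // lines_per_node  # hypothetical static address-to-node mapping
    if home == requester:
        return []  # local portion of shared memory: no network messages
    # The home node records the new copy while the data is being returned.
    directory.setdefault(addr, set()).add(requester)
    return [
        (requester, home, "read request"),
        (home, requester, "data"),
    ]
```

In this sketch the node holding the data is also the node holding its directory entry, so no forwarding or acknowledgement message is needed.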
It is the object of this invention to provide an improved method and apparatus for maintaining cache coherency in a tightly coupled system.
It is a further object of the invention to maintain cache coherency over a network operating in full parallel mode through use of a write-thru cache, invalidation of obsolete data, and a distributed directory.
It is a further object of this invention to provide a tightly coupled system whereby each processing node contains a portion of the shared memory space, and wherein any node can access its local portion of shared memory or the remote portion of shared memory contained at other nodes over the network in the most expedient manner.
It is a further object of this invention to provide a directory-based cache coherency approach using a write-thru cache, invalidation of obsolete data, and a distributed directory whereby cache coherency is maintained over a network without performing broadcasts or multicasts over the network.
It is a further object of this invention to enable normal SMP performance enhancement techniques, such as caching and multi-threading, to be used with SMPs when operating over multi-stage networks.
It is a further object of this invention to support the reading and invalidation of cache lines from remote nodes over the network by implementing six different FIFOs in the network adapter for expediting remote fetches, remote stores, and invalidations over the network.
It is a further object of this invention to mark shared memory areas as containing changeable or unchangeable data, and to mark each data double-word as being changeable or unchangeable data for the purpose of providing a more efficient cache coherent system.
It is a further object of this invention to provide a small and efficient set of special-purpose messages for transmission across the network for requesting remote data, invalidating remote data, storing remote data, and responding to remote read requests.