1. Field of Invention
The present invention generally relates to cache coherence for multiprocessor data processing systems, and more particularly to cache coherence for a plurality of multiprocessor nodes, each node having a snoopy bus protocol.
2. Discussion of the Background Art
Multiprocessor architectures are classified according to types of address space and memory organization. Address space architecture classifications are based upon the mechanism by which processors communicate. Processors communicate either by explicit messages sent directly from one processor to another or by access through shared-memory address space. The first classification is called a message passing architecture while the second is a shared-memory architecture.
Memory organization is classified as centralized or distributed. In a centralized memory organization, the entire memory is located concentrically or symmetrically with respect to each processor in the system. Thus, each processor has equivalent access to a given memory location. In a distributed organization, on the other hand, each processor within the multiprocessor system has an associated memory that is physically located near the processor; furthermore, every processor is capable of directly addressing its own memory as well as the remote memories of the other processors. A distributed, shared-memory system is known as a distributed shared-memory (DSM) or a non-uniform memory access (NUMA) architecture.
DSM architecture provides a single shared address space to the programmer where all memory locations may be accessed by every processor. As there is no need to distribute data or explicitly communicate data between the processors in software, the burden of programming a parallel machine is lighter in a DSM model. In addition, by dynamically partitioning the work, DSM architecture makes it easier to balance the computational load between processors. Finally, as shared memory is the model provided on small-scale multiprocessors, DSM architecture facilitates the portability of programs parallelized for a small system to a larger shared-memory system. In contrast, in a message-passing system, the programmer is responsible for partitioning all shared data and managing communication of any updates.
The prior art provides numerous examples of DSM architectures. However, such systems communicate through high-bandwidth buses or switching networks, and the shared memory increases data latency. Latency is defined as the time required to access a memory location within the computer, and is the bottleneck that impedes system performance in multiprocessor systems. Latency is decreased in DSM systems by memory caching and hardware cache coherence.
Caching involves placing high-speed memory adjacent to a processor where the cache is hardware rather than software controlled. The cache holds data and instructions that are frequently accessed by the processor. A cache system capitalizes on the fact that programs exhibit temporal and spatial locality in their memory accesses. Temporal locality refers to the propensity of a program to again access a location that was recently accessed, while spatial locality refers to the tendency of a program to access variables at locations near those that were recently accessed.
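By way of illustration only, the effect of temporal and spatial locality on a cache may be sketched with a small direct-mapped cache model. The line size, the number of lines, and the function name `hit_rate` are assumptions chosen for the example and do not describe any particular hardware.

```python
# Illustrative only: a tiny direct-mapped cache model. All sizes are assumed.
LINE_WORDS = 4   # words per cache line (assumption)
NUM_LINES = 8    # lines in the cache (assumption)

def hit_rate(addresses):
    """Return the fraction of accesses that hit in a direct-mapped cache."""
    tags = [None] * NUM_LINES           # memory line currently held in each slot
    hits = 0
    for addr in addresses:
        line_addr = addr // LINE_WORDS  # memory line containing this word
        index = line_addr % NUM_LINES   # cache slot that line maps to
        if tags[index] == line_addr:
            hits += 1                   # line is already cached
        else:
            tags[index] = line_addr     # miss: fetch the line, evict the old one
    return hits / len(addresses)

sequential = list(range(32))            # neighbouring words: spatial locality
repeated = [0, 1, 2, 3] * 8             # same words again: temporal locality
scattered = [i * LINE_WORDS * NUM_LINES for i in range(32)]  # neither kind

print(hit_rate(sequential))  # 0.75
print(hit_rate(repeated))    # 0.96875
print(hit_rate(scattered))   # 0.0
```

In this model the sequential pattern misses once per line and then hits on the remaining words of that line, the repeated pattern misses only on its first access, and the scattered pattern maps every access to the same slot with a different tag and therefore never hits.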
Cache latency is typically several times lower than that of main system memory. Lower latency results in improved speed of the computer system. Caching is especially important in multiprocessor systems, where memory latency is higher because such systems are physically larger, but caching does introduce coherence problems between the independent caches. In a multiprocessor system, it becomes necessary to ensure that when a processor requests data from memory, the processor receives the most up-to-date copy of the data to maintain cache coherence.
Protocols incorporated in hardware have been developed to maintain cache coherence. Most small-scale multiprocessor systems maintain cache coherence with a snoopy protocol. This protocol relies on every processor monitoring (or “snooping”) all requests to memory. Each cache independently determines if accesses made by another processor require an update. Snoopy protocols are usually built around a central bus (a snoopy bus). Snoopy bus protocols are very common, and many small-scale systems utilizing snoopy protocols are commercially available.
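The snooping behavior described above may be sketched, purely as an example, with a simplified write-invalidate scheme over a write-through memory. The class names `SnoopyCache` and `SnoopyBus` are hypothetical, and the write-invalidate, write-through policy is an assumption made for brevity; commercial snoopy protocols track more states than this.

```python
class SnoopyCache:
    """One processor's cache; it observes every request placed on the bus."""
    def __init__(self, name):
        self.name = name
        self.lines = {}   # address -> value, for lines that are currently valid

    def snoop(self, op, addr, source):
        # React only to writes issued by some other cache.
        if op == "write" and source is not self:
            self.lines.pop(addr, None)   # drop the now-stale copy

class SnoopyBus:
    """Central bus: every request is broadcast to all attached caches."""
    def __init__(self):
        self.memory = {}
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def read(self, cache, addr):
        if addr not in cache.lines:              # miss: request goes on the bus
            for c in self.caches:
                c.snoop("read", addr, cache)
            cache.lines[addr] = self.memory.get(addr, 0)
        return cache.lines[addr]

    def write(self, cache, addr, value):
        for c in self.caches:                    # broadcast so others invalidate
            c.snoop("write", addr, cache)
        cache.lines[addr] = value
        self.memory[addr] = value                # write-through (assumption)

bus = SnoopyBus()
a, b = SnoopyCache("A"), SnoopyCache("B")
bus.attach(a)
bus.attach(b)
bus.write(a, 0x40, 1)
first = bus.read(b, 0x40)        # B caches the line
bus.write(a, 0x40, 2)            # B snoops the write and invalidates its copy
second = bus.read(b, 0x40)       # B misses and fetches the new value
```

Because every cache must observe every request, the work done per request grows with the number of attached caches, which is the scaling limit discussed next.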
To increase the processing power of computer systems, manufacturers have attempted to add more processing units to existing systems. When additional microprocessors are connected to the main bus to help share the workload, processing power is added linearly to the system while maintaining the cost-performance of the uni-processor. In such systems, however, bus bandwidth becomes the limiting factor in system performance since performance decreases rapidly with an increase in the number of processors.
In order to overcome the scaling problem of bus-based cache coherence protocols, directory-based protocols have been designed. In directory-based systems, the state of each memory line is kept in a directory. The directory is distributed with memory such that the state of a memory line is attached to the memory where that line resides. The caches are kept coherent by a point-to-point cache coherence protocol involving the memory system and all the processor caches.
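A directory of the kind described above may be sketched as follows. The state names, the constant `LINES_PER_NODE`, and the functions `home_node`, `read_request`, and `write_request` are hypothetical and serve only to show that the state of a line is kept at the node holding that line and that invalidations are addressed to specific sharers rather than broadcast.

```python
from dataclasses import dataclass, field

LINES_PER_NODE = 1024   # assumed number of memory lines owned by each node

@dataclass
class DirectoryEntry:
    state: str = "UNCACHED"   # assumed states: UNCACHED, SHARED, EXCLUSIVE
    sharers: set = field(default_factory=set)

class Directory:
    """Directory slice kept with the memory of one node."""
    def __init__(self):
        self.entries = {}

    def entry(self, line):
        return self.entries.setdefault(line, DirectoryEntry())

def home_node(line):
    """Node whose memory, and therefore whose directory, holds this line."""
    return line // LINES_PER_NODE

def read_request(directories, line, requester):
    # Simplification: a read of an EXCLUSIVE line is not modelled separately.
    e = directories[home_node(line)].entry(line)
    e.sharers.add(requester)
    e.state = "SHARED"
    return e

def write_request(directories, line, requester):
    e = directories[home_node(line)].entry(line)
    to_invalidate = e.sharers - {requester}   # point-to-point targets only
    e.sharers = {requester}
    e.state = "EXCLUSIVE"
    return to_invalidate

directories = [Directory() for _ in range(4)]
read_request(directories, 2050, requester=0)
read_request(directories, 2050, requester=3)
targets = write_request(directories, 2050, requester=0)   # only node 3 is notified
```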
U.S. Pat. No. 5,029,070 to McCarthy et al. discloses a method for maintaining cache coherence by storing a plurality of cache coherency status bits with each addressable line of data in the caches. McCarthy et al. specifically rejects storing the plurality of cache coherency status bits in the global memory. A plurality of state lines are hardwired to the bus master logic and bus monitor logic in each cache. The state lines are ORed so that all the states of all the same type of cache coherency bits in every cache except for the line undergoing a cache miss appear on the state line. This allows the bus master to rapidly determine if any other cache has a copy of the line being accessed because of a cache miss.
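The ORed state lines described above may be modelled, as an illustration only, by combining the status bits held by every cache other than the one undergoing the miss. The bit assignments `SHARED` and `MODIFIED` and the function `state_lines` are hypothetical; the source does not name the status bits used by McCarthy et al.

```python
SHARED, MODIFIED = 0b01, 0b10   # hypothetical status bit assignments

def state_lines(cache_status, requester, addr):
    """OR together the status bits that every other cache holds for addr."""
    lines = 0
    for cache_id, status in cache_status.items():
        if cache_id != requester:           # the missing cache is excluded
            lines |= status.get(addr, 0)    # absent line contributes no bits
    return lines

# Per-cache status bits, keyed by cache id and then by line address.
cache_status = {
    0: {0x100: SHARED},
    1: {},
    2: {0x100: MODIFIED},
}
seen_by_1 = state_lines(cache_status, requester=1, addr=0x100)
```

A nonzero result tells the requesting bus master in a single step that some other cache holds a copy of the line.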
U.S. Pat. No. 5,297,269 to Donaldson et al. discloses a system for point-to-point cache coherence in a multi-node processing system where the coherency is maintained by each main memory module through a memory directory resident on the individual memory module. The memories and nodes are coupled together by means of a cross bar switch unit coupled point-to-point to one or more main memory modules. The memory directory of each main memory module contains a plurality of coherency state fields for each data block within the module. Each main memory module maintains the coherency between nodes. The module queries its own directory upon each data transfer operation that affects the coherency state of a data block.
Sequent (T. Lovett and R. Clapp, StiNG, “A CC-NUMA Computer System for the Commercial Marketplace,” Proceedings of the 23rd International Symposium on Computer Architecture, pages 308-317, May 1996) and Data General (R. Clark and K. Alnes, “An SCI Interconnect Chipset and Adapter,” Symposium Record, Hot Interconnects IV, pages 221-235, August 1996) disclose machines that interconnect multiple quad Pentium Pro nodes into a single shared-memory system. These two systems both utilize an SCI-based interconnect, a micro-coded controller, and a large per-node cache. The use of the SCI coherence protocol prevents close coupling of the inter-node coherence mechanism to the intra-node (snoopy) coherence mechanisms. The mismatch between the two protocols requires the use of a large L3 (node-level) cache to store the coherence tag information required by the SCI protocol, to correct the mismatch of cache line size, and to adapt the coherence abstraction presented by the processing node to that required by SCI. In addition, the complexity of the SCI coherence protocol invariably leads to programmable implementations that are unable to keep up with the pipeline speed of the processor bus, and that can only process one request at a time. The result is a coherence controller that is large, expensive, and slow.
What is needed is an inter-node coherence mechanism that is simple, fast, and well-matched to the pipelined snoopy protocol. Such a mechanism can be very tightly coupled to the processor bus and can thus achieve higher performance at lower cost.
This invention includes the cache coherence protocol for a sparse directory in combination with multiprocessor nodes, each node having a memory and a data bus operating under a pipelined snoopy bus protocol. In addition, the invention has the following features: the current state and information from the incoming bus request are used to make an immediate decision on actions and next state; the decision mechanism for outgoing coherence is pipelined to follow the bus; and the incoming coherence pipeline acts independently of the outgoing coherence pipeline.
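The immediate decision described above, in which the current state and the incoming bus request together determine the actions and the next state, may be pictured as a single table lookup. The state names, request types, and action names below are hypothetical; the source does not enumerate them, and the table is a sketch rather than the protocol of the invention.

```python
# Hypothetical states, request types, and actions, chosen only for illustration.
DECISION_TABLE = {
    ("INVALID", "READ"): (("fetch_remote",), "SHARED"),
    ("INVALID", "WRITE"): (("fetch_remote", "invalidate_remote"), "MODIFIED"),
    ("SHARED", "READ"): ((), "SHARED"),
    ("SHARED", "WRITE"): (("invalidate_remote",), "MODIFIED"),
    ("MODIFIED", "READ"): ((), "MODIFIED"),
    ("MODIFIED", "WRITE"): ((), "MODIFIED"),
}

def decide(state, request):
    """Return (actions, next_state) from one lookup, with no iteration."""
    return DECISION_TABLE[(state, request)]

actions, next_state = decide("SHARED", "WRITE")
```

Because the decision is a fixed lookup with no loops or microcode, it is the kind of step that can be evaluated once per bus request and can therefore follow a pipelined bus.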
The invention implements the cache coherence protocol within a cache coherence unit for use in a data processing system. The data processing system comprises multiple nodes, each node having a plurality of processors with associated caches, a memory, and input/output. The processors within the node are coupled to a memory bus operating according to a “snoopy” protocol.
Multiple nodes are coupled together using an interconnection network, with the mesh coherence unit acting as a bridge between the processor/memory bus and the interconnection network. The mesh coherence unit is attached to the processor/memory bus and the interconnection network. In addition, it has a coherence directory attached to it. This directory keeps track of the state information of the cached memory locations of the node memory. The mesh coherence unit follows bus transactions on the processor/memory bus, looks up cache coherence state in the directory, and exchanges messages with other mesh coherence units across the interconnection network as required to maintain cache coherence.
The invention incorporates close coupling of the pipelined snoopy bus to the sparse directory. In addition, the invention incorporates dual coherence pipelines. The purpose of having dual coherence pipelines is to be able to service network requests and bus requests at the same time in order to increase performance. Finally, the invention incorporates a coherence protocol where all protocol interactions have clearly defined beginnings and endings. The protocol interactions end the transactions on a given line before the interaction of a new line may begin. This process is achieved by the invention keeping track of all the transient states within the system.
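Tracking of transient states may be sketched as a table of lines that have an interaction in progress. The class `TransientTracker`, its method names, and the transient state names are hypothetical, and the sketch assumes one reading of the passage above, namely that a new interaction on a line is refused until the interaction already in progress on that line has ended.

```python
class TransientTracker:
    """Records which memory lines currently have an interaction in flight."""
    def __init__(self):
        self.pending = {}   # line -> name of its transient state

    def begin(self, line, transient_state):
        """Start an interaction; return False if the line is still busy."""
        if line in self.pending:
            return False                 # caller must retry after end()
        self.pending[line] = transient_state
        return True

    def end(self, line):
        """Mark the interaction on this line as complete."""
        self.pending.pop(line)

tracker = TransientTracker()
started = tracker.begin(5, "AWAIT_DATA")     # line 5 enters a transient state
blocked = tracker.begin(5, "AWAIT_ACK")      # refused: line 5 is still busy
other = tracker.begin(6, "AWAIT_ACK")        # a different line is unaffected
tracker.end(5)
restarted = tracker.begin(5, "AWAIT_ACK")    # accepted once line 5 has ended
```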