1. Field of the Invention
Embodiments of the present invention relate, in general, to message passing and particularly to message passing in distributed shared memory architectures implementing hybrid cache coherence protocols.
2. Relevant Background
Parallel computing is the simultaneous execution of some combination of multiple instances of programmed instructions and data on multiple processors in order to obtain results faster. One approach to parallel (multithreaded) processing utilizes multiple processors accessing a shared memory system. Providing a fully cache-coherent shared memory space, however, is typically complex, and several different approaches have emerged to address this need. One such approach relies on distributed memory in which nodes can access only their local memory; data can be transferred between nodes only via explicit messages. Message passing systems are routinely scaled to thousands of nodes since all data transfer is explicitly managed by application software. In spite of their scalability, message passing systems are usually much more difficult to program than shared memory systems, requiring many more lines of code.
Distributed shared memory is essentially an architectural approach designed to overcome the scaling limitations of symmetric shared memory multi-processors while retaining a shared memory model for communication and programming. This is achieved by using a memory that is physically distributed but logically implements a single shared address space, allowing the processors to communicate through, and share the contents of, the entire memory. In addition to the sharing of data, distributed shared memory is also concerned with an interconnection network that can provide data to a requesting processor in an efficient and timely fashion. Bandwidth (the amount of data that can be supplied in a unit of time) and latency (the time it takes a node to receive the first piece of requested data from the time the request is issued) are both important. Distributed shared memory involves moving data dynamically across the memory layers of a distributed system. One approach to such movement is for the data to be uniquely mapped to a physical address in a cache coherent system. Data can be replicated, and directories track the multiple copies. The coherency of this data is maintained, typically by either hardware or software. Hardware cache coherence solutions often manage data at a much finer granularity (typically 64-byte blocks) than software solutions.
One approach to cache coherency, as is known to one skilled in the art, utilizes snoopy or snooping protocols. Snooping is the process whereby the individual caches monitor address lines for accesses to memory locations that they have cached. When a cache observes a write operation to a location of which it holds a copy, the cache controller invalidates its own copy of the snooped memory location. The basic idea is to enforce the property that, before a memory location is written, all other copies of the location that may be present in other caches are invalidated. Thus, the system allows multiple copies of a memory location to exist when it is being read, but only one copy when it is being written. When a processor wants to write into a cache block that may be shared, a snoopy protocol transmits the request to all other processors over the interconnection network, and all caches that have a copy of the cache block simply invalidate the copy. Unfortunately, the broadcasting of all miss requests required by snoopy schemes does not scale to large systems.
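The write-invalidate behavior described above can be illustrated with a minimal sketch. The names `Bus`, `SnoopyCache`, `snoop_write` and `broadcast_write` are hypothetical, and the write-through memory update is a simplifying assumption made only for brevity; the sketch is not a description of any particular hardware implementation.

```python
class Bus:
    """Hypothetical shared interconnect that every cache snoops."""

    def __init__(self):
        self.memory = {}      # address -> value (backing store)
        self.caches = []      # all caches attached to the bus
        self.broadcasts = 0   # number of write broadcasts issued

    def attach(self, cache):
        self.caches.append(cache)

    def broadcast_write(self, writer, addr):
        # Every write is announced to all other caches; this is the
        # broadcast that limits scalability in large systems.
        self.broadcasts += 1
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_write(addr)


class SnoopyCache:
    """Hypothetical cache that invalidates its copy on a snooped write."""

    def __init__(self, bus):
        self.bus = bus
        self.lines = {}       # address -> locally cached value
        bus.attach(self)

    def read(self, addr):
        # On a miss, fetch the value from memory; many caches may then
        # hold a copy at the same time.
        if addr not in self.lines:
            self.lines[addr] = self.bus.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        # Invalidate all other copies before the write takes effect.
        self.bus.broadcast_write(self, addr)
        self.lines[addr] = value
        self.bus.memory[addr] = value  # assumption: write-through

    def snoop_write(self, addr):
        # Drop the local copy if one is present.
        self.lines.pop(addr, None)
```

In this sketch, multiple caches may hold a location while it is only being read, but after any write only the writer's copy remains.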
Directory based protocols are another approach to cache coherency as known in the art. Directory based schemes rely on an extra structure called the directory that tracks which processors have cached any given block in main memory. To maintain coherence the state of each cache block is tracked in the cache and additional information is kept in the directory for each block. The directory is nothing more than a piece of memory or a table on a node that holds information about the memory of that node. A simple protocol operates with the three states of invalid, shared or exclusive. Unlike the snoopy system, the directory based protocol system obtains the information about which processors are sharing a copy of the data from a known location rather than interrogating all the processors by a broadcast.
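By way of illustration only, a directory of the kind described above might be sketched as follows. The class names `State` and `Directory`, the per-block entry layout, and the choice to let a former exclusive owner retain a shared copy after a read by another node are all assumptions of this sketch rather than features of any specific protocol.

```python
from enum import Enum


class State(Enum):
    INVALID = "invalid"
    SHARED = "shared"
    EXCLUSIVE = "exclusive"


class Directory:
    """Hypothetical per-node table tracking which nodes cache each block."""

    def __init__(self):
        self.entries = {}  # block -> [State, set of caching nodes]

    def _entry(self, block):
        return self.entries.setdefault(block, [State.INVALID, set()])

    def read(self, block, node):
        """Record a read; return the nodes that must give up exclusivity."""
        entry = self._entry(block)
        state, sharers = entry
        if state is State.EXCLUSIVE and node in sharers:
            return set()  # the exclusive owner re-reads its own block
        downgraded = set(sharers) if state is State.EXCLUSIVE else set()
        sharers.add(node)
        entry[0] = State.SHARED
        return downgraded

    def write(self, block, node):
        """Record a write; return the nodes whose copies must be invalidated."""
        entry = self._entry(block)
        to_invalidate = entry[1] - {node}
        entry[0] = State.EXCLUSIVE
        entry[1] = {node}
        return to_invalidate
```

The point of the sketch is that the set of sharers is looked up at a known location, so invalidations are sent only to the nodes returned by `write` rather than broadcast to every processor.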
Distributed directory protocols are therefore a cache-coherency architecture that builds on the directory concept but distributes the directory just as a block of memory is distributed. Although simple in concept, this approach introduces many complexities due to the use of messages. Since few of the protocol actions can be atomic, the protocol is implemented by sending messages among: 1) a requesting processor node (the requester “R”), also known as the local node; 2) the node containing the address of the data block that the local node desires to read or write (also known as the home node “H”); and 3) a remote node that contains the cache block when it is in the exclusive state (sometimes referred to as the target node “T”). Thus at least two messages are required: a first message from the local node to the home node to request a cache block, and a second message from the home node to the local node to reply with the data.
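The message exchange among the requester, home and target nodes can be sketched as below. The function name `read_miss_messages` and the message labels are hypothetical. The two-message case follows the description above; the four-message sequence shown when a target node holds the block exclusively is one common arrangement and is an assumption of this sketch, since other protocols forward the data directly from the target node to the requester.

```python
def read_miss_messages(requester, home, target=None):
    """Return (source, destination, kind) tuples for one read miss.

    `target` names the remote node holding the block in the exclusive
    state, or None when the home node can supply the data itself.
    """
    messages = [(requester, home, "read_request")]
    if target is None:
        # Minimum case: one request and one reply.
        messages.append((home, requester, "data_reply"))
    else:
        # Assumed variant: the home node recalls the block from the
        # target node before replying to the requester.
        messages.append((home, target, "fetch"))
        messages.append((target, home, "data_writeback"))
        messages.append((home, requester, "data_reply"))
    return messages
```

For example, `read_miss_messages("R", "H")` yields the two messages described above, while `read_miss_messages("R", "H", "T")` yields four under the assumed variant.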
Coherent shared memory support in message passing systems is generally realized by emulating it in software and/or by compiler directed coherence. These techniques have limitations in applicability and performance. Compiler based coherence is problematic for system software and commercial applications, such as databases. Nonetheless, a number of software shared memory schemes have been proposed. These schemes provide a software implementation of coherence protocols but vary in the extent of application/binary modification, kernel/user-level implementation, granularity of coherence and other system specific performance optimizations.
As will be recognized by those skilled in the relevant art, software distributed shared memory schemes are generally inefficient because they maintain coherence at the granularity of pages and/or require extra instructions to perform shared load or store operations. Maintaining coherence at page granularity is inefficient because it increases false sharing, which in turn increases cache misses and coherence traffic.
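The effect of coherence granularity on false sharing can be illustrated with a short sketch. The function name `same_coherence_unit` is hypothetical, and the 4096-byte page size is an assumed, typical value; the 64-byte block size is the hardware granularity mentioned earlier in this background.

```python
PAGE_BYTES = 4096   # assumption: a typical page size
BLOCK_BYTES = 64    # hardware block size cited in the background


def same_coherence_unit(addr_a, addr_b, unit_bytes):
    """Return True if both addresses fall in the same coherence unit."""
    return addr_a // unit_bytes == addr_b // unit_bytes


# Two unrelated variables placed 2 KiB apart.
var_x = 0x1000
var_y = 0x1800

# Tracked per page, a write to var_x also invalidates var_y (false sharing).
falsely_shared_at_page = same_coherence_unit(var_x, var_y, PAGE_BYTES)

# Tracked per 64-byte block, the two variables are independent.
falsely_shared_at_block = same_coherence_unit(var_x, var_y, BLOCK_BYTES)
```

Under these assumptions the two variables share a coherence unit at page granularity but not at block granularity, which is why page-based schemes tend to generate additional misses and coherence traffic.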