Historically, large-scale parallel computer systems were constructed with specialized processors and customized interconnects and, consequently, were characterized by a high cost and a long time-to-market. Currently, multi-computer systems (e.g., clustered computer systems) are being built with standard processors and standard networks. By using standard components and networks, such multi-computer systems are cheaper to design and may be brought to market in a shorter time. Multi-computer systems consist of a parallel or distributed collection of whole computers (referred to herein as “nodes”) that cooperate to perform computing tasks. In general, a node may include one or more processors, a memory, input/output facilities, and an operating system. A cluster is a type of multi-computer system that may be used as a single, unified computing resource.
Many different shared memory processing systems have been developed. For example, symmetric multiprocessing (SMP) systems have been developed in which multiple processors on a bus, or a plurality of busses, share a single global memory. SMP machines execute only one copy of the operating system. While tasks can be given to different processors to perform, they cannot be given to different copies of the operating system. In shared memory multiprocessor systems, all memory is uniformly accessible to each processor, simplifying the task of dynamic load distribution. Complex tasks may be distributed among various processors in an SMP system, while the data used for processing is available to each of the processors in the system. In general, programmers writing code for such shared memory SMP systems need not be concerned with data partitioning issues because each of the processors has access to and shares the same, consistent global memory.
Multi-computer architectures based on cache coherent non-uniform memory access (CCNUMA) have been developed as an extension of the shared memory architecture of SMP systems. Shared memory multi-computer systems, unlike SMP systems, execute different copies of the operating system on each of the processors or groups of processors in the system. CCNUMA architectures typically are characterized by a distributed global memory. In general, CCNUMA machines consist of a number of processing nodes that are connected through a high bandwidth, low latency shared memory interconnection network. Each of the processing nodes includes one or more high-performance processors, each having an associated cache, and a portion of a global shared memory. Each node has a near memory and a far memory. Near memory is resident on the same physical circuit board as the node processors and is directly accessible to the node processors over a local memory bus. Far memory is resident on other nodes and is accessible over a main system interconnect. Cache coherence (i.e., the consistency and integrity of shared data stored in multiple caches) typically is maintained by a directory-based, write-invalidate cache coherency protocol. To determine the status of caches, each processing node typically has a directory memory corresponding to its respective portion of the shared physical memory. For each discrete addressable block of memory, the directory memory stores an indication of remote nodes that are caching that same block of memory.
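As an illustration only, the directory-based, write-invalidate bookkeeping described above may be sketched as follows. This is a minimal sketch under stated assumptions, not the protocol of any particular CCNUMA machine: the names HomeDirectory, DirectoryEntry, read, and write are hypothetical, invalidation messages are merely recorded in a list rather than sent over an interconnect, and block states beyond the set of sharing nodes are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryEntry:
    """Per-block record kept by the home node (hypothetical layout)."""
    sharers: set = field(default_factory=set)  # remote nodes caching the block


class HomeDirectory:
    """Directory for the portion of shared memory owned by one node."""

    def __init__(self):
        self.entries = {}        # block address -> DirectoryEntry
        self.invalidations = []  # (node, block) invalidate messages "sent"

    def _entry(self, block):
        return self.entries.setdefault(block, DirectoryEntry())

    def read(self, node, block):
        # A reading node obtains a cached copy, so record it as a sharer.
        self._entry(block).sharers.add(node)

    def write(self, node, block):
        # Write-invalidate: every other cached copy is invalidated
        # before the writer becomes the only node caching the block.
        entry = self._entry(block)
        for other in sorted(entry.sharers - {node}):
            self.invalidations.append((other, block))
        entry.sharers = {node}
```

In this sketch, if nodes 1 and 2 read a block and node 3 then writes it, the directory records invalidations for nodes 1 and 2 and leaves node 3 as the sole node caching that block.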
In general, when developing a multi-computer system it is desirable to provide a computing environment that may run a wide variety of existing application programs, including those that were developed for other parallel computing environments (e.g., an SMP computing environment), without requiring significant re-programming. The single address space of shared memory multi-computer systems increases the programmability of multiprocessors by reducing problems such as data partitioning and dynamic load distribution. The shared address space also provides better support for parallelizing compilers, standard operating systems, multiprogramming, and incremental tuning of parallel machines. One difficulty associated with shared memory multi-computer systems, however, involves synchronizing access to shared resources, particularly when an application program originally was coded under the assumption that it was the only application program having access to the system resources.
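By way of a simplified illustration of the synchronization difficulty, several workers that update a shared resource must serialize their updates. The following is a minimal sketch using threads within a single process as a stand-in for cooperating programs; the function run_workers and its parameters are hypothetical, and the lock shown is an ordinary in-process mutex rather than a mechanism of any particular multi-computer system.

```python
import threading


def run_workers(num_workers=4, increments=1000):
    """Increment a shared counter from several threads under one lock."""
    counter = {"value": 0}   # the shared resource
    lock = threading.Lock()  # serializes each read-modify-write

    def worker():
        for _ in range(increments):
            # A program written as if it were the sole user of the
            # resource would omit this lock, and updates could be lost.
            with lock:
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]
```

With the lock in place, the returned count equals the number of workers multiplied by the number of increments per worker.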