Symmetric multiprocessing (SMP) computer architectures are known in the art as overcoming the limitations of single or uniprocessors in terms of processing speed and transaction throughput, among other things. Typical, commercially available SMP systems are generally "shared memory" systems, characterized in that multiple processors on a bus, or a plurality of busses, share a single global memory. In shared memory multiprocessors, all memory is uniformly accessible to each processor, which simplifies the task of dynamic load distribution. Processing of complex tasks can be distributed among various processors in the multiprocessor system while data used in the processing is substantially equally available to each of the processors undertaking any portion of the complex task. Similarly, programmers writing code for typical shared memory SMP systems do not need to be concerned with issues of data partitioning, as each of the processors has access to and shares the same, consistent global memory.
However, SMP systems suffer disadvantages in that system bandwidth and scalability are limited. Although multiprocessor systems may be capable of executing many millions of instructions per second, the shared memory resources and the system bus connecting the multiprocessors to the memory present a bottleneck as complex processing loads are spread among more processors, each needing access to the global memory. As the complexity of software running on SMPs increases, resulting in a need for more processors in a system to perform complex tasks or portions thereof, the demand for memory access increases accordingly. Thus more processors do not necessarily translate into faster processing, i.e. typical SMP systems are not scalable. That is, processing performance actually decreases at some point as more processors are added to the system to process more complex tasks. The decrease in performance is due to the bottleneck created by the increased number of processors needing access to the memory and the transport mechanism, e.g. bus, to and from memory.
Alternative architectures are known which seek to relieve the bandwidth bottleneck. Computer architectures based on Cache Coherent Non-Uniform Memory Access (CCNUMA) are known in the art as an extension of SMP that supplants SMP's "shared memory architecture." CCNUMA architectures are typically characterized as having distributed global memory. Generally, CCNUMA machines consist of a number of processing nodes connected through a high bandwidth, low latency interconnection network. The processing nodes are each comprised of one or more high-performance processors, associated cache, and a portion of a global shared memory. Each node or group of processors has near and far memory, near memory being resident on the same physical circuit board, directly accessible to the node's processors through a local bus, and far memory being resident on other nodes and being accessible over a main system interconnect or backbone. Cache coherence, i.e. the consistency and integrity of shared data stored in multiple caches, is typically maintained by a directory-based, write-invalidate cache coherency protocol, as known in the art. To determine the status of caches, each processing node typically has a directory memory corresponding to its respective portion of the shared physical memory. For each line or discrete addressable block of memory, the directory memory stores an indication of remote nodes that are caching that same line.
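The directory memory described above can be illustrated with a minimal sketch. The class names, the use of a Python set for the list of caching nodes, and the simplified read path are assumptions made for illustration only; they are not taken from any particular CCNUMA implementation.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryEntry:
    """Directory state for one memory line (hypothetical layout)."""
    sharers: set = field(default_factory=set)  # ids of nodes caching the line
    dirty: bool = False                        # True when one node holds it modified


class HomeDirectory:
    """Directory for the portion of shared memory homed on one node."""

    def __init__(self):
        self.entries = {}

    def _entry(self, line):
        return self.entries.setdefault(line, DirectoryEntry())

    def read(self, line, node):
        """Record that `node` now caches `line`."""
        entry = self._entry(line)
        entry.dirty = False  # simplification: any owner is assumed to have written back
        entry.sharers.add(node)

    def write(self, line, node):
        """Write-invalidate: return the nodes whose copies must be invalidated."""
        entry = self._entry(line)
        to_invalidate = sorted(entry.sharers - {node})
        entry.sharers = {node}
        entry.dirty = True
        return to_invalidate
```

For example, if nodes 1 and 2 have both read a line and node 1 then writes it, `write` returns `[2]`, the only remote copy that needs invalidating.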
One known implementation of the CCNUMA architecture is in a scalable, shared memory multiprocessor system known as "DASH" (Directory Architecture for SHared memory), developed at the Computer Systems Laboratory at Stanford University. The DASH architecture, described in The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor, Lenoski et al., Proceedings of the 17th Int'l Symp. Computer Architecture, IEEE CS Press, 1990, pp 148-159, which is incorporated herein by reference, consists of a number of processing nodes connected through a high-bandwidth, low-latency interconnection network. As is typical in CCNUMA machines, the physical memory is distributed among the nodes of the multiprocessor, with all memory accessible to each node. Each processing node consists of: a small number of high-performance processors; their respective individual caches; a portion of the shared memory; a common cache for pending remote accesses; and a directory controller interfacing the node to the network.
A weakly ordered memory consistency model is implemented in DASH, which puts a significant burden relating to memory consistency on software developed for the DASH system. In effecting memory consistency in the DASH implementation of CCNUMA architecture, a "release consistency" model is implemented, which is characterized in that memory operations issued by a given processor are allowed to be observed and completed out of order with respect to other processors. Ordering of memory operations is only effected under limited circumstances. Protection of variables in memory is left to the programmer developing software for the DASH multiprocessor, as under the DASH release consistency model the hardware only ensures that memory operations are completed prior to releasing a lock on the pertinent memory. Accordingly, the release consistency model for memory consistency in DASH is a weakly ordered model. It is generally accepted that the DASH model for implementing memory correctness significantly complicates programming and cache coherency.
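The guarantee described above, namely that outstanding memory operations complete before a lock is released, can be sketched with a toy model. The class name, the pending-write list, and the dictionary standing in for globally visible memory are all hypothetical; the sketch models only the ordering rule, not the DASH hardware.

```python
class WeaklyOrderedProcessor:
    """Toy model of a processor under a release consistency rule."""

    def __init__(self, memory):
        self.memory = memory  # shared dict standing in for globally visible memory
        self.pending = []     # writes issued but not yet visible to other processors

    def write(self, addr, value):
        # An ordinary write may be buffered; others need not observe it yet.
        self.pending.append((addr, value))

    def release(self, lock_addr):
        # All outstanding writes are completed before the lock is released.
        for addr, value in self.pending:
            self.memory[addr] = value
        self.pending.clear()
        self.memory[lock_addr] = 0  # 0 marks the lock as free in this sketch
```

In this model a reader that inspects `memory` without first acquiring the lock may still see the old value, which is the burden on software that the text refers to.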
A bus-based snoopy scheme, as known in the art, is used to keep caches coherent within a node on the DASH system, while inter-node cache consistency is maintained using directory memories to effect a distributed directory-based coherence protocol. In DASH, each processing node has a directory memory corresponding to its portion of the shared physical memory. For each memory block, the directory memory stores the identities of all remote nodes caching that block. Using the directory memory, a node writing a location can send point-to-point invalidation or update messages to those processors that are actually caching that block. This is in contrast to the invalidating broadcast required by the snoopy protocol. The scalability of DASH depends on this ability to avoid broadcasts on an inter-node basis.
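The difference in message traffic between a broadcast and directory-driven point-to-point invalidation can be sketched as follows. The function names and the integer node ids are illustrative assumptions.

```python
def broadcast_messages(num_nodes, writer):
    """Broadcast invalidation: every node other than the writer is messaged."""
    return [n for n in range(num_nodes) if n != writer]


def directed_messages(sharers, writer):
    """Directory-based invalidation: only nodes recorded as caching the block."""
    return sorted(n for n in sharers if n != writer)
```

With 16 nodes and a block cached by nodes 0, 3 and 7, a write by node 0 costs 15 messages under broadcast but only 2 when the directory is consulted.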
The DASH architecture relies on the point-to-point invalidation or update mechanism to send messages to processors that are caching data that needs to be updated. All coherence operations, e.g. invalidates and updates, are issued point-to-point, sequentially, and must be positively acknowledged in a sequential manner by each of the remote processors before the issuing processor can proceed with an operation. This DASH implementation significantly and negatively affects performance and commercial applicability. As acknowledged in the above-referenced publication describing DASH, serialization in the invalidate mechanism negatively affects performance by increasing queuing delays and thus the latency of memory requests.
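The cost of serialization as characterized above can be sketched with a simple latency model. The function names and the notion of a per-node round-trip time are assumptions for illustration; the overlapped variant is a hypothetical contrast and is not described in the text.

```python
def serialized_invalidate_latency(round_trip_times):
    """Each invalidate waits for the previous acknowledgement, so delays add up."""
    return sum(round_trip_times)


def overlapped_invalidate_latency(round_trip_times):
    """Hypothetical contrast: all invalidates in flight at once, wait for the slowest."""
    return max(round_trip_times, default=0)
```

For round trips of 3, 5 and 2 time units, the serialized scheme takes 10 units before the writer can proceed, whereas the overlapped contrast would take 5.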
DASH provides "fences" which can be placed by software to stall processors until pending memory operations have been completed, or which can be implemented to delay write operations until the completion of a pending write. The DASH CCNUMA architecture generally presents an environment wherein a significant burden is placed on software developers to ensure the protection and consistency of data available to the multiple processors in the system.
The DASH architecture, and more specifically the memory consistency and cache coherency mechanisms, also disadvantageously introduce opportunities for livelock and deadlock situations which may, respectively, significantly delay or terminally lock processor computational progress. The multiple processors in DASH are interconnected at the hardware level by two mesh networks, one to handle incoming messages, and the other to handle outgoing communications. However, the consumption of an incoming message may require the generation of an outgoing message, which can result in circular dependencies between limited buffers in two or more nodes, which can cause deadlock.
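A circular dependency between buffers of the kind described above can be sketched as a cycle in a "waits on" graph. The buffer names and the dictionary representation are hypothetical; the function is an ordinary depth-first cycle check, not a mechanism of the DASH hardware.

```python
def has_circular_wait(waits_on):
    """Return True if the buffer dependency graph contains a cycle.

    `waits_on` maps each buffer to the buffers that must accept a message
    before it can drain.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {}

    def visit(buf):
        color[buf] = GREY
        for nxt in waits_on.get(buf, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:
                return True  # reached a buffer already on the current path
            if state == WHITE and visit(nxt):
                return True
        color[buf] = BLACK
        return False

    return any(color.get(b, WHITE) == WHITE and visit(b) for b in list(waits_on))
```

Two nodes whose full input buffers each need space in their own output buffer, which in turn needs space in the other node's input buffer, form such a cycle; neither can make progress.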
DASH further dedicates the meshes for particular service: the first mesh to handle communications classified as request messages, e.g. read and read-exclusive requests and invalidation requests, and the second mesh to handle reply messages, e.g. read and read-exclusive replies and invalidation acknowledges, in an effort to eliminate request-reply circular dependencies. However, request-request circular dependencies still present a potential problem, which is provided for in the DASH implementation by increasing the size of input and output FIFOs, which does not necessarily solve the problem but may make it occur less frequently. The DASH architecture also includes a time-out mechanism that does not work to avoid deadlocks, but merely accommodates deadlocks by breaking them after a selected time period. Although the DASH implementation includes some hardware and protocol features aimed at eliminating processor deadlocks, heavy reliance on software for memory consistency, and hardware implementations that require express acknowledgements and incorporate various retry mechanisms, present an environment wherein circular dependencies can easily develop. Accordingly, forward progress is not optimized for in the DASH CCNUMA architecture.
The CCNUMA architecture is implemented in a commercial multiprocessor in a Sequent Computer Systems, Inc. machine referred to as "Sting" which is described in STING: A CCNUMA Computer System for the Commercial Marketplace, L. Lovett and R. Clapp, ISCA '96, May 1996, incorporated herein by reference. The Sting architecture is based on a collection of nodes consisting of complete Standardized High Volume (SHV), four processor SMP machines, each containing processors, caches, memories and I/O busses. Intra-processor cache coherency is maintained by a standard snoopy cache protocol, as known in the art. The SHVs are configured with a "bridge board" that interconnects the local busses of plural nodes and provides a remote cache which maintains copies of blocks fetched from remote memories. The bridge board interfaces the caches and memories on the local node with caches and memories on remote nodes. Inter-node cache coherency is managed via a directory based cache protocol, based on the Scalable Coherent Interface (SCI) specification, IEEE 1596. The SCI protocol, as known in the art, is implemented via a commercially available device that provides a linked list and packet level protocol for an SCI network. The chip includes FIFO buffers and Send and Receive queues. Incoming packets are routed onto appropriate Receive queues, while the Send queues hold request and response packets waiting to be inserted on an output link. Packets remain on the Send queues awaiting a positive acknowledgement or "positive" echo from the destination as an indication that the destination has room to accept the packet. If the destination does not have queue space to accept a packet, a negative echo is returned and subsequent attempts are made to send the packet using an SCI retry protocol.
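The echo and retry behavior of the Send queues described above can be sketched as follows. The class and function names, the fixed receiver capacity, and the `max_attempts` bound (added only so that the sketch terminates) are assumptions for illustration and are not taken from the SCI specification.

```python
from collections import deque


class Receiver:
    """Destination with a bounded Receive queue (hypothetical model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()

    def offer(self, packet):
        """Return True for a positive echo (accepted), False for a negative echo."""
        if len(self.queue) < self.capacity:
            self.queue.append(packet)
            return True
        return False


def send_with_retry(send_queue, receiver, max_attempts):
    """Send packets in order; a packet stays queued until it gets a positive echo."""
    attempts = 0
    while send_queue and attempts < max_attempts:
        attempts += 1
        if receiver.offer(send_queue[0]):
            send_queue.popleft()
        # On a negative echo the packet stays at the head and is retried.
    return attempts
```

If the receiver never drains its queue, the sender in this sketch keeps retrying the same packet until the attempt bound is reached, without making progress.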
The linked list implementation of the SCI based coherency mechanism presents a disadvantage in that the links must be traversed in a sequential or serial manner, which negatively impacts the speed at which packets are sent and received. The retry mechanism has the potential to create circular dependencies that can result in livelock or deadlock situations. The linked list implementation also disadvantageously requires significant amounts of memory, in this case remote cache memory, to store forward and back pointers necessary to effect the list.
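The serial traversal and the pointer storage noted above can be sketched with a doubly linked sharing list. The class and function names are hypothetical, and the list is built in the order given rather than according to any rule of the SCI protocol.

```python
class SharingListNode:
    """One sharer of a cache line, holding a forward and a back pointer."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.forward = None
        self.back = None


def build_sharing_list(node_ids):
    """Link the sharers in the given order and return the head of the list."""
    head = prev = None
    for node_id in node_ids:
        node = SharingListNode(node_id)
        if prev is None:
            head = node
        else:
            prev.forward = node
            node.back = prev
        prev = node
    return head


def invalidate_all(head):
    """Visit sharers one after another; each step needs the previous node's pointer."""
    visited = []
    node = head
    while node is not None:
        visited.append(node.node_id)
        node = node.forward
    return visited
```

Each sharer is reached only after its predecessor has been visited, and every entry carries two pointers per cached line, which illustrates both the serialization and the memory overhead.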
Machines based on CCNUMA architecture presently known in the art do not take into consideration to any great extent respective workloads of each of the multiple processors as the machines are scaled up, i.e. as more processors or nodes are added. Disadvantageously, as more processors are added in known CCNUMA multiprocessors, limited, if any, efforts are made to ensure that processing is balanced among the job processors sharing processing tasks. Moreover, in such systems, when related tasks are distributed across multiple nodes for processing, related data needed for processing tends to be spread across the system as well, resulting in an undesirably high level of data swapping in and out of system caches.
Methods and operating systems are known for improving efficiency of operation in multiprocessor systems by improving affinity of related tasks and data with a group of processors for processing with reduced overhead, such as described in commonly assigned U.S. patent application Ser. No. 08/187,665, filed Jan. 26, 1994, which is hereby incorporated herein by reference. Further, as described in commonly assigned U.S. patent application Ser. No. 08/494,357, filed Jun. 23, 1995, which is incorporated herein by reference, mechanisms are known for supporting memory migration and seamless integration of various memory resources of a NUMA multiprocessing system. However, known CCNUMA machines generally do not incorporate mechanisms in their architectures for such improvements in load balancing and scheduling.