Large-scale shared-memory multi-processor computer systems typically have a large number of processing nodes (e.g., with one or more processors and local memory) that cooperate to perform a common task. For example, selected nodes on a multi-processor computer system may cooperate to multiply a complex matrix. To do this in a rapid and efficient manner, such computer systems typically divide the task into discrete parts, each of which is executed by one or more of the nodes.
When dividing a task, the nodes often share data. To that end, the processors within the nodes each may access memory that is local to many other processors. Those other processors could be in the same node, or in different nodes. For example, a microprocessor may retrieve data from the memory of another node (the data's “home node”) and hold a copy of that data locally, in a cache. Accordingly, rather than retrieving the data from the home node each time it is needed, the requesting microprocessor, as well as other processors, may access their locally held copies (cached copies) to execute their local functions.
Problems arise, however, when data that was retrieved and is held locally by a microprocessor subsequently changes, and that microprocessor has not been notified of the change. When that happens, the locally held data may no longer be accurate, potentially corrupting operations that rely upon the retrieved data. To mitigate these problems, computer systems that share data in this manner typically execute cache coherence protocols to ensure that locally held copies of the data are consistent with the data at the home node. These protocols generally require passing coherence messages from the home node to remote nodes.
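By way of illustration only, the following minimal sketch models one common style of cache coherence protocol, in which the home node records which remote nodes hold a copy of each datum and sends an invalidation message to each of them before a write takes effect. The class names `HomeNode` and `RemoteNode`, the `sharers` table, and the invalidate-on-write policy are assumptions made for this example; the text above does not specify any particular protocol.

```python
from dataclasses import dataclass, field


@dataclass
class RemoteNode:
    """Hypothetical node holding cached copies of data homed elsewhere."""
    name: str
    cache: dict = field(default_factory=dict)

    def invalidate(self, address):
        # Coherence message from the home node: drop the stale copy.
        self.cache.pop(address, None)


@dataclass
class HomeNode:
    """Hypothetical home node: owns the memory and tracks who holds copies."""
    memory: dict = field(default_factory=dict)
    sharers: dict = field(default_factory=dict)  # address -> set of node names
    nodes: dict = field(default_factory=dict)    # node name -> RemoteNode

    def read(self, node, address):
        # Record the requester as a sharer, then hand it a copy.
        self.nodes[node.name] = node
        self.sharers.setdefault(address, set()).add(node.name)
        node.cache[address] = self.memory[address]
        return node.cache[address]

    def write(self, address, value):
        # Invalidate every cached copy before the new value becomes visible.
        for name in self.sharers.pop(address, set()):
            self.nodes[name].invalidate(address)
        self.memory[address] = value


home = HomeNode(memory={0x10: 1})
a, b = RemoteNode("A"), RemoteNode("B")
home.read(a, 0x10)
home.read(b, 0x10)
home.write(0x10, 2)    # both cached copies are invalidated
home.read(a, 0x10)     # node A fetches the fresh value from the home node
```

In this sketch, a remote node can never observe the old value after the write, because its cached copy is removed before the home node's memory is updated.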
It is desirable to be able to remove a node from such a system without having to reboot or power down the system. For example, it may be useful to replace or “hot swap” defective hardware, or to dedicate the node to performing a different shared computation. Shared computations may require a very long time to execute, and a reboot or power cycle would interrupt their execution. However, cache coherence protocols generally assume that the remote nodes are always present in the system, so an attempt to remove a node from a currently operating system will result in errors being generated, either in the hardware or by any executing software.
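The failure mode described above may be sketched as follows. This is an illustrative model only: the `Fabric` class, the `NodeRemovedError` exception, and the `invalidate_sharers` helper are hypothetical names introduced for this example, and an actual system might instead hang, time out, or raise a hardware fault rather than a software exception.

```python
class NodeRemovedError(RuntimeError):
    """Raised when a coherence message targets a node that is not present."""


class Fabric:
    """Hypothetical interconnect that delivers messages only to present nodes."""

    def __init__(self, node_names):
        self.present = set(node_names)
        self.delivered = []

    def remove(self, name):
        # Models pulling a node out of the running system.
        self.present.discard(name)

    def send_invalidate(self, name, address):
        if name not in self.present:
            raise NodeRemovedError(f"no response from node {name}")
        self.delivered.append((name, address))


def invalidate_sharers(fabric, sharers, address):
    # The protocol assumes every recorded sharer can still be reached.
    for name in sorted(sharers):
        fabric.send_invalidate(name, address)


fabric = Fabric(["A", "B"])
sharers = {"A", "B"}   # sharer list recorded before the removal
fabric.remove("B")     # node B is removed; the sharer list is not updated

try:
    invalidate_sharers(fabric, sharers, 0x10)
    outcome = "ok"
except NodeRemovedError as exc:
    outcome = str(exc)
```

Here the home node still believes node B holds a copy, so the invalidation addressed to node B cannot complete and the coherence operation ends in an error.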