1. Field of the Invention
The present invention relates to a method and system for retrieving the state of a failed CPU, cache, and memory node to a nonfaulty location for failure recovery purposes.
2. Description of the Related Art
Prior to the present invention, conventional systems have been unable to adequately deal with a failed computer processor, cache, or memory node.
CPU/cache nodes can fail due to permanent hardware faults, transient hardware faults, operator errors, environmental errors, and software faults. Whenever such a fault occurs in a CPU node, it can no longer participate as an operational member of the system, and the state locked in its cache hierarchy must either be retrieved or reconstructed. This process is a critical part of error recovery.
Moreover, recent advances in cache and memory technology have created a new type of problem in attempting to retrieve the state that is locked in the caches of a failed CPU/cache node.
Some state-of-the-art systems utilize an L3 cache, which is a new, relatively large cache (e.g., 1 Gbyte or larger). Because this cache is so large, a huge amount of state data is stored there at the time of any fault, and it is quite difficult to reconstruct this state efficiently with the conventional techniques.
The conventional systems have not yet addressed this problem, since the L3 cache is relatively new and has been incorporated only into relatively new architectures. In the past, the conventional systems have dealt only with relatively small caches (e.g., on the order of 100 Kbytes to 1 Mbyte). When such a cache (roughly 1000 times smaller than the new L3 cache) failed, a relatively small amount of data was lost, and the work required to recover such data was commensurately smaller (e.g., 1000 times smaller). With the new cache, by contrast, a significant amount of data is lost, and a significant amount of work and time is involved in attempting to recover the contents of such an L3 cache and reload them to the system.
Hereinbelow, a simplified description will be provided of the failure recovery protocol that is necessary when the present invention is not employed.
A conventional solution to failure recovery is oriented around the assumption that when a CPU/cache node fails, all of its volatile state (including that in memory) is lost and must be restored from disk. To make this possible, at some prior point in time (and periodically thereafter) the application stores a full copy of its entire state on disk. Thereafter, as state updates are made during the course of normal execution, these updates are also logged to disk. Thus, the disk contains information adequate to reconstruct the state of the application up to and including the last logged state update.
In the event of a failure, the state of the system is reconstructed by first loading the last full copy of its state that was saved to disk, and then applying the logged state updates to bring the system state up to date.
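The conventional checkpoint-and-log protocol described above can be illustrated with a minimal sketch. The class name `CheckpointedStore`, the file names, and the JSON encoding are hypothetical choices made only for illustration; they are not taken from any particular conventional system.

```python
import json
import os
import tempfile


class CheckpointedStore:
    """Toy key-value state with a full checkpoint plus an append-only update log."""

    def __init__(self, directory):
        # Hypothetical file names, chosen only for this illustration.
        self.checkpoint_path = os.path.join(directory, "checkpoint.json")
        self.log_path = os.path.join(directory, "updates.log")
        self.state = {}

    def checkpoint(self):
        # Store a full copy of the entire state on disk, then start a fresh log.
        with open(self.checkpoint_path, "w") as f:
            json.dump(self.state, f)
        open(self.log_path, "w").close()

    def update(self, key, value):
        # Log the update to disk, then apply it to the in-memory state.
        with open(self.log_path, "a") as f:
            f.write(json.dumps([key, value]) + "\n")
        self.state[key] = value

    @classmethod
    def recover(cls, directory):
        # Load the last full copy, then replay the logged updates in order.
        store = cls(directory)
        if os.path.exists(store.checkpoint_path):
            with open(store.checkpoint_path) as f:
                store.state = json.load(f)
        if os.path.exists(store.log_path):
            with open(store.log_path) as f:
                for line in f:
                    key, value = json.loads(line)
                    store.state[key] = value
        return store


def demo():
    with tempfile.TemporaryDirectory() as d:
        store = CheckpointedStore(d)
        store.update("a", 1)
        store.checkpoint()
        store.update("b", 2)
        store.update("a", 3)
        # Simulate a failure: the in-memory state is discarded.
        del store
        return CheckpointedStore.recover(d).state
```

In this sketch, `demo()` rebuilds the state purely from what reached the disk. Any update that was applied in memory but not yet logged would be absent from the recovered state.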
If a shared memory processor (SMP) instance contains large amounts of memory (e.g., 64 GB or larger), then the time required to bring a copy into memory from disk and then apply the updates can be several minutes. In addition, the recovered system state is only as recent as the last update logged to disk. Any more recent updates that were not logged to disk will have been lost.
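As a rough sense of scale for a restore from disk, the following back-of-the-envelope sketch estimates the time needed to read a saved copy of the state. The sustained disk throughput of 200 MB/s is purely an assumed figure for illustration and is not taken from this description; the time to apply the logged updates is ignored unless a log size is supplied.

```python
def restore_seconds(state_bytes, disk_bytes_per_second, log_bytes=0):
    """Estimate the time to read a saved state (plus optional log) from disk."""
    return (state_bytes + log_bytes) / disk_bytes_per_second


# Assumed figures: 64 GB of state, 200 MB/s sustained read throughput.
seconds = restore_seconds(64 * 10**9, 200 * 10**6)
minutes = seconds / 60
```

Under these assumptions the read alone takes 320 seconds, or a little over five minutes, before any logged updates are applied.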
Thus, prior to the invention, there was no efficient and adequate way to recover a failed computer processor, cache, or memory node having a large state size.
In view of the foregoing problems, drawbacks, and disadvantages of the conventional methods, it is an object of the present invention to provide a structure and method for the rapid recovery of the state of a failed CPU/cache/memory node in a distributed shared memory system.
In a first aspect, a method of (and system for) recovering a state of a failed node in a distributed shared memory system, includes directing a flush of data from a failed node, and flushing the data from the failed node to a memory node.
In a second aspect, a system for recovering a state of a failed node in a distributed shared memory system, includes a controller for directing a flush of data from a failed node, and a flush engine for flushing the data from the failed node to a memory node.
In a third aspect, a signal-bearing medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus to perform a method of recovering the state of a failed node in a distributed shared memory system, includes directing a flush of data from a failed node, and flushing the data from the failed node to a memory node.
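The two steps common to the above aspects, directing a flush and flushing data from the failed node to a memory node, might be modeled as in the following minimal sketch. All class and field names are hypothetical. The sketch also assumes that the cache contents of the failed node remain readable and that only lines marked as modified need to be written back; neither assumption is stated in the aspects above.

```python
from dataclasses import dataclass, field


@dataclass
class CacheLine:
    data: bytes
    modified: bool  # assumption: only modified lines differ from memory


@dataclass
class FailedNode:
    # Assumption: the cache remains readable after the node has failed.
    cache: dict = field(default_factory=dict)  # address -> CacheLine


@dataclass
class MemoryNode:
    memory: dict = field(default_factory=dict)  # address -> bytes


class FlushEngine:
    """Copies modified lines from the failed node to the memory node."""

    def flush(self, failed, target):
        flushed = 0
        for address, line in failed.cache.items():
            if line.modified:
                target.memory[address] = line.data
                line.modified = False
                flushed += 1
        return flushed


class Controller:
    """Directs the flush engine to flush a failed node to a memory node."""

    def __init__(self, engine):
        self.engine = engine

    def recover(self, failed, target):
        return self.engine.flush(failed, target)


def demo():
    failed = FailedNode(cache={
        0x100: CacheLine(b"new", modified=True),
        0x140: CacheLine(b"same", modified=False),
    })
    target = MemoryNode(memory={0x100: b"old", 0x140: b"same"})
    count = Controller(FlushEngine()).recover(failed, target)
    return count, target.memory
```

In this toy model, `demo()` flushes the single modified line, so the memory node ends up holding the most recent data without any restore from disk.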
With the unique and unobvious features and aspects of the invention, a method and system are provided which efficiently retrieve the state that is locked in the caches of a failed CPU/cache node, and especially in a large cache such as an L3 cache (e.g., 1 Gbyte or larger), such that the relatively large amount of state stored there at the time of the fault can be efficiently reconstructed.