1. Field of the Invention
This invention relates to computing network systems, and more particularly, to increasing the reliability and availability of a network system.
2. Description of the Relevant Art
High performance computing is often obtained by using high-end servers. In other cases, clusters of multi-processor nodes may be coupled via a network to provide high performance computing. In some cases, a cluster of nodes may have a lower financial cost than a high-end server. However, clusters of multi-processor nodes may lack the availability of high-end server based systems. One method to increase the availability of a cluster of multi-processor nodes is memory replication.
Memory replication generally includes maintaining one or more copies of a memory state in the cluster of nodes. In one embodiment, the memory state may include the data content of memory and the processor architectural state during the execution of an application. Each copy may need to be updated periodically in order to keep the copies synchronized with one another and with the original. If an executing application experiences a fault, the application can be restarted on another processor in another node, and the memory state can be recovered from the copy held in that node. One method of maintaining memory replication for higher availability is the use of software techniques. However, software techniques involve significant overhead and thus incur a performance penalty and impose scalability limits. Accordingly, efficient methods and mechanisms for managing clusters of computing nodes are desired.
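By way of illustration only, the software form of memory replication described above may be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular system: the names MemoryState, Node, synchronize, fail_over and demo are hypothetical, the memory state is modeled as two dictionaries, and a deep copy stands in for the transfer of state between nodes over the network.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class MemoryState:
    """Hypothetical memory state: memory contents plus processor architectural state."""
    memory: dict = field(default_factory=dict)
    registers: dict = field(default_factory=dict)


class Node:
    """Hypothetical cluster node that may hold a replica of another node's state."""

    def __init__(self, name):
        self.name = name
        self.state = MemoryState()
        self.replica = None  # last synchronized copy received from a primary

    def receive_replica(self, state):
        # Deep copy so that later writes on the primary do not alter the replica.
        self.replica = copy.deepcopy(state)


def synchronize(primary, backups):
    """Periodic update: push the primary's current memory state to every backup."""
    for backup in backups:
        backup.receive_replica(primary.state)


def fail_over(backup):
    """Restart on the backup node from its last synchronized copy."""
    if backup.replica is None:
        raise RuntimeError("no replica available on " + backup.name)
    backup.state = copy.deepcopy(backup.replica)
    return backup


def demo():
    primary, backup = Node("node0"), Node("node1")
    primary.state.memory["x"] = 1
    primary.state.registers["pc"] = 100
    synchronize(primary, [backup])

    # Work done after the last synchronization is not in the replica.
    primary.state.memory["x"] = 2
    primary.state.registers["pc"] = 104

    # Assume the primary faults here; the application restarts on the backup.
    recovered = fail_over(backup)
    return recovered.state.memory["x"], recovered.state.registers["pc"]
```

In this sketch every synchronization copies the entire memory state, and any progress made since the last synchronization is lost on failover. Both properties illustrate the overhead and scalability limits attributed above to software techniques.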