1. Field of the Invention
The present invention relates to a method and system for managing components in a networked computer system. In particular, the present invention relates to a method and system that dynamically allocates assignments and roles to components to maintain the availability of services provided by the components.
2. Discussion of the Related Art
Networked computer systems enable users to share resources and services. One computer can request and use resources or services provided by another computer. The computer requesting and using the resources or services provided by another computer is typically known as a client, and the computer providing resources or services to another computer is known as a server.
A group of independent network servers may be used to form a cluster. Servers in a cluster are organized so that they operate, and appear to clients, as if they were a single unit. A cluster and its network may be designed to improve network capacity by, among other things, enabling the servers within the cluster to shift work in order to balance the load. By enabling one server to take over for another, a cluster may also be used to enhance reliability and minimize downtime caused by an application or system failure.
Today, networked computer systems are used in many different aspects of our daily lives. They are used, for example, in business, government, education, entertainment, and communication. As the use of networked computer systems becomes more prevalent and our reliance on them increases, it has become increasingly important to achieve the goal of always-on computer networks, or “high-availability” systems.
A high-availability system needs to detect and recover from a failure in a way that is transparent to its users. For example, if a server in a high-availability system fails, the system must detect and recover from the failure with little or no impact on clients.
Various methods have been devised to achieve high availability in networked computer systems. For example, one method known as triple module redundancy, or “TMR,” is used to increase fault tolerance at the hardware level. Specifically, with TMR, three instances of the same hardware module execute concurrently. By comparing the results of the three hardware modules and using the majority result, one can detect the failure of a hardware module. However, TMR does not detect and recover from the failure of software modules. Another method for achieving high availability is software replication, in which a software module that provides a service to a client is replicated on at least two different nodes in the system. While software replication overcomes some disadvantages of TMR, it suffers from its own problems, including the need for complex software protocols to ensure that all of the replicas have the same state.
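By way of illustration only, the majority comparison described above for TMR may be sketched as follows. TMR is described here as a hardware-level technique, so this Python sketch models only the voting logic. The function name tmr_vote, its return convention, and the decision to raise an error when no two results agree are assumptions made for this example and are not drawn from any particular TMR implementation.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def tmr_vote(results: Sequence[Hashable]) -> Tuple[Hashable, List[int]]:
    """Compare the results of three redundant modules (hypothetical helper).

    Returns the majority result together with the indices of any modules
    whose result disagrees with the majority.
    """
    if len(results) != 3:
        raise ValueError("TMR compares exactly three results")

    # Find the most common result among the three modules.
    value, count = Counter(results).most_common(1)[0]

    # Assumption of this sketch: with three different results there is
    # no majority, so the failure cannot be masked and is reported.
    if count < 2:
        raise RuntimeError("no majority among the three results")

    # A module whose result differs from the majority is presumed failed.
    failed = [i for i, r in enumerate(results) if r != value]
    return value, failed


if __name__ == "__main__":
    # The third module disagrees, so it is flagged and its result is outvoted.
    print(tmr_vote([7, 7, 9]))  # (7, [2])
```

In this sketch, a single disagreeing module is both detected and masked. If all three modules were to execute the same defective software, however, they would agree on the same incorrect result and the vote would flag nothing, which is consistent with the limitation of TMR noted above.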
The use of replication of hardware or software modules to achieve high availability raises a number of new problems, including the management of the replicated hardware and software modules. The management of replicas has become increasingly difficult and complex, especially if replication is done at the individual software and hardware levels. Further, replication places a significant burden on system resources. Thus, there is a need for a system and method to efficiently manage replicas of software and hardware modules to achieve high availability.