This invention relates generally to servers and, particularly, to clusters or groups of servers that operate together.
Commonly, groups of servers are provided to execute complex tasks. Commonly server farms include large numbers of servers. These servers work together in either a peer-to-peer arrangement or in a variety of other hierarchies.
One type of clustered server is called a blade server. A blade server may be a thin module or electronic circuit board, usually for a single, dedicated application, such as serving web pages. A blade server is designed to be mounted in a blade server rack with a large number of other blade servers.
When any one of a large number of blade servers in a rack suffers failure in any of its components or its local disk, the failed blade server is simply considered a lost cause. A blade server may be taken off-line if it requires a boot from a failed local disk or network program load from a defective network interface card for its operating system loader. At this point, the failed blade server becomes inoperative hardware necessitating human interaction with the system in order to replace the component.
Moreover, the inability of one server to function may adversely impact the overall operation of the entire cluster of servers. Thus, the failure of even one server may have a relatively significant, if not catastrophic, impact on the overall operation of the cluster of servers.
Thus, there is a need for better ways to enable clusters of servers to handle defects that occur in one or more of the servers in the cluster.