1. Field of the Invention
The present invention generally relates to a system of interconnected computers implementing a redundant computer option in which one computer serves as a backup to another computer and automatically becomes active upon failure of the other computer. More particularly, the invention relates to embedding executable code in a mass storage device controller in a server computer to permit a backup server to detect the failure of a primary server and boot up to take over the function of the failed primary server.
2. Background of the Invention
Server computers can be coupled together into a larger system in a variety of configurations. One such configuration includes a pair of server computers in which one server functions as a primary server and the other server is a backup (or “slave”) server. This configuration is shown in FIG. 1. Two servers 52a and 52b are shown having a shared connection to a storage array 56. Server 52a may function as the primary server, while server 52b is the slave. Each server includes a mass storage device controller 58a, 58b which provides an interface to the storage array 56. The controllers 58 typically comprise circuit cards which are inserted into the servers. As such, the controller cards can easily be replaced as, for example, upgraded cards become available. Each server also includes a system read only memory (“ROM”) 60a, 60b which contains code executed by a central processing unit (“CPU”) (not specifically shown). Further still, the servers are interconnected by an asynchronous communication port 54 which permits the slave server to detect when the primary has failed. One of ordinary skill in the art will recognize that many other components are provided in the servers.
When the servers 52a, 52b are powered on, the primary server 52a performs its power on self test (“POST”) (the process that runs from the time power is applied until boot up) and then completes the boot process to begin run-time execution. The slave server 52b generally performs its POST, but does not complete the boot process. Instead, as explained below, the slave server monitors the communication port 54 to determine whether the primary server has failed. When the primary fails, the slave completes the boot process and takes over run-time execution.
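By way of illustration only, the power-on sequence described above can be viewed as a small state machine. The following Python sketch is not code from any actual system ROM; the names `Phase`, `STANDBY`, and `next_phase` are hypothetical and are introduced solely to show how the primary and slave servers diverge after POST.

```python
from enum import Enum, auto


class Phase(Enum):
    """Hypothetical phases; STANDBY is the slave's stalled, pre-boot state."""
    POST = auto()
    STANDBY = auto()
    BOOT = auto()
    RUN_TIME = auto()


def next_phase(phase: Phase, is_primary: bool, primary_failed: bool) -> Phase:
    """Return the phase that follows `phase` for one server.

    is_primary     -- True for the primary server, False for the slave
    primary_failed -- the slave's current conclusion about the primary
    """
    if phase is Phase.POST:
        # The primary proceeds to boot; the slave stalls after POST.
        return Phase.BOOT if is_primary else Phase.STANDBY
    if phase is Phase.STANDBY:
        # The slave leaves standby only once the primary has failed.
        return Phase.BOOT if primary_failed else Phase.STANDBY
    # BOOT leads to run-time execution; RUN_TIME is terminal in this sketch.
    return Phase.RUN_TIME
```

In this sketch, a slave remains in `STANDBY` for as long as `primary_failed` is false, which corresponds to the slave monitoring the communication port 54 rather than completing the boot process.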
The communication port 54 is used by the slave to detect when the primary has failed. The primary server 52a sends a “heartbeat” signal over the communication port 54 to the slave server 52b in accordance with a predetermined protocol and at a predetermined period (e.g., once per minute). The slave server 52b polls the communication port 54 for the heartbeat signals. If the primary server 52a fails to send a heartbeat signal, the slave server 52b will detect the lack of receipt of the heartbeat, determine that the primary server has failed and respond accordingly. The slave's response entails a number of activities including configuring the connection with the storage array 56 and completing the boot process.
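The heartbeat detection described above may be sketched, under stated assumptions, as follows. The class `HeartbeatMonitor`, its method names, and the `missed_allowed` tolerance are hypothetical; the description does not specify how many missed heartbeats constitute a failure, nor the protocol used on the port. Timestamps are passed in explicitly so that the sketch does not depend on any particular port or clock interface.

```python
class HeartbeatMonitor:
    """Illustrative slave-side failure detector (not an actual RSO implementation)."""

    def __init__(self, period_s: float = 60.0, missed_allowed: int = 2,
                 start_time: float = 0.0) -> None:
        # period_s: heartbeat period; 60 s mirrors the "once per minute" example.
        # missed_allowed: assumed tolerance, in periods, before declaring failure.
        self.period_s = period_s
        self.missed_allowed = missed_allowed
        self.last_seen = start_time

    def record_heartbeat(self, now: float) -> None:
        """Call when polling the port yields a heartbeat at time `now`."""
        self.last_seen = now

    def primary_failed(self, now: float) -> bool:
        """True once more than `missed_allowed` periods pass with no heartbeat."""
        return (now - self.last_seen) > self.period_s * self.missed_allowed


# Usage: a heartbeat arrives at t = 60 s and none arrives afterward.
monitor = HeartbeatMonitor(period_s=60.0, missed_allowed=2, start_time=0.0)
monitor.record_heartbeat(60.0)
still_alive = not monitor.primary_failed(150.0)  # 90 s of silence, within tolerance
failed = monitor.primary_failed(181.0)           # 121 s of silence, beyond tolerance
```

Once `primary_failed` returns true, the slave would proceed with its response, which per the description includes configuring the connection with the storage array 56 and completing the boot process; those steps are outside the scope of this sketch.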
The code that the slave server 52b runs to cause it to stall during the initialization process, detect whether the primary server has failed and complete the boot process is generally part of the slave's system ROM 60b. That code is specifically shown in FIG. 1 as the redundant server option (“RSO”) code 62b. This configuration, in which the system ROM contains the RSO code, has several deficiencies. For instance, the RSO code communicates with the controller card 58b and accordingly is specific to that particular controller. Because there are a variety of different controllers 58b currently available, the system ROMs must include RSO code that can communicate with any such controller, which complicates the system ROM code. Further, if a new controller 58b becomes available, new system ROM code must be developed and tested to include RSO support for the new controller. This requires significant development effort, time and cost. It also requires the operator of the server to “reflash” the system ROM in every server in which new controllers are installed to update the system ROM code. Many companies have numerous servers (e.g., hundreds), and reflashing every system ROM can be a very labor-intensive, time-consuming, and thus undesirable effort. In short, developing a new mass storage device controller requires new system ROM code to be developed and deployed to support the new controller in a server configuration which includes a standby, redundant server.
A redundant server configuration is needed which has little or no impact on system ROM code when new mass storage device controllers are introduced into the marketplace. Despite the advantages such a system would provide, no such system is known to exist to date.