1. Field of the Invention
The invention relates to fault tolerant computer systems. More particularly, the invention relates to providing fault tolerant execution of application programs in a server network, by providing a method and system for executing an application program in a backup server if it is determined that a primary server, which normally executes the program, has failed.
2. Description of the Related Technology
As computer systems and networks become more complex and capital intensive, system failures which result in lost data and/or inaccessible applications have become unacceptable. In the computer industry, the reduction of computer failures and computer “downtime” is a major focus for companies trying to achieve a competitive edge over their competitors. The reduction of downtime due to system failures and maintenance is critical to providing quality performance and product reliability to the users and buyers of computer systems. Particularly with respect to server computers which are accessed and utilized by many end users, the reduction of server downtime is an extremely desirable performance characteristic. This is especially true for users who depend on the server to obtain data and information in their daily business operations.
As servers become more powerful, they are also becoming more sophisticated and complex. A server is typically a central computer in a computer network which manages common data and application programs that may be accessed by other computers, otherwise known as “workstations,” in the network. Server downtime, resulting from hardware or software faults or from repair and maintenance, continues to be a significant problem today. By one estimate, the cost of downtime in mission critical environments has risen to an annual total of $4.0 billion for U.S. businesses, with the average downtime event resulting in a $140 thousand loss in the retail industry and a $450 thousand loss in the securities industry. It has been reported that companies lose as much as $250 thousand in employee productivity for every 1% of computer downtime. With emerging internet, intranet and collaborative applications taking on more essential business roles every day, the cost of network server downtime will continue to spiral upward.
Various systems for promoting fault tolerance have been devised. To prevent network down time due to power failure, uninterruptible power supplies (UPS) are commonly used. Basically a rechargeable battery, a UPS provides insurance that a workstation or server will survive during even extended periods of power failures.
To prevent network downtime due to failure of a storage device, data mirroring was developed. Data mirroring provides for the storage of data on separate physical devices operating in parallel with respect to a file server. Duplicate data is stored on separate drives. Thus, when a single drive fails the data on the mirrored drive may still be accessed.
To prevent network downtime due to a failure of a print/file server, server mirroring has been developed. Server mirroring as it is currently implemented requires a primary server and storage device, a backup server and storage device, and a unified operating system linking the two. An example of a mirrored server product is the Software Fault Tolerance level 3 (SFT III) product by Novell Inc., 1555 North Technology Way, Orem, Utah, as an add-on to its NetWare® 4.x product. SFT III maintains servers in an identical state of data update. It separates hardware-related operating system (OS) functions on the mirrored servers so that a fault on one hardware platform does not affect the other. The server OS is designed to work in tandem with two servers. One server is designated as a primary server, and the other is a secondary server. The primary server is the main point of update; the secondary server is in a constant state of readiness to take over. Both servers receive all updates through a special link called a mirrored server link (MSL), which is dedicated to this purpose. The servers also communicate over the local area network (LAN) that they share in common, so that one knows if the other has failed even if the MSL has failed. When a failure occurs, the second server automatically takes over without interrupting communications in any user-detectable way. Each server monitors the other server's NetWare Core Protocol (NCP) acknowledgments over the LAN to see that all the requests are serviced and that OSs are constantly maintained in a mirrored state.
When the primary server fails, the secondary server detects the failure and immediately takes over as the primary server. The failure is detected in one or both of two ways: the MSL link generates an error condition when no activity is noticed, or the servers communicate over the LAN, each one monitoring the other's NCP acknowledgment. The primary server is simply the first server of the pair that is brought up. It then becomes the server used at all times and it processes all requests. When the primary server fails, the secondary server is immediately substituted as the primary server with identical configurations. The switch-over is handled entirely at the server end, and work continues without any perceivable interruption.
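The two detection paths described above (silence on the MSL and missing acknowledgments over the LAN) can be illustrated with a minimal sketch. It is not SFT III code: the `PeerMonitor` class, its method names, and the `timeout` parameter are hypothetical, and requiring both channels to be silent before declaring a failure is an assumption made here so that an MSL fault alone is not mistaken for a server fault.

```python
from dataclasses import dataclass


@dataclass
class PeerMonitor:
    """Hypothetical monitor one server keeps for its mirrored partner."""
    timeout: float                  # seconds of silence tolerated (assumed parameter)
    last_msl_activity: float = 0.0  # time of last traffic seen on the mirrored server link
    last_lan_ack: float = 0.0       # time of last acknowledgment seen over the LAN

    def record_msl_activity(self, now: float) -> None:
        self.last_msl_activity = now

    def record_lan_ack(self, now: float) -> None:
        self.last_lan_ack = now

    def silent_channels(self, now: float) -> list:
        """Return the names of the channels that have been quiet too long."""
        silent = []
        if now - self.last_msl_activity > self.timeout:
            silent.append("msl")
        if now - self.last_lan_ack > self.timeout:
            silent.append("lan")
        return silent

    def peer_failed(self, now: float) -> bool:
        # Assumption: treat the peer as failed only when both channels are silent.
        return self.silent_channels(now) == ["msl", "lan"]
```

In this sketch a quiet MSL with a live LAN channel is reported by `silent_channels` but does not by itself trigger a takeover.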
Power supply backup, data mirroring, and server mirroring all increase security against down time caused by a failed hardware component, but they all do so at considerable cost. Each of these schemes requires the additional expense and complexity of standby hardware that is not used unless there is a failure in the network. Mirroring, while providing redundancy to allow recovery from failure, does not allow the redundant hardware to be used to improve cost/performance of the network.
What is needed is a fault tolerant system for computer networks that can provide all the functionality of UPS, disk mirroring, or server mirroring without the added cost and complexity of standby/additional hardware. What is needed is a fault tolerant system for computer networks which smoothly interfaces with existing network systems. Additionally, what is needed is a method or system of clustering application software programs which may be executed by servers within the network such that a software application being executed on a first server may be “backed-up”, e.g., clustered, by a second server which continues execution of the application if for some reason the first server fails.
The invention addresses the above and other needs by providing a method and system for clustering software application programs which are executable by one or more servers in a server network.
In one embodiment, a system for fault tolerant execution of an application program in a server network, includes: a first server for executing the application program; a cluster network database, coupled to the first server; an object, stored in the cluster network database, which represents the program and contains information pertaining to the program; a failure detection module which detects a failure of the first server; a second server, coupled to the cluster network database; and a failover module which loads the application program in the second server upon detection of the failure of the first server, in accordance with the information contained in the object.
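As an illustration only, the arrangement in this embodiment might be sketched as follows. The `ProgramObject` and `Server` classes, the `failover` function, and the `"host"` and `"backup"` keys inside the object are hypothetical names chosen for the sketch; the text above says only that the object contains information pertaining to the program.

```python
from dataclasses import dataclass, field


@dataclass
class ProgramObject:
    """Hypothetical object stored in the cluster network database."""
    program: str
    info: dict = field(default_factory=dict)  # assumed keys: "host", "backup"


@dataclass
class Server:
    name: str
    running: list = field(default_factory=list)  # programs currently loaded
    alive: bool = True


def failover(database: dict, failed: Server, backup: Server) -> list:
    """Load on `backup` each program hosted by `failed` whose object names `backup`."""
    if failed.alive:
        return []  # no failure detected, nothing to move
    moved = []
    for obj in database.values():
        if obj.info.get("host") == failed.name and obj.info.get("backup") == backup.name:
            backup.running.append(obj.program)  # load the program in the second server
            obj.info["host"] = backup.name      # assumed bookkeeping of the current host
            moved.append(obj.program)
    return moved
```

Here the failure detection module is reduced to the `alive` flag; a real detector would set that flag from the kind of monitoring described in the related technology section.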
In another embodiment, a system for fault tolerant execution of an application program in a server network, includes: a first server for executing the application program; a cluster network database for storing objects therein; a cluster interface for prompting a system operator for information to be stored in the objects, wherein the information comprises: a host server attribute which identifies which server is currently executing the program; a primary server attribute which identifies which server is primarily responsible for executing the program; and a backup server attribute which identifies which server is a backup server for executing the program if the primary server experiences a failure. The system further includes a second server, coupled to the database, for executing the program if the first server fails; a failure module which detects if the first server has failed; a failover module which executes the program in the second server if it is determined that the first server has failed, the failover module comprising: a backup status module which reads the backup server attribute in the object and determines whether the backup server attribute names the second server as the backup server; a backup loading module which loads the program in the second server if the backup server attribute names the second server as the backup server; a phoenix module which determines if the first server is once again operational; and a failback module which resumes execution of the program in the first server if it is determined that the first server is once again operational, the failback module comprising: a backup unload module which unloads the program from a random access memory in the second server; a verification module which verifies that the program has been unloaded from the second server; and a primary load module which loads the program in a random access memory in the first server after the program has been unloaded from the second server.
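The failback order recited above (unload from the backup, verify the unload, then load on the primary) can be sketched as below. The class and function names are hypothetical, a Python set stands in for each server's random access memory, and updating the host attribute at the end is an assumption consistent with its stated role of identifying the server currently executing the program.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Server:
    name: str
    memory: set = field(default_factory=set)  # stand-in for programs loaded in RAM


@dataclass
class ProgramObject:
    program: str
    host: Optional[str]  # server currently executing the program
    primary: str         # server primarily responsible for the program
    backup: str          # server that takes over if the primary fails


def fail_back(obj: ProgramObject, primary: Server, backup: Server) -> bool:
    """Return the program to the primary once it is operational again."""
    if obj.primary != primary.name or obj.host != backup.name:
        return False                      # nothing to fail back
    backup.memory.discard(obj.program)    # backup unload module
    if obj.program in backup.memory:      # verification module
        return False
    primary.memory.add(obj.program)       # primary load module
    obj.host = primary.name               # assumed update of the host attribute
    return True
```

The verification step sits between the unload and the load so that the program is never loaded on the primary while still resident on the backup.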
In a further embodiment, a system of loading and registering a software program in a cluster network database, coupled to a first server and a second server in a server network, includes: a primary load module which loads the program in the first server; a preregistration module which determines if the program was previously registered; an object module which creates an object for the program and stores the object in the database, if it is determined that the program was not previously registered; a nonfailure module which determines if a system operator previously unloaded the program and sets a host server attribute within the object to a value of null, if it is determined that a system operator previously unloaded the program; a phoenix detect module which determines if the first server is coming back from a failback process, if it is determined that the system operator did not previously unload the program; and a synchronization module which synchronizes all replicated databases within the network, if it is determined that the first server is not coming back from failback process.
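One possible reading of this load-and-register sequence is sketched below, following the modules in the order they are recited. How each condition is detected is not specified in the text, so the sketch takes `operator_unloaded` and `returning_from_failback` as plain flags; the function name, the dictionary layout of the object, the choice to set the host to the loading server when the object is created, and the returned action labels are all hypothetical.

```python
def register_program(database: dict, program: str, server: str,
                     operator_unloaded: bool,
                     returning_from_failback: bool) -> list:
    """Return the ordered list of actions taken while loading and registering."""
    actions = ["load"]                      # primary load module
    if program not in database:             # preregistration module
        # object module: create and store an object for the program
        database[program] = {"host": server, "primary": server}
        actions.append("create_object")
    if operator_unloaded:                   # nonfailure module
        database[program]["host"] = None    # host server attribute set to null
        actions.append("set_host_null")
    elif not returning_from_failback:       # phoenix detect module
        actions.append("synchronize")       # synchronization module (placeholder)
    return actions
```

The synchronization of replicated databases is represented only by a label here, since the replication mechanism is outside what this passage describes.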
In another embodiment, a system for fault tolerant execution of an application program in a server network, includes: a first server for executing the application program; a cluster network database, coupled to the first server; an object, stored in the cluster network database, which represents the program and contains information pertaining to the program; a failure detection module which detects a failure of the first server; a second server, coupled to the cluster network database; a reading module which reads the information from the object; and a failover module which loads the application program in the second server upon detection of the failure of the first server, in accordance with the information contained in the object.
In a further embodiment, a system for fault tolerant execution of an application program in a server network having a first and second server, includes: means for executing the application program in the first server; means for storing an object which represents the program in a cluster network database, wherein the object contains information pertaining to the program; means for detecting a failure of the first server; and means for executing the application program in the second server upon detection of the failure of the first server, in accordance with the information in the object.
In yet another embodiment, a system for fault tolerant execution of an application program in a server network having a first and second server, includes: means for executing the application program in the first server; means for storing an object which represents the program in a cluster network database, wherein the object contains information pertaining to the program; means for detecting a failure of the first server; means for reading the information contained in the object; and means for executing the application program in the second server upon detection of the failure of the first server, in accordance with the information in the object.
In another embodiment, a system for providing fault tolerant execution of an application program in a server network having a first and second server, includes: means for executing said application program in said first server; means for detecting a fault in the execution of said application program in said first server; and means for automatically, without operator intervention, executing said application program in said second server in response to said detecting step.
In a further embodiment, a system for providing fault tolerant execution of an application program in a server network having a first and second server, includes: means for executing said application program in said first server; means for detecting a fault in the first server; and means for automatically, without operator intervention, executing said application program in said second server in response to said detecting step.
In another embodiment, a system for providing fault tolerant execution of an application program in a server network having a first and second server, includes: means for executing said application program in said first server; means for detecting a failure of said first server to properly run said application; and means for automatically, without operator intervention, executing said application program in said second server in response to said detecting step.
In a further embodiment, a network server system, includes: a first server and a second server, each configured to execute a first application program; a first control module for causing said first server to execute said first application program when said first server is capable of executing said first application program; and a second control module for causing said second server to execute said first application program when said first server is incapable of executing said first application program.