This invention relates to computer systems, and more particularly to computer systems utilizing standby computers to provide back-up for an active computer.
Some computerized applications, such as those implementing a billing information system, require high operational reliability, because processing is ongoing and the input data is subject to frequent revision. For these applications, the availability of continuously functional hardware and the accurate backup of data are critical. To insure against data loss and protect against hardware failure, such applications are often implemented with a high-availability computer system. A high-availability computer system should function virtually all the time, or, at least, more often than normal computer hardware reliability factors typically allow.
To achieve the desired reliability, high-availability computer systems are known to use two computers running substantially in parallel (duplex systems). In a duplex system, one computer in the pair is active and performs the system's processing and data handling functions, including the modification of existing data and the addition of new data. The other computer, which replicates the processing capabilities of the active computer and has access to the same (or equivalent) data, is maintained in a standby mode, ready to assume the active state in the event of a problem with the active computer. To effectively implement a transition from standby to active, all data available to the standby computer must be current, reflecting all changes made by the active system.
An illustrative case of a known duplex system is shown in FIG. 1. Computers 1 and 2 are connected via a network 5. The internal disks 3 and 4 on each computer 1 and 2, respectively, store the data for the system. One method for maintaining synchronized data in such a duplex system is writing the data to storage devices 3 and 4 in each computer, 1 and 2 respectively, at each processing step, i.e., whenever data is accessed, transferred or modified. The data for the system shown in FIG. 1 may be stored in replicated directories which reside on the internal disks 3 and 4. Any modifications made to files in a replicated directory on the active computer are mirrored to the same directory on the standby computer.
For example, when computer 1 is active, and data is written to a file, it is actually written to two files, one on disk 3 and one on disk 4. Each file has the same name and, if the system is working correctly, the files are identical. Mirroring is accomplished by sending commands across the network 5 to which both computers 1 and 2 are connected.
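The replicated write described above can be illustrated with a minimal sketch. The function name `mirrored_write`, the file name `billing.dat`, and the use of two local temporary directories in place of disks 3 and 4 are assumptions made for illustration only; in the known system the second write would be carried by commands sent across the network 5, which this sketch does not model.

```python
import os
import tempfile


def mirrored_write(replica_dirs, filename, data):
    """Write the same bytes to a same-named file in every replica directory."""
    for directory in replica_dirs:
        with open(os.path.join(directory, filename), "wb") as f:
            f.write(data)


# Hypothetical stand-ins for internal disks 3 and 4.
disk3 = tempfile.mkdtemp()
disk4 = tempfile.mkdtemp()

# One logical write produces two physical files with the same name.
mirrored_write([disk3, disk4], "billing.dat", b"record-1\n")
```

If either write fails or is delayed, the two files diverge, which is the weakness of this approach discussed in the background.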
This method of replication results in disadvantageously long transitions and unreliable data back-up. Transitions are time consuming because the data replication function ties state transitions to system management. To invoke a transition without compromising data replication, the system manager (a software entity) must notify each application in the system of a change in system states. This notification is typically done in a prescribed sequence, and the system manager waits for a reply before notifying the next application in the sequence. Before sending the reply, the application completes its processing steps, which involves writing and replicating data. Replication, in turn, requires transporting information across the network 5, which takes time and creates an opportunity for data loss during transmission. This results in lengthy state transitions (e.g., standby takeover of the active computer's duties). Due to an application's need for frequent and immediate access to data, a long takeover time creates an unreasonable risk of data loss.
The typical duplex system, as shown in FIG. 1, also provides no data back-up when the system is running simplex. Each computer (1 or 2) stores data to its internal disk (3 or 4), respectively. When one of the computers 1 or 2 stops, either due to a manual command or a failure, the remaining computer writes data to its internal disk. It is a distinct disadvantage of known high-availability systems that, in the simplex mode, no data back-up exists.
In accordance with the principles of the present invention, there is provided a system for monitoring and maintaining multiple computers operating substantially in parallel, each of which can assume an active state, a standby state or a stopped state. In the active state, the applications (software) residing on the computer are running and ready to accept and process data. In the standby state, certain applications are running; however, data is not accepted or processed. A primary function of a computer in the standby state is to monitor the other computers in the system and itself, and to assume an active state when necessary. In a stopped state, the applications responsible for processing data are not running. This state may be the result of a manual request entered by the operator or of a system failure.
Data storage for the system is accomplished with shared, external storage devices. Each computer has equal access to the shared storage device arrangement; however, only one computer may write to it at a time. The external storage devices are configured to mirror each other; that is, the physical disks are linked together as a single logical disk by a disk manager. To the computers, these mirrored pairs appear as a single disk. The disk manager keeps mirrored pairs identical: all data updates go to both disks. In the event that one member of the pair fails, the active computer continues to operate with the disk manager making use of the remaining functional disk. When the failed disk is replaced, the disk manager brings the new disk up to date with its partner. In addition, any number of disks may be used to meet the storage needs of the system. In an exemplary embodiment, each additional disk has a backup, creating mirrored pairs.
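The behavior of a mirrored pair under a disk manager may be sketched as follows. The class `MirroredPair`, its method names, and the use of in-memory dictionaries in place of physical disks are hypothetical and chosen only for illustration; the sketch is not the disk manager of any particular product.

```python
class MirroredPair:
    """Two physical disks presented to the computers as one logical disk."""

    def __init__(self):
        self.disks = [{}, {}]          # in-memory stand-ins for physical disks
        self.healthy = [True, True]

    def write(self, key, value):
        """Send every update to all functional members of the pair."""
        if not any(self.healthy):
            raise IOError("no functional disk in the mirrored pair")
        for index, disk in enumerate(self.disks):
            if self.healthy[index]:
                disk[key] = value

    def read(self, key):
        """Read from the first functional member."""
        for index, disk in enumerate(self.disks):
            if self.healthy[index]:
                return disk[key]
        raise IOError("no functional disk in the mirrored pair")

    def fail(self, index):
        """Mark one member as failed; the pair keeps operating on the other."""
        self.healthy[index] = False

    def replace(self, index):
        """Install a new disk and bring it up to date with its partner."""
        partner = 1 - index
        if not self.healthy[partner]:
            raise IOError("partner disk is not available for resynchronization")
        self.disks[index] = dict(self.disks[partner])
        self.healthy[index] = True


pair = MirroredPair()
pair.write("invoice-1", "open")
pair.fail(0)                        # one member fails
pair.write("invoice-2", "paid")     # the active computer continues on the survivor
pair.replace(0)                     # the new disk is resynchronized from its partner
```

Because callers use only `write` and `read`, they see a single disk regardless of how many members are currently functional.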
The computer states are controlled by a software implemented system manager which determines when a system state transition should occur and then invokes the transition. The system manager resides on each computer in the system and any system manager may take action. A transition determination is based upon the state of the data processing applications on each computer, the data processing applications on the other computers in the system, and the states of the external storage devices. When a system is running duplex, a copy of the system manager runs on each computer. The copy running on the standby system monitors the data processing applications on its partner, the active system. If its partner becomes inactive, the system manager transitions the local (standby) system to active. The copy running on the active monitors the standby for a stopped state, in which case it issues a periodic alarm to warn the system administrator that the system is now running simplex (no backup).
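The duplex monitoring rule may be illustrated by the following minimal sketch. The names `State` and `manager_step` are hypothetical, and the sketch considers only the local and partner computer states; the states of the external storage devices, which also enter into the transition determination, are omitted for brevity.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    STANDBY = "standby"
    STOPPED = "stopped"


def manager_step(local, partner):
    """One monitoring pass by the system manager copy on the local computer.

    Returns (new_local_state, raise_alarm).
    """
    # Standby copy: take over when the partner is no longer active.
    if local is State.STANDBY and partner is not State.ACTIVE:
        return State.ACTIVE, False
    # Active copy: warn that the system is running simplex (no backup).
    if local is State.ACTIVE and partner is State.STOPPED:
        return State.ACTIVE, True
    # Otherwise no transition and no alarm.
    return local, False


# The standby computer observes that its partner has stopped and takes over.
takeover = manager_step(State.STANDBY, State.STOPPED)
# On a later pass, the now-active computer reports simplex operation.
simplex = manager_step(State.ACTIVE, State.STOPPED)
```

In a running system this function would be called repeatedly, so that the alarm is reissued on each pass for as long as the partner remains stopped.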
In an exemplary embodiment, the system manager uses a software entity to query the states of applications running on its own (local) and other (remote) computers and the states of the external storage devices. The state information is returned to the system manager which takes action based upon predetermined state information criteria.