The present invention relates to a system and method for providing redundant computing, and more specifically, to a system and method for providing a plurality of state-compatible computers.
Throughout the history of computing, it has been desirable to develop systems for reliable data processing. In other words, it is desirable to design a computing system that can prevent the loss of data, and can recover to a known state when a failure occurs.
Reliable Data Storage
One method of fault recovery is a backup, namely, storing all of the computer's data and programs on another medium so that in the event of a failure, the computer can be restored to the state of the last backup. While almost every computer system with mission critical data is backed up, the use of backups for fault recovery is fraught with imperfections.
One method of making a backup is to close off access to the data being backed up, to copy all of the data to some form of mass storage, and then to again permit access to the now-backed up data. This is referred to as a full backup. A full backup made in this way, obviously, requires that for some period of time, namely, the time it takes to back up the computer, all users of the computer are denied access. For safety, the media on which the full backup is stored would preferably be transmitted or stored at a remote location.
One drawback of the full backup method is that the users may be denied access at times that are inconvenient. Another drawback results from the fact that in the event of a failure, all data created since the last full backup is lost. Thus, the administrator must constantly trade the "cost" of lost data on the one hand against the frequency of backing up the computer and the time required to back up the system (during which users will be blocked) on the other. Furthermore, in the event of a failure, all of the users must stop using the system, and the most recent backup copy must be located, retrieved and restored. Only then may the users resume using the system, and all of their data created since the last full backup is still lost.
A modification on the full backup theme is making periodic full backups, and between the full backups, making incremental backups of the data that changed since the previous full or incremental backup. Since the incremental backup generally requires substantially less time to perform than the full backup, the users are blocked from access for a shorter period. The users are nevertheless still blocked from access for a time, and a significant drawback arises in restoring the data, because an ordered "set" of tapes now must be located, retrieved and restored in the precise order they were produced.
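The ordering requirement on restore can be illustrated with a minimal sketch. The `Backup` record and `restore` function below are hypothetical names introduced only for illustration, and the sketch assumes each backup records a sequence number and a mapping of changed paths to contents (with `None` marking a deletion); none of this is specified by the text above.

```python
from dataclasses import dataclass, field


@dataclass
class Backup:
    sequence: int   # order in which the backup was produced (assumed field)
    kind: str       # "full" or "incremental"
    changes: dict = field(default_factory=dict)  # path -> contents; None marks a deletion


def restore(backups):
    """Rebuild state from the most recent full backup plus every later incremental."""
    ordered = sorted(backups, key=lambda b: b.sequence)
    full_positions = [i for i, b in enumerate(ordered) if b.kind == "full"]
    if not full_positions:
        raise ValueError("no full backup available")
    # Everything before the latest full backup is superseded by it.
    chain = ordered[full_positions[-1]:]
    state = {}
    for backup in chain:
        # Later backups overwrite earlier ones, so order matters here.
        for path, contents in backup.changes.items():
            if contents is None:
                state.pop(path, None)
            else:
                state[path] = contents
    return state
```

Applying the incrementals in any other order could resurrect deleted data or leave stale contents in place, which is why the whole set must be located and replayed in sequence.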
Another technology designed to decrease the computer's downtime due to a fault is the RAID storage array. A RAID storage array provides fault tolerance in that if a hard drive in the array fails, it can be replaced and its contents automatically rebuilt. More advanced RAID arrays include a spare, and begin to build the spare as a replication of a disk that is predicted to fail. Using certain criteria, drives can often be replaced long in advance of any actual failure. Moreover, the RAID array can "hot swap," meaning that a failed drive can be removed even when the system is running, thus preventing downtime due to a component failure. Despite these abilities, however, RAID arrays have a single point of connection, and thus are vulnerable to failing as a whole.
Another technology that prevents the single point of failure is mirroring. A mirrored drive has a counterpart drive that is constantly fed the same transactions, permitting the mirroring drive to remain an identical image of the mirrored drive. In the event that the mirrored drive were to fail, the mirroring drive can be used instead. Mirroring does present drawbacks. For example, where the two mirrored drives disagree and neither has apparently failed, there is no RAID-like redundancy to arbitrate the data inconsistency. Of course, it may be possible to mirror RAID arrays. Even so, if the host computer fails, the storage unit may have preserved a mid-operation state from which the host computer cannot resume.
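The arbitration problem with a two-way mirror can be seen in a small sketch. Majority voting is used here purely as an illustrative stand-in for arbitration (it is not how RAID parity works, and it is not a mechanism described above); `arbitrate` is a hypothetical helper.

```python
from collections import Counter


def arbitrate(copies):
    """Return the value held by a strict majority of copies, or None if there is none."""
    if not copies:
        return None
    value, count = Counter(copies).most_common(1)[0]
    return value if count * 2 > len(copies) else None


# Two mirrored drives that disagree: neither copy can be preferred.
two_way = arbitrate([b"block-A", b"block-B"])

# With a third copy, the disagreement can be resolved.
three_way = arbitrate([b"block-A", b"block-B", b"block-A"])
```

With only two copies, any disagreement is a tie, which is the situation the paragraph above describes.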
A newer technology that tries to address some of these drawbacks is a system built on mirrored drives. The system periodically interrupts the mirror function while it creates a backup of the mirroring drive. When the backup is complete, the system uses a replication technique to restore the mirror relationship. Vendors such as Hitachi, EMC and IBM have data storage products that provide for synchronous and asynchronous replication and mirroring. During the period that the mirror relationship is broken, however, the system still has a single point of failure, and is subject to data loss.
Accordingly, while each of the foregoing technologies provides varying levels of cost and reliability for data storage, each has substantial drawbacks, including causing access denial and delay after failure, having a single point of failure, and/or preserving a state from which a restarted computer cannot resume.
Just as many techniques have come about for supposedly reliable data storage, many have come about that purport to provide reliable computing. The problem of reliable hosting concerns the case in which the host computer fails. Reliable hosting techniques fall generally into two main categories, namely failover and clustering.
The technique of failover is accomplished by having two generally identical computers, only one of which performs work at any given time. If that one fails, the second computer takes over, ostensibly from where the first left off. This event is known as failover.
The technique of clustering, on the other hand, is accomplished by having multiple machines. Each machine in the cluster performs its individual tasks (sometimes the same tasks), but each has additional capacity to handle the tasks of one or more failed machines in the cluster. While clustering can prevent downtime from a failure because other processing power can be almost immediately substituted for the system that went down, it does not inherently fail over. That is, the applications on the failed component need to be restarted on another component, and thus some data is usually lost.
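A minimal sketch of the failover technique follows, assuming the standby detects failure by a missed heartbeat. The text above does not say how failure is detected, so the heartbeat, the timeout value, and the `FailoverPair` class are all illustrative assumptions.

```python
class FailoverPair:
    """Two generally identical machines; only the active one does work."""

    def __init__(self, timeout):
        self.timeout = timeout      # seconds of silence tolerated (assumed policy)
        self.active = "primary"
        self.last_heartbeat = 0.0

    def heartbeat(self, now):
        """Record that the primary reported in at time `now`."""
        self.last_heartbeat = now

    def check(self, now):
        """Promote the standby if the primary has been silent for too long."""
        if self.active == "primary" and now - self.last_heartbeat > self.timeout:
            self.active = "standby"  # this transition is the failover event
        return self.active


pair = FailoverPair(timeout=3.0)
pair.heartbeat(10.0)
before = pair.check(12.0)   # primary still within its timeout
after = pair.check(14.0)    # silence exceeded the timeout, standby takes over
```

The sketch only switches which machine is active; whether the standby can actually resume "from where the first left off" depends on how much state it shares with the primary.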
There are also some systems that combine features of failover and clustering systems.
In one hybrid failover/clustering architecture, two machines are used, and they must be and remain in substantially identical states. In other words, not only the disks but also the memory must be mirrored. Such a system is very costly, and requires a master controller to operate. One example of such a failover system is called the IBM Sysplex. The Sysplex architecture has a master controller connected to two identical computers. The two computers synchronize to the master controller's clock (or heartbeat). As memory is changed in one machine, it gets time-stamped and shipped to the other. The master controller must therefore arbitrate memory locks and facilitate the transfer. Such a system, of course, still has a single point of failure, namely, the master controller.
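The idea of time-stamping memory changes and shipping them to the other machine can be sketched as follows. The text says only that changes are time-stamped and shipped; the rule that a newer timestamp wins, and the `MirroredMemory` class itself, are assumptions made for illustration and are not a description of the actual Sysplex implementation.

```python
class MirroredMemory:
    """Receiving side of a memory mirror that applies time-stamped writes."""

    def __init__(self):
        self.cells = {}    # address -> value
        self.stamps = {}   # address -> timestamp of the last applied write

    def apply(self, address, value, timestamp):
        """Apply a shipped write unless a newer write to that address already landed."""
        if timestamp <= self.stamps.get(address, float("-inf")):
            return False   # stale or duplicate update, ignored
        self.cells[address] = value
        self.stamps[address] = timestamp
        return True


mirror = MirroredMemory()
mirror.apply(0x10, 1, timestamp=5)
mirror.apply(0x10, 2, timestamp=7)
stale_applied = mirror.apply(0x10, 9, timestamp=6)  # arrives late, older than the last write
```

Both machines must agree on what the timestamps mean, which is why they synchronize to a common clock.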
Another technique designed to provide reliable hosting is the DEC VAX shared disk architecture. A number of VAX computers share a storage system. This architecture allows for processing scalability as well as some amount of failover redundancy, for example, because a task has access to the same data when it is run on any machine in the cluster.
A major drawback of clustering is that both the application running in the clustered environment and the operating system need to be cluster-aware and need to behave in the manner dictated by the cluster.
Today, many operating systems are cluster-aware and can also provide one form or another of failover. For example, Novell, HP-UX and IBM AIX can all have a cluster with failover between identical computer systems. Still other proprietary systems exist that provide some type of clustering and/or failover, including an IBM product called NUMAQ (originally developed by Sequent), in which processes are distributed over multiple processors; VIA, also known as Virtual Interface Architecture; and Wolfpack, a Microsoft product.
The effect of a system failure at many large companies and banks would be catastrophic. It has been suggested that it would take many organizations one, two or more days to recover from a significant failure, if such recovery were even possible. There are presently no systems that can provide clusters with failover between non-homogeneous computing systems. Moreover, there is a need for a computing environment that has a user-selectable reliability.
A fault-tolerant, state-compatible computing system and method have a user interface and a data center. The user interface accesses a first network, and the data center is connected to the first network by a plurality of communication links. The data center includes an extreme server; a plurality of application servers capable of running at least one application; and a second network having two or more communication links between the extreme server and the application servers. Using the communication protocols of the first network, the user interface is able to operate a first application executing on a first application server; and the extreme server maintains a second application executing on a second application server in a state-compatible condition to be used in the event that the user interface is no longer able to operate the first application.
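As a rough illustration of what keeping a second application in a state-compatible condition might look like, the sketch below forwards every user operation to both application servers and falls back to the second when the first is unreachable. The summary above does not specify this mechanism; the `AppServer` and `Coordinator` classes, and the forward-to-both strategy, are hypothetical, with `Coordinator` loosely standing in for the role of the extreme server.

```python
class AppServer:
    """Hypothetical application server that records the operations it has handled."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.state = []

    def handle(self, op):
        if not self.alive:
            raise ConnectionError(self.name)
        self.state.append(op)
        return len(self.state)


class Coordinator:
    """Forwards each operation to both servers so the second stays state-compatible."""

    def __init__(self, first, second):
        self.first = first
        self.second = second

    def submit(self, op):
        standby_result = self.second.handle(op)  # keep the second application current
        try:
            return self.first.handle(op)
        except ConnectionError:
            # The first application is unreachable; answer from the second instead.
            return standby_result


first, second = AppServer("first"), AppServer("second")
coordinator = Coordinator(first, second)
r1 = coordinator.submit("open-account")
first.alive = False                       # simulate loss of the first application
r2 = coordinator.submit("deposit")
```

Because the second server has seen every operation, it can answer the request that the first server could not, without restarting the application.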