This invention relates to fault tolerant computing, and in particular to software fault tolerant computing.
Many different approaches to fault-tolerant computing are known in the art. Fault tolerant computing is typically based on replicating components and ensuring equivalent operation between the replicas. A brief outline of the advantages and disadvantages of some of the known choices is given below.
Fault-tolerant mechanisms can be implemented by replicating hardware, for example by providing multiple processors with the same software operating on each of the processors. The replicated software is arranged to operate in lockstep during normal operation and a mechanism is provided to detect a failure of lockstep. The advantages of such an approach are fast detection and masking of failures, with fault-tolerance which is transparent to software. There are also some disadvantages of such systems. For example, they are difficult to develop and upgrade. Also, they inherently have "wasted" hardware resources. Moreover, the system does not tolerate software failures. Also, as only very restricted knowledge about the software is available to the fault tolerant mechanisms, this may cause some inefficiency. For example, it is difficult to know precisely which parts of memory have to be saved and/or restored, with the result that conservative decisions are made about the necessary actions to be performed.
Similar advantages and disadvantages exist where fault tolerant mechanisms are implemented within a hypervisor (a software layer between hardware and OS), or even within an OS. Although more knowledge about software applications may be available in these cases, it is still very restricted, and a fault at the application level can still cause correlated failures of all the replicated units, which cannot be detected/masked by the fault tolerant mechanisms.
When the chosen fault tolerant mechanisms are placed in user-space, but below the applications (for example, in libraries, daemons, etc.) they are easier to implement and have increased coverage of software failures. There are disadvantages inherent in such approaches as well. For example, potential inefficiencies are related to failure detection and masking. There are higher overheads in normal operation. Also, only partial transparency for the applications is normally provided. The level of transparency varies between different approaches. For example, they often force the users to use a particular programming paradigm, which may not be the most appropriate for some applications.
Fault tolerant mechanisms can also be implemented in applications. This gives the fault tolerant mechanisms full knowledge of the applications, but with the loss of any transparency. Moreover, such mechanisms are not reusable, and it is hard to make them efficient and reliable each time (for every specific application).
An example of a re-usable, user-level approach to software fault-tolerance is described in an article entitled "TFT: A software system for Application Transparent Fault Tolerance" by T. C. Bressoud from "The 28th Annual International Symposium on Fault-Tolerant Computing, June 1998". The article describes an arrangement of a software layer (Transparent Fault Tolerance layer, or TFT layer) between an operating system and applications that implements a fault tolerant mechanism. This is based on an earlier work by the same author entitled "Building a Virtually Fault Tolerant System", PhD Cornell University, May 1996, where the same approach for fault-tolerance was applied at the hypervisor level.
A TFT layer provides an interface that appears to an application to be identical to that of the underlying OS. The TFT layer implements primary-backup replication, resolves input value non-determinism, handles asynchronous actions, and suppresses duplicate outputs. Failure detection is based on message acknowledgements and time-outs. TFT does not attempt to integrate failure detection and masking with the corresponding language-level constructs. The TFT layer intercepts system calls made by the applications and asynchronous exceptions raised by the OS and, after some processing, decides whether to forward them to the OS or the application respectively. The non-deterministic system calls are performed only by a primary replica, which sends the results to the secondary replica. This solves the problem of non-deterministic input values.
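The handling of non-deterministic input described above can be illustrated with a minimal Java sketch. It is not taken from TFT: the class `InputReplica`, the method `intercept`, and the in-memory queue standing in for the primary-to-backup link are all hypothetical, and error handling is reduced to a single exception.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.LongSupplier;

// Hypothetical replica wrapper; none of these names come from TFT itself.
final class InputReplica {
    private final boolean primary;
    private final Queue<Long> link;      // stands in for the primary-to-backup channel
    private final LongSupplier realCall; // the underlying non-deterministic call

    InputReplica(boolean primary, Queue<Long> link, LongSupplier realCall) {
        this.primary = primary;
        this.link = link;
        this.realCall = realCall;
    }

    // Intercepts a non-deterministic call on behalf of the application.
    long intercept() {
        if (primary) {
            long value = realCall.getAsLong(); // only the primary performs the call
            link.add(value);                   // forward the result to the backup
            return value;
        }
        Long value = link.poll();              // the backup reuses the primary's result
        if (value == null) {
            throw new IllegalStateException("no result received from primary");
        }
        return value;
    }
}

final class InputReplicaDemo {
    public static void main(String[] args) {
        Queue<Long> link = new ArrayDeque<>();
        InputReplica primary = new InputReplica(true, link, System::nanoTime);
        InputReplica backup = new InputReplica(false, link, System::nanoTime);
        // Both replicas observe the same value although the clock keeps moving.
        System.out.println(primary.intercept() == backup.intercept());
    }
}
```

In a real system the missing result would be detected by a time-out rather than an immediate exception, which is where failure detection comes in.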
In order to solve the problem of asynchronous actions raised by the operating system, TFT uses the concept of epochs. An epoch is a fixed-length sequence of actions excluding asynchronous actions. Computations by both the primary replica and the backup replica are divided into the same sequence of epochs. The TFT layer is responsible for dividing computations into epochs and for co-ordinating the epochs of the primary replica and the backup replica. This is done using object code editing, whereby application binaries are modified by adding code for incrementing an epoch counter and for passing control to the TFT layer at epoch boundaries.
A similar technique for managing intervals of control flow is proposed in an article by J. H. Sly and E. N. Elnozahy, entitled "Supporting Non-deterministic Execution in Fault-tolerant Systems", from a Report CMU-CS-96-120, School of Computer Science Carnegie Mellon University, May 1996, and an article by J. H. Sly and E. N. Elnozahy entitled "Support for Software Interrupts in Log-Based Rollback-Recovery", from IEEE Transactions on Computers, Vol. 47, No. 10, October 1998.
Intercepted asynchronous actions are buffered locally by the primary replica, and are forwarded to the secondary replica. They are delivered in the same order at both primary and secondary and at the same points in the control flow, which is at the epoch boundary.
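The epoch mechanism can be sketched as follows. This is an illustrative Java model only: the class `EpochReplica`, the epoch length of three actions, and the string-based trace are assumptions made for the example, and the real TFT layer works by editing object code rather than by explicit method calls.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of one replica; names and epoch length are assumptions.
final class EpochReplica {
    static final int EPOCH_LENGTH = 3;                       // actions per epoch (assumed)
    private final List<String> pending = new ArrayList<>();  // buffered asynchronous actions
    private final List<String> trace = new ArrayList<>();    // what the application observes
    private int actionsInEpoch = 0;
    private int epoch = 0;

    // An asynchronous action is buffered, never delivered immediately.
    void asyncArrived(String action) {
        pending.add(action);
    }

    // Executes one ordinary action and checks for an epoch boundary.
    void step(String action) {
        trace.add(action);
        if (++actionsInEpoch == EPOCH_LENGTH) {
            endEpoch();
        }
    }

    // At the boundary, buffered asynchronous actions are delivered in order.
    private void endEpoch() {
        for (String a : pending) {
            trace.add("async:" + a);
        }
        pending.clear();
        actionsInEpoch = 0;
        epoch++;
    }

    List<String> trace() { return trace; }
    int epoch() { return epoch; }
}

final class EpochReplicaDemo {
    public static void main(String[] args) {
        EpochReplica primary = new EpochReplica();
        EpochReplica backup = new EpochReplica();
        // The same signal arrives at different points in each replica's epoch.
        primary.step("a"); primary.asyncArrived("sig"); primary.step("b"); primary.step("c");
        backup.asyncArrived("sig"); backup.step("a"); backup.step("b"); backup.step("c");
        System.out.println(primary.trace().equals(backup.trace()));
    }
}
```

Because delivery is deferred to the boundary, the two replicas see the asynchronous action at the same point in their control flow regardless of when it physically arrived within the epoch.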
The backup replica can detect that the primary replica has failed when either it does not receive the result of a non-deterministic system call, or it does not receive an end of epoch message. In either case, the backup becomes the new primary and starts performing the non-deterministic system calls and delivering asynchronous actions at the epoch boundaries. At the promotion point there is some uncertainty about how far the old primary replica will have got in its computation before the failure happened. It might have performed some output actions, or received some asynchronous exceptions, and not have had time to communicate this to the backup. This can cause problems, as the failure now becomes non-transparent to the environment. In order to alleviate this problem the primary replica performs a "stability query" immediately before performing any output action. This is a blocking operation that allows the primary to continue only when it is known that the backup has received all the previously sent messages. This however does not completely solve the problem: there is still some uncertainty about the last output action, and about possible asynchronous actions received by the old primary before it failed (note that such an action was possibly an acknowledgement of a previous output action). Depending on the semantics of the specific uncertain action, there may be a solution in some cases (specifically for idempotent actions and those actions that allow TFT to ask the environment about their status). In other cases the only solution is to return an error code to the application which should indicate that there is uncertainty about the action's execution status.
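The stability query can be illustrated with a small Java sketch. The class `StablePrimary` and its methods are hypothetical, and the sketch deliberately replaces the blocking operation described above with a non-blocking check that refuses the output until all previously sent messages have been acknowledged.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical primary-side bookkeeping; a real stability query would block.
final class StablePrimary {
    private int sent = 0;    // messages sent to the backup
    private int acked = 0;   // messages the backup has acknowledged
    private final List<String> emitted = new ArrayList<>();

    // Records that a message was sent; the payload is ignored in this sketch.
    void sendToBackup(String message) {
        sent++;
    }

    // Called when an acknowledgement arrives from the backup.
    void onAck() {
        if (acked < sent) {
            acked++;
        }
    }

    // True only when the backup holds every message sent so far.
    boolean stabilityQuery() {
        return acked == sent;
    }

    // Performs the output only if the stability query succeeds.
    boolean tryOutput(String output) {
        if (!stabilityQuery()) {
            return false;
        }
        emitted.add(output);
        return true;
    }

    List<String> emitted() { return emitted; }
}

final class StablePrimaryDemo {
    public static void main(String[] args) {
        StablePrimary primary = new StablePrimary();
        primary.sendToBackup("result of a non-deterministic call");
        System.out.println(primary.tryOutput("reply"));  // backup has not acknowledged yet
        primary.onAck();
        System.out.println(primary.tryOutput("reply"));  // now safe to emit
    }
}
```

The sketch does not remove the residual uncertainty about the last output action: if the primary fails between a successful query and the output itself, the backup still cannot tell whether the output happened.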
Another interesting approach for software fault tolerance can be found at: www.omg.org/techprocess/meetings/schedule/Fault_Tolerance_RFP.html. This Internet site describes work in progress on a proposal for fault tolerant Corba (ftCorba) that allows for several kinds of replication (passive warm, passive cold, and active) for objects. Replicas are kept consistent and their state is updated despite asynchrony and failures. Object invocations and responses are contained in multicast messages that are totally ordered in a model of virtual synchrony. Also contained in these messages are state updates, failure notifications, and object group join and leave events. Applications can receive fault reports from the Failure Notification Service, but integration with the language-level support for failure detection and recovery (i.e., with exceptions) is limited, since exceptions are in general not channelled through a Failure Notification Service.
In passive replication, when the primary replica fails a new primary replica is elected and the most recent saved state of the old primary is applied to it (in warm replication this might have been done already). There is no support for virtualising and unvirtualising the input/output values. Also in passive replication, the passive replicas are dormant: if warm replication is used their state is updated, but otherwise they do not perform any actions.
In a related proposal by Eternal Systems Inc. and Sun Microsystems Inc. entitled "Proposal to OMG on Fault Tolerance", September 1998, a strong assumption is made that all application interactions with the application's environment are done as object invocations/responses, and that they all go through the multicast engine. All the objects (their methods) are assumed to be deterministic. This model is generally not appropriate for interactions between an application and the operating system or various non-object-oriented libraries. Similarly, although the proposal does provide suppression of duplicate invocations and responses, this is not enough if there are interactions with non-Corba services. It can be seen that, despite their considerable complexity, the ftCorba proposals, in general, do not cope with input non-determinism, suppression of duplicate outputs, and asynchronous external actions.
The two reports entitled "Somersault Software Fault Tolerance", Report HPL-98-06, HP Laboratories Bristol, January 1998 and "Somersault: Enabling Fault Tolerant Distributed Software Systems", Report HPL-98-81, HP Laboratories Bristol, by P Murray et al, describe Somersault, a library for providing increased availability to those applications that are required to be fault-tolerant. The implementation is based on a variant of primary-backup replication (the so-called primary-receiver secondary-sender approach) and relies on a reliable communication mechanism between the replicas.
In Somersault, the primary replica performs the non-deterministic events and forces the secondary replica to perform them in the same way (with the same value, in the same order). This is achieved by passing messages from the primary replica to the secondary replica through a log. Two kinds of events are distinguished: those initiated from outside (e.g., message received, timer expired, thread scheduled), and those initiated by the process (e.g., system calls). For the former, Somersault controls the order of delivery of these events to applications. For the latter, Somersault captures the result and injects it into the secondary replica. This is done with the application's help, that is, non-transparently. The only output actions allowed are message sends, and they have to go via Somersault.
If the primary replica fails, this will result in the loss of input links (from clients to primary replica) and some possible loss of messages that were in transit somewhere on the path: client-primary-secondary. The recovery procedure is then that the secondary replica has to reconnect and the remote side has to re-send (either of these may be non-transparent to clients). If the secondary replica fails, this will result in the loss of output links (from the secondary replica to client) and some possible loss of output messages. The recovery procedure is then that the primary replica has to reconnect and send messages from a re-send queue. Re-integration of a new secondary replica is done by state transfer and transfer of output links from the primary replica to the secondary replica. Applications provide save_state operations that are invoked by Somersault. There is no support for virtualisation of values.
Y. Huang and C. Kintala, in a work entitled "Software Fault Tolerance in the Application Layer", chapter 10 in a book edited by M. R. Lyu entitled "Software Fault Tolerance", Trends in Software series (3), John Wiley and Sons, 1995, describe support for software fault tolerance using primary-backup replication where a backup is passive until there is a take-over. There is support for checkpointing process state, logging of process messages, and replicated disk files. The framework performs failure detection using heartbeat messages. Recovery after a process failure consists of restoring the process state using the last checkpointed state, and replaying the logged messages that are applicable to this state.
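Recovery by checkpoint and message replay can be sketched in a few lines of Java. The class `CheckpointedProcess` is hypothetical and is not the framework's interface: process state is reduced to a single running sum, and the message log holds only the messages delivered since the last checkpoint.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical process whose state is the sum of the messages it has handled.
final class CheckpointedProcess {
    private long state = 0;
    private long checkpoint = 0;                                // last saved state
    private final List<Long> messageLog = new ArrayList<>();    // messages since the checkpoint

    // Logs the message before applying it, so it can be replayed later.
    void deliver(long message) {
        messageLog.add(message);
        state += message;
    }

    // Saves the state; earlier messages no longer need replaying.
    void takeCheckpoint() {
        checkpoint = state;
        messageLog.clear();
    }

    long state() { return state; }

    // Rebuilds the state as a backup would: checkpoint first, then replay the log.
    long recover() {
        long recovered = checkpoint;
        for (long message : messageLog) {
            recovered += message;
        }
        return recovered;
    }
}

final class CheckpointedProcessDemo {
    public static void main(String[] args) {
        CheckpointedProcess process = new CheckpointedProcess();
        process.deliver(1);
        process.deliver(2);
        process.takeCheckpoint();
        process.deliver(3);
        System.out.println(process.recover() == process.state());
    }
}
```

Replay reproduces the lost state only if message handling is deterministic, which is why non-deterministic inputs need separate treatment in the other approaches discussed here.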
K. H. Kim, in a work entitled "The Distributed Recovery Block Scheme", chapter 8 in the book edited by M. R. Lyu entitled "Software Fault Tolerance", Trends in Software series (3), John Wiley and Sons, 1995, describes distributed recovery blocks (DRB) integrated with the technique known as "pair of self-checking processing nodes" (PSP). This work has some similarities with TFT, but assumes that all input arrives in the same order and with the same values, over a multicast network, to both primary and backup. In DRB, a computation is done by repeating a cycle of: input/compute-and-test/output (multiple inputs/outputs are allowed in a single input/output phase respectively). The backup replica does not know what exactly a failed primary had done before failing. The primary replica has as its primary choice the first branch of the recovery block, while the backup replica has as its primary choice the second branch of the recovery block. It has been shown by F. Cristian in a work entitled "Exception Handling and Tolerance of Software Faults", chapter 4 in the book edited by M. R. Lyu entitled "Software Fault Tolerance", Trends in Software series (3), John Wiley and Sons, 1995, that an appropriately strengthened exception model can express the recovery block structure. Also, unlike recovery blocks, exceptions are supported by some of the mainstream programming languages.
U.S. Pat. No. 5,805,790 (Nota et al.) describes a fault recovery method for a multi-processor system including a number of real processors, a single host operating system and shared memory. Multiple virtual machines (VMs) execute on the real processors with the assignment of VMs to real processors being under the control of the host operating system. Optionally the real processors are partitioned into logical partitions by the host OS and are treated as independent computing units. The system aims to recover a VM from a failure of a processor, of a partition, or of a VM itself. However, to achieve this it requires the shared storage and a shared operating system and further requires hardware support for fault-detection and recovery, including fault recovery circuits. The method includes the setting of recovery attributes for failure of each of the VMs. The method also includes the storage in a real machine save area of main storage by one of the fault recovery circuits of data and status information on the occurrence of a fault, and the subsequent retrieval of the data and status information to recover from the fault.
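The relationship between recovery blocks and exceptions can be illustrated with a short Java sketch. It is a simplified, single-node illustration rather than the DRB scheme: the class `RecoveryBlock` and its `run` method are hypothetical, and the state restoration that a full recovery block performs before trying the next alternate is omitted because the alternates here have no side effects.

```java
import java.util.function.IntPredicate;
import java.util.function.IntSupplier;

// Hypothetical helper: tries each alternate until one passes the acceptance test.
final class RecoveryBlock {
    static int run(IntPredicate acceptanceTest, IntSupplier... alternates) {
        for (IntSupplier alternate : alternates) {
            try {
                int result = alternate.getAsInt();
                if (acceptanceTest.test(result)) {
                    return result;           // acceptance test passed
                }
                // Test failed: fall through to the next alternate.
            } catch (RuntimeException e) {
                // An exception in an alternate is treated like a failed test.
            }
        }
        throw new IllegalStateException("all alternates failed");
    }
}

final class RecoveryBlockDemo {
    public static void main(String[] args) {
        int result = RecoveryBlock.run(
                r -> r >= 0,                                   // acceptance test
                () -> { throw new ArithmeticException(); },    // first branch fails
                () -> 5);                                      // second branch succeeds
        System.out.println(result);
    }
}
```

In the distributed scheme the two nodes would start from different branches, so that one node's primary choice is the other node's fallback.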
An aim of the present invention is to provide an approach to fault tolerant computing that mitigates at least some of the disadvantages of the prior art.
Particular and preferred aspects of the invention are set out in the accompanying independent and dependent claims. Combinations of features from the dependent claims may be combined with features of the independent claims as appropriate and not merely as explicitly set out in the claims.
In accordance with one aspect of the invention, there is provided a fault tolerant computer system comprising a primary virtual machine (VM) and a secondary virtual machine (VM). The secondary virtual machine is operable to replicate operations of the primary virtual machine and the primary and the secondary virtual machines are further operable, or co-operate, mutually to provide fault tolerance.
An embodiment of the invention thus provides a new approach to providing software fault tolerance wherein both a unit of replication and a component that implements the fault tolerance mechanisms is a virtual machine (VM). An embodiment of the invention for providing transparent software fault tolerance can be described as "a replicated virtual machine" and will be referred to hereinafter as an "rVM". By replicating operations performed on the primary VM, the secondary VM can provide a replica of the primary VM. The primary and the secondary VMs co-operate to provide a mechanism for providing mutual fault tolerance (i.e. for providing fault tolerance between each other). For example, they can each be operable to test for equivalent operation of each other at predetermined stages of operation. With an embodiment of the invention, it is not necessary to provide a separate level of control, for example a common operating system with shared storage, to ensure fault tolerance as this is achieved by the replicated VMs themselves. Since a VM as used by the invention has full knowledge of the semantics of application-level code, fault tolerance mechanisms can be provided by the VMs without requiring any increase in application complexity.
An embodiment of the invention can enable co-ordination of replicated states and computations with some characteristics of both active and passive replication. Similar to active replication, the VM replicas can perform the same computation in parallel. However, the backup operations in the secondary VM replica can be delayed with respect to the primary's computation.
The present invention makes use of the high degree of integration with and knowledge about application code in a VM such as, for example, a Java VM. Further information about Java VM may be found, for example, in a book authored by T. Lindholm and F. Yellin and entitled "The Java Virtual Machine Specification", Addison Wesley, The Java Series 1999, the whole content of which is incorporated herein by reference. Such a VM forms a general interpretation and execution engine for application code. This execution engine has its own instruction set and its own memory. It logically lies directly under the application code itself (i.e., there is no operating system (OS), or some other software layer between the application code and VM which executes this code). An embodiment of the invention takes advantage of the fact that a virtual machine has full knowledge of the semantics of application level code that is being executed. This allows a tight integration between the fault tolerance mechanisms and the application code. It also allows appropriate processing of the application-level instructions that are related to input (reading from the environment), output (writing to the environment) and control and management of external (synchronous and asynchronous) actions.
The primary virtual machine can be operated on a first processing engine and the secondary virtual machine can be operated on a second processing engine. An exchange of data is provided between the processing engines via a link. Each of the primary and secondary virtual machines is operable to send a heartbeat message to the other of the primary and secondary virtual machines at intervals via the link. The heartbeat message indicates that the virtual machine which sends the heartbeat message is alive, and additionally can include status information.
A test for liveliness can be performed following receipt of a heartbeat message. Alternatively, or in addition, a test for liveliness can be performed in response to an input action. Alternatively, or in addition, a test for liveliness can be performed at an epoch boundary, wherein an epoch boundary forms a boundary between sections of code executed by the virtual machines.
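The heartbeat exchange can be sketched in Java as follows. The class `HeartbeatMonitor` is hypothetical, and the rule that a peer is presumed failed once no heartbeat has been seen for a fixed time-out is an assumption made for the example; the text above specifies only that heartbeats are sent at intervals. Timestamps are passed in explicitly so that the logic does not depend on a real clock.

```java
// Hypothetical monitor kept by one VM about the other VM of the pair.
final class HeartbeatMonitor {
    private final long timeoutMillis;   // assumed policy: silence longer than this means failure
    private long lastSeenMillis;
    private String lastStatus = "";

    HeartbeatMonitor(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastSeenMillis = nowMillis;
    }

    // Called when a heartbeat arrives over the link; status is optional extra data.
    void onHeartbeat(long nowMillis, String status) {
        lastSeenMillis = nowMillis;
        lastStatus = status;
    }

    // The peer is considered alive while the silence is within the time-out.
    boolean peerAlive(long nowMillis) {
        return nowMillis - lastSeenMillis <= timeoutMillis;
    }

    String lastStatus() { return lastStatus; }
}

final class HeartbeatMonitorDemo {
    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(500, 0);
        monitor.onHeartbeat(400, "epoch=7");
        System.out.println(monitor.peerAlive(800));   // 400 ms of silence
        System.out.println(monitor.peerAlive(1000));  // 600 ms of silence
    }
}
```

Each VM of the pair would hold one such monitor for its peer, since the heartbeats flow in both directions.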
A virtual machine, which is found to be in a fault state, can be terminated. The primary virtual machine can be operable to initiate a new secondary virtual machine where an existing secondary virtual machine is found to be in a fault state. Where an existing primary VM is found to be in a fault state, a secondary VM is promoted to become the new primary.
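The reaction to a detected fault can be summarised in a small Java sketch. The class `ReplicaGroup` and the string identifiers are hypothetical. After a promotion the sketch leaves the secondary slot empty, which is an assumption, since the text above does not say when a replacement secondary is started in that case.

```java
// Hypothetical bookkeeping for a primary/secondary pair of VMs.
final class ReplicaGroup {
    private String primary;
    private String secondary;   // null when no secondary is currently running
    private int spawned = 0;

    ReplicaGroup(String primary, String secondary) {
        this.primary = primary;
        this.secondary = secondary;
    }

    // Applies the fault policy to the VM that was found to be in a fault state.
    void onFault(String vm) {
        if (vm.equals(secondary)) {
            // Faulty secondary is terminated; the primary initiates a new one.
            secondary = "secondary-" + (++spawned);
        } else if (vm.equals(primary)) {
            // Faulty primary is terminated; the secondary is promoted.
            primary = secondary;
            secondary = null;   // assumption: a replacement is started separately
        }
    }

    String primary() { return primary; }
    String secondary() { return secondary; }
}

final class ReplicaGroupDemo {
    public static void main(String[] args) {
        ReplicaGroup group = new ReplicaGroup("vmA", "vmB");
        group.onFault("vmA");
        System.out.println(group.primary());
    }
}
```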
It should be noted that an embodiment of the invention may have more than one backup VM.
The invention also provides a computer program product operable when run on a computer to provide a virtual machine for a redundant fault tolerant virtual machine architecture that includes a second virtual machine. The virtual machine is operable to form a replica of the other virtual machine by replicating operations performed on the other virtual machine. The virtual machine is further operable to test for equivalent operation of the other virtual machine at predetermined stages of operation. The computer program product can be provided on a carrier medium, for example a computer readable medium (e.g., a disc or tape or other computer readable storage or memory medium) or a data transmission medium (e.g., a telephone line, electromagnetic signal or other transmission medium).
The invention also provides a method of providing software fault tolerance comprising the provision of replicated virtual machines including at least a primary and a secondary virtual machine, wherein the secondary virtual machine replicates operations performed on the primary virtual machine, and the primary and the secondary virtual machines co-operate so as mutually to provide fault tolerance.
In an embodiment of the invention, transparent fault tolerance can be provided for applications executed by an rVM. The interface between applications and the rVM can be identical to the interface between the applications and a non-replicated VM.
Support can be provided for both applications that require strong internal and external consistency, and for applications with relaxed consistency requirements. Internal consistency requires that the states of the replicas are the same, or appear to be the same as seen from their environment. Relaxed internal consistency applies this rule to some part of the state of the replicas. External consistency requires that the interactions between the replicas and their environment appear as if performed by a non-replicated entity. Relaxed external consistency applies this rule to a subset of the interactions between the replicas and their environment. An embodiment of the invention can be suitable for applications that require some critical actions to be performed even in the presence of component failures. It is to be noted that such applications could not use a technique such as a transaction mechanism, e.g., transactions that are based on: detect failure, abort action, do backward recovery. Although it is sometimes suggested that a transaction mechanism provides fault tolerance, in fact it provides concurrency control (it can allow multiple read/write operations to proceed in parallel with the effects being equivalent to a serial execution of the operations), and guarantees that the results of the operations/transactions persist (on disk or similar). A transaction mechanism does not actually tolerate failures, but simply detects failures and rolls back to a previous consistent state of data.
The failure detection and masking mechanisms in an example of an rVM in accordance with the invention can be integrated with corresponding application-level language constructs. For example, language constructs such as exceptions (e.g., try-catch-throw in Java) are used in an embodiment of the invention. Transparent detection and recovery for some failures can be provided. However, an application may want to do some application specific processing of some failure notifications, and some failures allow only application-level recovery. Implementing the fault tolerance mechanisms at the VM level makes it possible to co-ordinate the tasks performed at this level with the similar tasks performed at the application level.
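How a failure notification might surface through ordinary Java exception handling can be sketched as follows. The exception type `ReplicaFailureException` and the class `FailureAwareApp` are hypothetical names introduced for illustration; the text above states only that constructs such as try-catch-throw are used, not which exception types are involved.

```java
// Hypothetical exception a VM could raise when a failure cannot be masked.
final class ReplicaFailureException extends RuntimeException {
    ReplicaFailureException(String message) {
        super(message);
    }
}

final class FailureAwareApp {
    // Runs a critical action; application-specific recovery lives in the catch block.
    static String runCritical(Runnable action) {
        try {
            action.run();
            return "completed";
        } catch (ReplicaFailureException e) {
            // Only failures that need application-level recovery arrive here;
            // transparently masked failures never reach the application.
            return "recovered after: " + e.getMessage();
        }
    }
}

final class FailureAwareAppDemo {
    public static void main(String[] args) {
        System.out.println(FailureAwareApp.runCritical(() -> { }));
        System.out.println(FailureAwareApp.runCritical(() -> {
            throw new ReplicaFailureException("output status unknown");
        }));
    }
}
```

An application that does not care about such notifications simply omits the handler, so the interface to the VM stays the same as for a non-replicated VM.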