Network servers coupled with client computing devices are increasingly being arranged to support or host virtual machine(s) (VMs) that enable multiple operating systems and/or applications to be supported by a single computing platform. Also, when high availability is desired for servers hosting VMs, a primary VM (PVM) and a secondary VM (SVM) may each be hosted on separate servers or nodes (e.g., within a data center) and their states may be replicated. This replication of states may provide for an application-agnostic, software-implemented hardware fault tolerance solution for “non-stop-service”. The fault tolerance solution may allow for the SVM to take over (failover) when the server hosting the PVM suffers a hardware failure and/or the PVM enters a fail state.
Lock-stepping is a fault tolerance solution that may replicate VM states per instruction. For example, the PVM and SVM may execute in parallel for deterministic instructions, but proceed in lock-step for non-deterministic instructions. However, lock-stepping may suffer from very large overhead when dealing with multiprocessor (MP) implementations, where each memory access might be non-deterministic.
Checkpointing is another fault tolerance solution that replicates a PVM state to the SVM at periodic epochs. For checkpointing, in order to guarantee a successful failover, all output packets may need to be buffered until a successful checkpoint has been completed. In a VM environment, this output packet buffering may lead to extra network latency, and frequent checkpoints may lead to extra overhead.
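The output buffering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not an implementation taken from any particular hypervisor: the class name `CheckpointedOutput` and its methods are hypothetical, and packets are modeled as plain byte strings.

```python
from collections import deque


class CheckpointedOutput:
    """Hypothetical model: hold output packets until their epoch is checkpointed."""

    def __init__(self):
        self.pending = deque()  # packets produced during the current epoch
        self.released = []      # packets actually sent to the client

    def emit(self, packet):
        # Output is buffered, not sent, while the epoch is still open.
        self.pending.append(packet)

    def checkpoint_succeeded(self):
        # The SVM now holds the state that produced these packets,
        # so they can be released in their original order.
        while self.pending:
            self.released.append(self.pending.popleft())

    def failover(self):
        # Packets from the unfinished epoch never reached the client,
        # so they are discarded and returned for inspection.
        dropped = list(self.pending)
        self.pending.clear()
        return dropped
```

In this sketch every packet waits for the end of its epoch before release, which is the source of the added latency, and shortening the epoch to reduce that wait increases the number of checkpoints.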
COarse-grain LOck-stepping (COLO) is yet another fault tolerance solution, in which both the PVM and the SVM are fed the same request/data (input) network packets from a client. Logic supporting COLO may be capable of monitoring output responses of the PVM and SVM and may consider the SVM's state a valid replica of the PVM's state as long as network responses (output) generated by the SVM match those of the PVM. If a given network response does not match, transmission of the network response to the client is withheld until the PVM state has been synchronized to the SVM (i.e., a new checkpoint has been forced). Hence, COLO may ensure that a fault tolerant system is highly available via failover to the SVM. This high availability may exist even though non-determinism may mean that the SVM's internal state differs from that of the PVM: the SVM's state is equally valid and remains consistent from the point of view of external observers of the fault tolerant system that implements COLO. Thus, COLO may have advantages over pure lock-stepping or checkpointing fault tolerance solutions.
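The compare-then-release behavior described above can be sketched as follows. This is a simplified illustration, assuming that responses arrive as two equal-length, already-paired sequences; the function name `colo_release` and the `force_checkpoint` callback are hypothetical.

```python
def colo_release(pvm_responses, svm_responses, force_checkpoint):
    """Return the responses released to the client.

    force_checkpoint is a hypothetical callback that synchronizes the
    PVM state to the SVM; it is called once per mismatching response.
    """
    released = []
    for pvm_pkt, svm_pkt in zip(pvm_responses, svm_responses):
        if pvm_pkt != svm_pkt:
            # Outputs diverged: withhold the response until the SVM
            # has been resynchronized from the PVM.
            force_checkpoint()
        # Either the outputs matched or the SVM was just resynchronized,
        # so the PVM response can now be released.
        released.append(pvm_pkt)
    return released
```

In this sketch a checkpoint is forced only when outputs diverge, rather than at every epoch as in pure checkpointing.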
COLO fault tolerance solutions may take advantage of such protocols as those associated with the transmission control protocol (TCP) stack. The TCP stack may be arranged to have a state per connection and may be capable of recovering from packet loss and/or packet re-ordering. COLO may include use of a per-TCP connection response packet comparison. The per-TCP connection response packet comparison may consider an SVM state as a valid replica if the response packets output from the PVM on each TCP connection match the response packets output from the SVM on that same connection. This matching holds regardless of how packets are ordered across different TCP connections.
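The per-TCP connection comparison can be sketched as follows. This is a minimal illustration, assuming each response packet is modeled as a `(connection_id, payload)` tuple; the function names and the packet representation are hypothetical, and a real comparison would operate on TCP segments rather than whole payloads.

```python
from collections import defaultdict


def group_by_connection(packets):
    """Split a packet sequence into per-connection payload lists, preserving order."""
    streams = defaultdict(list)
    for conn_id, payload in packets:
        streams[conn_id].append(payload)
    return dict(streams)


def svm_is_valid_replica(pvm_packets, svm_packets):
    """Compare outputs connection by connection, ignoring cross-connection ordering."""
    return group_by_connection(pvm_packets) == group_by_connection(svm_packets)
```

For example, if the PVM outputs `[("c1", b"a"), ("c2", b"b")]` and the SVM outputs `[("c2", b"b"), ("c1", b"a")]`, the two sequences differ as a whole but match on every connection, so the sketch treats the SVM as a valid replica.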