Network servers coupled with client computing devices are increasingly being arranged to host multiple isolated systems (“containers”) that are supported by a single operating system on a single hosting server or hosting computing platform. These containers may be arranged to execute applications responsive to receiving requests from one or more clients. When high availability is desired for applications executed by containers hosted by servers, a primary container (PCN) and a secondary container (SCN) may each be hosted on separate servers or nodes (e.g., within a data center) and their operational states may be replicated. This replication of operational states may provide an application-agnostic, software-implemented hardware fault tolerance solution for “non-stop service”. The fault tolerance solution may allow the SCN to take over (failover) when the server hosting the PCN suffers a hardware failure and/or the PCN enters a fail state.
Lock-stepping is a fault tolerance solution that may replicate PCN/SCN operational states per instruction. For example, the PCN and the SCN may execute an application in parallel for deterministic instructions, but execute in lock-step for non-deterministic instructions. However, lock-stepping may suffer from very large overhead in multiprocessor (MP) implementations, where each memory access might be non-deterministic.
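The per-instruction behavior described above can be illustrated with a minimal sketch. It assumes a toy instruction set in which `add` is deterministic and `rand` is non-deterministic; the `Cpu` class, the `run_lockstep` function and the instruction names are hypothetical and are not taken from any real lock-stepping implementation.

```python
import random
from dataclasses import dataclass, field

# Hypothetical toy instruction set: "add" is deterministic, "rand" is not.
NON_DETERMINISTIC = {"rand"}


@dataclass
class Cpu:
    """Toy execution context whose whole operational state is one accumulator."""
    acc: int = 0
    rng: random.Random = field(default_factory=random.Random)

    def execute(self, op: str, arg: int) -> int:
        if op == "add":
            self.acc += arg
        elif op == "rand":
            self.acc = self.rng.randrange(arg)
        else:
            raise ValueError(f"unknown op: {op}")
        return self.acc


def run_lockstep(program, pcn: Cpu, scn: Cpu) -> int:
    """Run the program on both replicas; return how many lock-step syncs occurred."""
    syncs = 0
    for op, arg in program:
        result = pcn.execute(op, arg)
        if op in NON_DETERMINISTIC:
            # Lock-step: the SCN adopts the PCN's result rather than
            # computing its own, possibly different, value.
            scn.acc = result
            syncs += 1
        else:
            # Deterministic instructions run independently on each replica.
            scn.execute(op, arg)
    return syncs
```

In this sketch every non-deterministic instruction costs one synchronization, which suggests why overhead may grow quickly when most memory accesses have to be treated as non-deterministic.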
Checkpointing is another fault tolerance solution that replicates an operational state of the PCN to the SCN at periodic epochs. For checkpointing, in order to guarantee a successful failover, all output packets may need to be buffered until a checkpoint action has completed successfully. In a container environment, this output packet buffering, together with frequent checkpoint actions, may lead to extra network latency and overhead.
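As a rough illustration of why checkpointing adds latency, the following sketch holds every output back until the next checkpoint has copied the PCN state to the SCN. The `Container` and `CheckpointedPair` classes are hypothetical, and the operational state is reduced to a single counter for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """Toy container whose whole operational state is one counter."""
    state: int = 0

    def handle(self, request: int) -> str:
        self.state += request
        return f"total={self.state}"


@dataclass
class CheckpointedPair:
    primary: Container = field(default_factory=Container)
    secondary: Container = field(default_factory=Container)
    pending: list = field(default_factory=list)   # outputs held back
    released: list = field(default_factory=list)  # outputs sent to the client

    def handle(self, request: int) -> None:
        # The output is buffered rather than sent, because the SCN has
        # not yet been brought up to the state that produced it.
        self.pending.append(self.primary.handle(request))

    def checkpoint(self) -> None:
        # Replicate the PCN state to the SCN, then release buffered outputs.
        self.secondary.state = self.primary.state
        self.released.extend(self.pending)
        self.pending.clear()
```

Under these assumptions a client sees no response until `checkpoint()` runs, so shortening the epoch reduces latency only at the cost of more frequent checkpoint actions.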
COarse-grain LOck-stepping (COLO) is yet another fault tolerance solution, in which both the PCN and the SCN are fed the same request/data (input) network packets from a client. Logic and/or features supporting COLO may be capable of monitoring output responses of the PCN and SCN and considering the SCN's operational state a valid replica of the PCN's operational state as long as network responses (output) generated by the SCN match those of the PCN. If a given network response does not match, transmission of the network response to the client is withheld until the PCN operational state has been synchronized to the SCN operational state by forcing a new checkpoint action. Hence, this type of COLO procedure may ensure that a fault tolerant system is highly available via failover to the SCN. This high availability may exist even though non-determinism may mean that the SCN's internal operational state is momentarily different from the PCN's operational state. The SCN's operational state may appear equally valid and remain consistent from the point of view of observers external to the fault tolerant system that implements a COLO procedure. Thus, COLO procedures may have advantages over pure lock-stepping or checkpointing fault tolerance solutions.
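The compare-and-synchronize behavior of a COLO procedure can be sketched as follows. This is a simplified model under stated assumptions, not an actual COLO implementation: the `Replica` and `ColoPair` classes and the `key=value` request format are hypothetical, and a forced checkpoint is modeled as a plain copy of the PCN state.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Toy container: "key=value" stores a value, "key" reads it back."""
    state: dict = field(default_factory=dict)

    def handle(self, request: str) -> str:
        key, _, value = request.partition("=")
        if value:
            self.state[key] = value
        return f"{key}:{self.state.get(key, '')}"


@dataclass
class ColoPair:
    pcn: Replica = field(default_factory=Replica)
    scn: Replica = field(default_factory=Replica)
    forced_checkpoints: int = 0

    def serve(self, request: str) -> str:
        # Both containers are fed the same input packet.
        primary_out = self.pcn.handle(request)
        secondary_out = self.scn.handle(request)
        if primary_out != secondary_out:
            # Mismatch: the response is withheld until a forced checkpoint
            # has copied the PCN state to the SCN.
            self.scn.state = copy.deepcopy(self.pcn.state)
            self.forced_checkpoints += 1
        # Matching outputs, or a completed checkpoint, allow release.
        return primary_out
```

Note that in this sketch the SCN may hold internal state that differs from the PCN's without triggering a checkpoint, as long as the responses visible to the client still match.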
COLO fault tolerance solutions may take advantage of protocols such as those associated with the transmission control protocol (TCP) stack. The TCP stack may be arranged to maintain an operational state for each TCP connection between applications at a server and a client, and may be capable of recovering from packet loss and/or packet re-ordering. However, unlike virtual machine implementations, comparing the outputs of PCNs and SCNs may not need to rely on outputted response packets that have been processed through a TCP stack.