The disclosure is generally directed to checkpoint-based high-availability systems and, more particularly, to logging addresses of high-availability data via a non-blocking channel.
Computing may be thought of in terms of an application and a supporting platform. A supporting platform typically includes a hardware infrastructure of one or more processor cores, input/output, memory, and fixed storage (the combination of which supports an operating system (OS), which in turn supports one or more applications). Applications may be thought of as self-contained bundles of logic that rely on core object files and related resource files. As computing has become integral to modern industry, applications have become co-dependent on the presence of other applications. That is, a requisite environment for an application includes not only an underlying OS and supporting hardware platform, but also other key applications.
Key applications may include application servers, database management servers, collaboration servers, and communicative logic commonly referred to as middleware. Given the complexity of application and platform interoperability, different combinations of applications executing in a single hardware platform can demonstrate differing degrees of performance and stability. Virtualization technology interjects a layer between a supporting platform and executing applications. From the perspective of business continuity and disaster recovery, virtualization provides the inherent advantage of environment portability. For example, moving an entire environment configured with multiple different applications may be as simple as moving a virtual image from one supporting hardware platform to another.
In general, more powerful computing environments can support the coexistence of multiple different virtual images while maintaining a virtual separation between the images. Consequently, a failure condition in one virtual image typically cannot jeopardize the integrity of other co-executing virtual images in the same hardware platform. A virtual machine monitor (VMM) or hypervisor manages the interaction between each virtual image and underlying resources provided by a hardware platform. A bare metal hypervisor runs directly on the hardware platform similar to how an OS runs directly on hardware. In contrast, a hosted hypervisor runs within a host OS. In either case, a hypervisor can support the operation of different guest OS images or virtual machine (VM) images. The number of VM images is limited only by the processing resources of a VM container that holds the VM images or the hardware platform.
Virtualization has proven especially useful for end-users that require separate computing environments for different types of applications that are deployed on a single hardware platform. For example, a primary OS native to one type of hardware platform may provide a virtualized guest OS that is native to a different hardware platform (so that applications requiring the presence of the guest OS can co-exist with other applications requiring the presence of the primary OS). In this case, an end-user is not required to provide separate computing environments to support different types of applications. That is, irrespective of the guest OS, access to underlying resources of the single hardware platform remains static.
Virtualized environments have been deployed to aggregate different interdependent applications in different VMs in composing application solutions. For example, an application server can execute within one VM while a database management server executes in a different VM and a web server executes in yet another VM. Each of the VMs can be communicatively coupled to one another in a secure network and any given deployment of the applications can be live migrated to a different deployment without interfering with the execution of the other applications in the other VMs. In a typical live migration, a VM can be moved from one host server to another host server in order to, for example, permit server maintenance or to permit an improvement in hardware support for the VM.
Checkpoint-based high-availability (HA) is a technique in which a VM running on a primary host machine mirrors its processor and memory state every period (e.g., 25 ms) onto a secondary host machine. The mirroring process typically includes: tracking changes to the memory and processor state of the primary VM; periodically stopping the primary VM; sending the changes over a network to the secondary host machine; waiting for the secondary host machine to acknowledge receipt of the memory and processor state update; and resuming the primary VM. The mirroring process ensures that the secondary host machine is able to resume the workload with minimal loss of service should the primary host machine suffer a sudden hardware failure. If the secondary host machine either detects that the primary host machine is not responding or receives an explicit notification from the primary host machine, the secondary host machine starts the mirrored version of the VM and the appearance to the outside world is that the VM seamlessly continued to execute across the failure of the primary host machine.
Although the checkpoint-based HA technique provides effective protection against hardware failure, the checkpoint-based HA technique does not protect against software failure. Because the state of the processor and memory of the primary VM is faithfully reproduced on the secondary host machine, if a software crash (for example, the de-reference of a null pointer) causes a failover to the secondary host machine, the VM resumes execution from the last checkpoint and, if the program execution is deterministic, the same error will occur. There are some constrained cases in which a VM may not crash if software failure triggered a failover. However, these cases are rare and rely more on luck than design. For example, a software bug that manifested as a race condition in which one processor could access data that was being modified by another processor might not occur when the workload was resumed on the secondary host machine, as by a fluke of scheduling the data may not end up being concurrently accessed.