Server-based application programs, often run at data centers operated by large “cloud” infrastructure providers, implement a variety of services in common use by millions of people every day, from e-commerce-related services to social media applications to e-government portals. The complexity of the application programs, which in turn is at least partially correlated with the extent to which various distributed and decentralized computation models are used, has grown substantially in recent years. Very large amounts of data are handled in these complex application environments, and their data set sizes continue to grow rapidly. The task of managing such applications, including implementing efficient recovery techniques to respond to the failures that are inevitably experienced from time to time in large scale information technology infrastructures, has become increasingly difficult.
In order to increase the overall reliability of application programs, various technologies have been developed over the years to recover more quickly from service interruptions. One way to minimize these service interruptions is to periodically save application and/or operating system state information in a persistent repository, and to read the state information from the repository to recover the state subsequent to restart. However, saving state to (and recovering state from) many types of storage devices may involve substantial performance overhead. In some cases recovery mechanisms may take so long that the probability of multiple cascading failures (additional failures before recovery from a first failure has been completed) may rise to unacceptable levels.
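By way of illustration only, the periodic save-and-restore approach described above may be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation of any particular embodiment: the function names `save_checkpoint`, `load_checkpoint` and `run`, the JSON file format, and the step-counting workload are all hypothetical and chosen for brevity. The explicit `os.fsync` call is one example of the storage-device overhead noted above.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write state to path atomically so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # force the data out to the storage device
        os.replace(tmp_path, path)  # atomically replace the old checkpoint
    except BaseException:
        # Clean up the temporary file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the last saved state, or default if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, checkpoint_every):
    """Hypothetical workload: count to total_steps, resuming after a restart."""
    state = load_checkpoint(path, {"step": 0})
    while state["step"] < total_steps:
        state["step"] += 1  # stand-in for one unit of application work
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)  # persist the final state
    return state
```

In this sketch, a smaller `checkpoint_every` value reduces the amount of work repeated after a restart but increases the number of synchronous writes, which reflects the trade-off between recovery time and performance overhead discussed above.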
Recovery from failures that affect distributed applications is notoriously complex, especially when timing-related defects are involved. Such defects are hard to reproduce and debug, particularly when the communications between multiple participating processes (e.g., participants in a distributed messaging protocol) are asynchronous in nature, and at least some of the inter-process communication messages may be lost due to the failures. The instrumentation mechanisms typically included in many applications and operating systems, such as logging messages at various levels of detail, may sometimes be insufficient for effective debugging, particularly because some of the most relevant logging data may be lost during failure events.
While embodiments are described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that embodiments are not limited to the embodiments or drawings described. It should be understood that the drawings and detailed description thereto are not intended to limit embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include,” “including,” and “includes” mean including, but not limited to.