1. Field
The present disclosure relates generally to distributed processing systems, and more particularly, to systems and techniques for recovering from system failures.
2. Background
Computers and other modern processing systems have revolutionized the electronics industry by enabling complex tasks to be performed with just a few strokes of a keypad. These processing systems have evolved from simple self-contained computing devices, such as the calculator, to highly sophisticated distributed processing systems. Today, almost every aspect of our daily lives involves, in some way, distributed processing systems. In its simplest form, a distributed processing system may be thought of as an individual desktop computer capable of supporting two or more simultaneous processes, or a single process with multiple threads. On a larger scale, a distributed processing system may comprise a network with a mainframe that allows hundreds, or even thousands, of individual desktop computers to share software applications. Distributed processing systems are also being used today to replace traditional supercomputers, with any number of computers, servers, processors or other components being connected together to perform specialized applications that require immense amounts of computation. The Internet is another example of a distributed processing system, with a host of Internet servers providing the World Wide Web.
As we become more dependent upon distributed processing systems in our daily lives, it becomes increasingly important to guard against system failures. A system failure can be at the very least annoying, but in other circumstances could lead to catastrophic results. For the individual desktop computer, a system failure can result in the loss of work product and the inconvenience of having to reboot the computer. In larger systems, system failures can be devastating to the business operations of a company or the personal affairs of a consumer.
A number of system recovery techniques are employed today to minimize the impact of system failures. One such technique involves “checkpointing” and “rollback recovery.” During normal operation, each of a computer's processes saves a snapshot of its states, called a “checkpoint,” to stable storage. When a failure occurs, a rollback recovery program may retrieve a set of saved checkpoints. The failed process can then roll back to the corresponding retrieved checkpoint and resume execution from there. A checkpoint library comprising a collection of precompiled routines may be implemented in a distributed processing system to support checkpoint and rollback recovery programs. Checkpoint libraries may be particularly useful for storing frequently used routines because they do not need to be explicitly linked to every program that uses them. Instead, a linker automatically looks in libraries for routines that it does not find elsewhere.
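The checkpoint and rollback cycle described above can be illustrated with a minimal sketch. It is not taken from any particular checkpoint library: the class `CheckpointedProcess`, its methods, and the use of an in-memory byte string as a stand-in for stable storage are all assumptions made for illustration.

```python
import pickle


class CheckpointedProcess:
    """Toy process whose state can be saved and restored (hypothetical)."""

    def __init__(self):
        self.state = {"step": 0, "total": 0}
        # Stand-in for stable storage; a real system would write to disk
        # or another medium that survives the failure.
        self._stable_storage = None

    def checkpoint(self):
        # Save a snapshot of the current state.
        self._stable_storage = pickle.dumps(self.state)

    def rollback(self):
        # Restore the most recently saved snapshot.
        if self._stable_storage is None:
            raise RuntimeError("no checkpoint available")
        self.state = pickle.loads(self._stable_storage)

    def work(self, value):
        self.state["step"] += 1
        self.state["total"] += value


proc = CheckpointedProcess()
proc.work(10)
proc.work(20)
proc.checkpoint()            # snapshot taken at step=2, total=30
proc.work(999)               # progress made after the checkpoint
proc.state["total"] = -1     # simulate a failure that corrupts the state
proc.rollback()              # resume from the saved checkpoint
```

After the rollback, the work performed since the last checkpoint is lost, which is why checkpoint frequency is a trade-off between overhead and the amount of work that must be redone.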
Checkpoint libraries are typically linked at runtime with the applications they monitor. During the course of monitoring an application, and compiling meta-data necessary for taking the next checkpoint, a checkpoint library will need to dynamically allocate memory to store the meta-data. In a traditional system, the application, the checkpoint library and all other libraries loaded at runtime share one process address space and therefore a single heap, which is the source of dynamically allocated memory. Unfortunately, applications and libraries sometimes contain software errors that affect the handling of dynamically allocated memory. For example, in a common error known as an “overflow,” an application allocates a section of memory space and attempts to modify it but, due to an arithmetic error in the application's code, modifies an address outside of the allocated memory space. Since all dynamic memory is allocated from the same heap, the inadvertently modified memory address may already be in use by the application or another library. If so, this unexpected modification can alter execution of the application or other library, causing serious problems for system operation.
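The overflow scenario can be simulated safely with a toy model of a shared heap. The sketch below is an assumption-laden illustration, not a real allocator: `SharedHeap` is a hypothetical bump allocator over a single `bytearray`, and the off-by-one loop bound plays the role of the arithmetic error.

```python
class SharedHeap:
    """Hypothetical bump allocator; stands in for a process-wide heap."""

    def __init__(self, size):
        self.memory = bytearray(size)
        self.next_free = 0

    def alloc(self, nbytes):
        # Hand out the next nbytes of the heap and return their offset.
        if self.next_free + nbytes > len(self.memory):
            raise MemoryError("heap exhausted")
        offset = self.next_free
        self.next_free += nbytes
        return offset


heap = SharedHeap(64)
app_buf = heap.alloc(8)   # block owned by the application
lib_buf = heap.alloc(8)   # adjacent block owned by another library
heap.memory[lib_buf:lib_buf + 8] = b"LIBDATA!"

# Buggy application code: the loop bound is off by one, so 9 bytes are
# written into an 8-byte block and the last write lands in lib_buf.
for i in range(8 + 1):
    heap.memory[app_buf + i] = 0x41  # ASCII 'A'

lib_contents = bytes(heap.memory[lib_buf:lib_buf + 8])
```

Because both blocks come from the same heap, the stray write silently changes the first byte of the library's data. Neither the application nor the library receives any error at the moment of corruption.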
Typical debugging processes may not flush out such programming errors. For example, if an overflow error within a system library modifies only unused memory blocks or blocks that an application is no longer using, the system library and application may operate without being affected by the error. However, when a checkpoint library is added to the system, the error may cause trouble if routines within the checkpoint library happen to use the memory blocks that are modified by the overflow error in the system library. The checkpoint library memory allocations, when interleaved with system library memory allocations, create a new allocation pattern that may turn the previously harmless error into memory corruption that renders the checkpoint library unusable. A user of the system may attribute the trouble to the checkpoint library, even though the error is in the system library, because the error did not surface until the checkpoint library was added to the system. People may stop using the checkpoint library because they mistakenly perceive it to be the root of the new troubles. Checkpoint library providers may be viewed unfavorably, even though their libraries contain no errors and would work perfectly if the system library did not have errors. Ultimately, checkpoint library developers may gain a poor reputation and be unable to effectively promote their products.
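The way a changed allocation pattern can expose a latent error may be sketched as follows. The function `run`, the block sizes, and the 4-byte overrun are hypothetical choices for illustration; the heap is again modeled as a single `bytearray` with a simple bump allocator.

```python
def run(with_checkpoint_library):
    """Return True if no in-use block was corrupted (hypothetical model)."""
    memory = bytearray(32)   # toy shared heap
    next_free = 0

    def alloc(nbytes):
        # Bump allocation: each block directly follows the previous one.
        nonlocal next_free
        offset = next_free
        next_free += nbytes
        return offset

    sys_buf = alloc(8)       # block owned by the system library

    meta = None
    if with_checkpoint_library:
        # The checkpoint library's meta-data lands right after sys_buf.
        meta = alloc(8)
        memory[meta:meta + 8] = b"CKPTMETA"

    # Latent bug in the system library: it clears 12 bytes of an
    # 8-byte block, overrunning the block by 4 bytes.
    for i in range(8 + 4):
        memory[sys_buf + i] = 0

    if meta is None:
        # The overrun hit unused heap memory, so nothing observes it.
        return True
    return bytes(memory[meta:meta + 8]) == b"CKPTMETA"


without_library = run(False)   # error present but harmless
with_library = run(True)       # same error now corrupts the meta-data
```

The system library's code is identical in both runs; only the presence of the checkpoint library's allocation changes whether the overrun lands on memory that is in use.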