Software failures that occur in production are often difficult to reproduce due to differences between a production system and a development system. Reproducing a software failure is one of the most time-consuming and difficult steps in resolving a software problem. A variety of operating systems, corresponding libraries and their versions, application tiers supplied by different vendors, and network infrastructure with different configuration settings make application environments complex and software failures hard to reproduce.
The source of the problem might be an incorrect assumption implicitly made by the application about the availability or configuration of local services such as domain names, deployed software components, or library versions. Furthermore, non-deterministic factors such as timing and user inputs contribute to the difficulty of reproducing software failures. The common approach of conveying a failure through a failure report is often inadequate and time-consuming.
Some application vendors provide built-in support for collecting information when a failure occurs. Such information, however, is often limited in its ability to provide insight into the root cause of a problem because it represents the aftermath of the failure and not the steps that precede it. Other, more sophisticated facilities may provide more comprehensive data, including traces and internal application state. These facilities include program execution record and replay tools. However, indiscriminate recording and transfer of data impose additional data storage requirements.
Conventional record and replay techniques isolate the system calls made by an application during recording and feed the recorded results back to the application during replay. However, this simplistic model is often inadequate. When an application is being replayed, it relies on a variety of third-party libraries and an installed base. If the libraries needed by the application do not exist, or if the required libraries are installed but their versions are incompatible with the application, the replaying application might fail or diverge from its initial execution. Discrepancy in binaries is not limited to the auxiliary libraries used by the application. The versions of the installed application binaries themselves may be different, and hence the application would exhibit inconsistent behavior during replay.
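The conventional model described above can be illustrated with a minimal sketch in Python. It is not taken from any particular record and replay tool: the names `Recorder`, `Replayer`, `ReplayDivergence`, and `application` are hypothetical, and ordinary function calls stand in for intercepted system calls. During recording, each call is executed and its result is logged; during replay, the logged result is returned without touching the operating system, and any mismatch between the expected and the requested call is reported as a divergence.

```python
import os
from typing import Any, Callable, List, Tuple

# One log entry: (call name, arguments, recorded result).
LogEntry = Tuple[str, tuple, Any]


class ReplayDivergence(Exception):
    """Raised when the replayed execution departs from the recorded one."""


class Recorder:
    """Executes each call for real and appends its result to a log."""

    def __init__(self) -> None:
        self.log: List[LogEntry] = []

    def call(self, name: str, func: Callable[..., Any], *args: Any) -> Any:
        result = func(*args)
        self.log.append((name, args, result))
        return result


class Replayer:
    """Returns logged results in order instead of executing the calls."""

    def __init__(self, log: List[LogEntry]) -> None:
        self._log = list(log)
        self._pos = 0

    def call(self, name: str, func: Callable[..., Any], *args: Any) -> Any:
        if self._pos >= len(self._log):
            raise ReplayDivergence(f"unexpected extra call: {name}")
        rec_name, rec_args, result = self._log[self._pos]
        if (rec_name, rec_args) != (name, args):
            raise ReplayDivergence(
                f"expected {rec_name}{rec_args}, got {name}{args}"
            )
        self._pos += 1
        return result  # func is deliberately not invoked during replay


def application(env: Any) -> str:
    """Hypothetical application whose output depends on two system calls."""
    pid = env.call("getpid", os.getpid)
    cwd = env.call("getcwd", os.getcwd)
    return f"{pid}:{cwd}"
```

Running `application(Recorder())` and then `application(Replayer(recorder.log))` yields the same output, but only because the application code itself is identical in both runs. If a different version of the application or of a library issued a different sequence of calls, the sketch would raise `ReplayDivergence`, which is the failure mode the paragraph above describes.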
To avoid binary incompatibilities, some record and replay systems require that the record environment and the replay environment be identical. However, this requirement often cannot be met. For instance, when the recorded log is replayed in a programmer's environment, the execution of an application might diverge because the programmer's environment might be configured differently. A discrepancy in the installed base, such as support libraries and DLL files, would affect the replay and cause it to diverge from the originally recorded execution.
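The kind of discrepancy in the installed base described above can be made concrete with a small Python sketch. It assumes, purely for illustration, that each environment can be summarized as a mapping from component name to version string; the function `diff_manifests` and the component names and versions are hypothetical.

```python
from typing import Dict, List, Tuple


def diff_manifests(
    recorded: Dict[str, str], replay: Dict[str, str]
) -> Tuple[List[str], Dict[str, Tuple[str, str]]]:
    """Compare the installed base of the record and replay environments.

    Returns the components missing from the replay environment and, for
    components present in both, the (recorded, replay) version pairs that
    differ.
    """
    missing = sorted(name for name in recorded if name not in replay)
    mismatched = {
        name: (version, replay[name])
        for name, version in recorded.items()
        if name in replay and replay[name] != version
    }
    return missing, mismatched


# Hypothetical manifests for a production and a programmer's environment.
recorded_env = {"libssl": "1.1.1", "libxml2": "2.9.10", "app": "4.2"}
replay_env = {"libssl": "1.1.1", "libxml2": "2.9.14"}

missing, mismatched = diff_manifests(recorded_env, replay_env)
```

Any non-empty result indicates that a replay in the second environment may fail or diverge, which is why requiring identical environments is often impractical.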
Data storage overhead is another consideration when recording memory pages. Conventional checkpointing techniques generally capture the complete state of an application for replay, including the state of file descriptors and various operating system resources. As a result, the amount of recorded data is relatively large, which makes it necessary to impose dependencies on the replay environment, such as requiring that the files in persistent storage be available during a replay.
From the foregoing, it is appreciated that there still exists a need for efficiently recording the execution of a program and replaying the recording in a different operating system environment without the aforementioned drawbacks.