Coping with software defects that occur in the post-deployment stage is a challenging problem: bugs may occur only when the system uses a specific configuration and only under certain usage scenarios. Moreover, halting production systems until the bug is tracked down and fixed is often not feasible. Thus, developers have to try to reproduce the bug in laboratory conditions. Often, reproducing the bug accounts for the lion's share of the debugging effort.
Despite increasing efforts and success in identifying and fixing software defects early in the development life cycle, some defects inevitably make their way into production. The wide variety of deployment configurations and the diversity of usage scenarios all but guarantee that any large system will exhibit defects after it has been deployed. Detecting and diagnosing defects in a production environment remains a significant challenge. Failures in such environments might occur with low frequency and be virtually impossible to reproduce. For example, a defect might occur due to a specific concurrent interleaving, a specific lengthy user interaction, or a slow resource leak that gradually degrades system performance, leading to an eventual crash.
Existing tools for diagnosing defects “in the wild” usually incur a large overhead that may significantly disrupt the operation of the deployed system. On the other hand, reproducing the failure in a test environment (if possible at all) may require considerable time and effort. One way to detect rarely occurring defects is to continuously monitor a system for violations of specified correctness properties. For example, this can be achieved by using global property monitors and local assertions. However, the typical cost of these techniques prevents programmers from using them widely in production environments.
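To make the distinction between the two monitoring styles concrete, the following is a minimal sketch in Python. It is an illustration under stated assumptions, not the mechanism of any particular tool: the `PropertyMonitor` class, the `withdraw` function, and the bank-account state are all hypothetical names introduced here. A local assertion checks a condition at a single program point, whereas a global property is a predicate over the whole system state that is re-evaluated as the system runs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PropertyMonitor:
    """Hypothetical global monitor: evaluates named predicates over system state."""
    properties: Dict[str, Callable[[dict], bool]] = field(default_factory=dict)
    violations: List[str] = field(default_factory=list)

    def register(self, name: str, predicate: Callable[[dict], bool]) -> None:
        self.properties[name] = predicate

    def check(self, state: dict) -> List[str]:
        """Evaluate every property; record violations instead of halting."""
        failed = [name for name, pred in self.properties.items() if not pred(state)]
        self.violations.extend(failed)
        return failed


def withdraw(account: dict, amount: int, monitor: PropertyMonitor) -> None:
    # Local assertion: a precondition checked at this one program point.
    if amount <= 0:
        monitor.violations.append("withdraw: non-positive amount")
        return
    account["balance"] -= amount
    account["withdrawn"] += amount


monitor = PropertyMonitor()
# Global properties: predicates over the whole (assumed) account state.
monitor.register("conservation",
                 lambda s: s["balance"] + s["withdrawn"] == s["initial"])
monitor.register("non_negative_balance", lambda s: s["balance"] >= 0)

account = {"initial": 100, "balance": 100, "withdrawn": 0}
withdraw(account, 30, monitor)
ok_check = monitor.check(account)    # both properties hold here
withdraw(account, 90, monitor)
bad_check = monitor.check(account)   # the balance has gone negative
```

In this sketch every registered property is re-evaluated on each call to `check`, so the monitoring cost grows with both the number of properties and the frequency of checks. This is one way to see why such techniques can be too expensive to run continuously in production.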