Distributed systems are difficult to develop, test, and debug. The conditions under which bugs manifest, such as user requests, service load, hardware resources, and system scale, are typically hard to replicate in a test environment. As a result, testing and debugging in a test lab leave many bugs undetected, and these surface only when the system is brought online.
Traditional bug-finding approaches mainly focus on systems before they ship. For example, model checkers control the input and virtualize the environment in which a system runs, so that they can systematically explore the system's state space and check for predicate violations that pinpoint a bug site. State explosion, however, often limits the scale of such testing to a small fraction of the deployed system. The test environment is likewise much simpler than the real one. Such testing also cannot identify performance bugs, because exposing them requires a real environment and a real request load.
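To make the model-checking idea concrete, the following is a minimal sketch, not the implementation of any particular tool. It assumes a toy system in which each process increments a shared counter with a non-atomic read followed by a write. The names `successors`, `invariant`, and `check` are hypothetical and chosen only for illustration. The checker explores every interleaving breadth-first and reports the first state that violates the predicate.

```python
from collections import deque

# A state is (counter, pcs, regs): the shared counter, each process's
# program counter (0 = about to read, 1 = about to write, 2 = done),
# and each process's private copy of the counter.

def successors(state):
    """Yield every state reachable by letting one process take one step."""
    counter, pcs, regs = state
    for p in range(len(pcs)):
        if pcs[p] == 0:  # read the shared counter into a private register
            new_regs = regs[:p] + (counter,) + regs[p + 1:]
            new_pcs = pcs[:p] + (1,) + pcs[p + 1:]
            yield (counter, new_pcs, new_regs)
        elif pcs[p] == 1:  # write back the private value plus one
            new_pcs = pcs[:p] + (2,) + pcs[p + 1:]
            yield (regs[p] + 1, new_pcs, regs)

def invariant(state):
    """Predicate: once all processes finish, no increment may be lost."""
    counter, pcs, _ = state
    all_done = all(pc == 2 for pc in pcs)
    return (not all_done) or counter == len(pcs)

def check(n_procs):
    """Breadth-first exploration of all interleavings.

    Returns (violating_state_or_None, number_of_states_visited).
    """
    init = (0, (0,) * n_procs, (0,) * n_procs)
    seen = {init}
    frontier = deque([init])
    while frontier:
        state = frontier.popleft()
        if not invariant(state):
            return state, len(seen)
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return None, len(seen)

if __name__ == "__main__":
    violation, visited = check(2)
    print("violation:", violation, "states visited:", visited)
```

Even in this toy model, the set of reachable states grows quickly as processes are added, which hints at why state explosion keeps model-checked configurations far smaller than deployed systems.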
A further problem is that, once the system is deployed, the undetected bugs do surface, either violating correctness properties or degrading performance. Catching these bugs and finding their root causes is challenging under deployment conditions, because a deployed system has no bug-checking facility like those available in a controlled test lab. Existing debugging tools are therefore deficient for this setting.