Distributed computer systems can be composed of multiple products and technologies and, as the name indicates, are distributed in nature. Analyzing failures in distributed and composed applications is a challenging task, since a failure may originate in a single component, in multiple components, or in a coordination mismatch among the components. Each component or product in a distributed computer system may have its own mechanism for troubleshooting problems and failures of that particular component or product. For example, many systems have a built-in mechanism to write errors or failures to log files, and applications exist that gather log files from multiple components into a single place for easier review and troubleshooting of failures.
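By way of illustration only, the gathering of log files from multiple components into a single place may be sketched as follows. The sketch is a minimal example under stated assumptions and does not describe any particular product: the names LogEntry, parse_lines, and merge_logs are hypothetical, each log line is assumed to begin with an ISO-8601 timestamp followed by a space and a message, and each component's log is assumed to be in chronological order already.

```python
import heapq
from datetime import datetime
from typing import Dict, Iterable, Iterator, List, NamedTuple


class LogEntry(NamedTuple):
    """One parsed log line (hypothetical structure for this sketch)."""
    timestamp: datetime
    component: str
    message: str


def parse_lines(component: str, lines: Iterable[str]) -> Iterator[LogEntry]:
    """Parse lines of the assumed form '<ISO-8601 timestamp> <message>'."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        stamp, _, message = line.partition(" ")
        yield LogEntry(datetime.fromisoformat(stamp), component, message)


def merge_logs(logs: Dict[str, Iterable[str]]) -> List[LogEntry]:
    """Merge per-component logs into one chronologically ordered list.

    heapq.merge requires every input stream to be sorted already, which is
    why each component's log is assumed to be in chronological order.
    """
    streams = [parse_lines(name, lines) for name, lines in logs.items()]
    return list(heapq.merge(*streams))


if __name__ == "__main__":
    # Made-up sample data, used only to demonstrate the merge.
    sample = {
        "gateway": [
            "2024-01-01T10:00:00 request received",
            "2024-01-01T10:00:02 ERROR upstream timeout",
        ],
        "database": [
            "2024-01-01T10:00:01 ERROR connection refused",
        ],
    }
    for entry in merge_logs(sample):
        print(entry.timestamp.isoformat(), entry.component, entry.message)
```

In the merged view of this sample data, the database error appears between the two gateway entries, which is the kind of cross-component ordering that is difficult to see when each log file is reviewed separately.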
Identifying a root cause of a non-trivial failure in a distributed computer system may take many exchanges of information and analysis between the customer and system support teams, since in such scenarios log files may not be sufficient to diagnose the failure. In order to understand the failure, support personnel may need to gather additional information about application configurations, runtime configurations, and sometimes infrastructure configurations. Moreover, the reasons for failures may range from missing or inappropriate configurations to conflicts in the environment or the unavailability of dependencies. While the root cause of a failure may eventually turn out to be trivial and relatively easy to remedy, the time and effort needed to identify the root cause may be substantial, and the reputation of a software provider may be damaged both by the failures and by the time and resources required to remedy them.