The present invention relates generally to distributed applications and relates more specifically to error diagnosis in distributed applications.
The goal of distributed application diagnosis is to determine the root cause of a problem (i.e., the first event that led to an observed error). Although the problem can usually be solved easily once the root cause is known, determining the root cause is non-trivial, particularly in distributed environments.
One conventional method for diagnosing distributed applications involves manually interpreting application logs. However, application logs alone typically will not reveal the root cause, because even when they include log statements in their code to assist in diagnosis, these statements often contain incomprehensible, indirect, and/or misleading descriptions of the problem. In addition, temporal and spatial gaps often exist between when and where the problem occurs and when and where the log is recorded. Diagnosis becomes even more difficult when the application topology is more dynamic (e.g., due to virtualization, auto scaling, migration, and the like, as may be the case when deployed in the cloud).