Traditionally, domain experts manually analyze event traces to diagnose performance issues when a computer system becomes slow or non-responsive. Such human interaction limits the effectiveness of trace analysis because manual trace-by-trace analysis is expensive and time-consuming. In addition, manual trace-by-trace analysis does not scale up to the millions of traces that may be available, such as from software vendors.
Typically, an analyst must be a domain expert, and even such experts cannot efficiently analyze traces and pass change requests to developers. For example, upon receiving an event trace, the analyst must identify a problem in the trace, infer a cause of the problem, scan a database of known issues and root causes, and, when a match is found, forward a change request to a developer. When no match is found, however, the analyst must undertake even more expensive work by looking deep into the trace and the corresponding source code to identify a root cause of the problem. The analyst then submits a fix request to a developer and appends the new issue and root cause to the database of known issues and root causes. However skilled the analyst may be, the analyst still must look at each event trace received in order to request a fix. In addition, because the traces causing the most problems do not rise to the surface, the analyst, and hence the developer, may be working on a problem that causes a minor annoyance while a seriously disruptive problem waits for attention.
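For illustration only, the manual workflow described above can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any actual tool: the names `KnownIssue`, `IssueDatabase`, `triage_trace`, and `triage_all` are hypothetical, and the callbacks `identify_problem` and `deep_analysis` merely stand in for the analyst's judgment. Identifying the problem and inferring its cause are collapsed into the single `identify_problem` placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class KnownIssue:
    symptom: str      # problem observed in a trace (hypothetical label)
    root_cause: str   # root cause recorded for that symptom


@dataclass
class IssueDatabase:
    issues: list = field(default_factory=list)

    def find(self, symptom):
        """Scan the known issues for a matching symptom; None when absent."""
        for issue in self.issues:
            if issue.symptom == symptom:
                return issue
        return None

    def add(self, issue):
        self.issues.append(issue)


def triage_trace(trace, db, identify_problem, deep_analysis):
    """Handle one trace the way the manual workflow does.

    identify_problem and deep_analysis are placeholders supplied by the
    caller; they represent work the analyst performs by hand.
    """
    symptom = identify_problem(trace)
    match = db.find(symptom)
    if match is not None:
        # Known issue: forward a change request for the recorded root cause.
        return {"request": "change", "symptom": symptom,
                "root_cause": match.root_cause}
    # Unknown issue: expensive inspection of the trace and source code,
    # then record the new issue so later traces can match it.
    root_cause = deep_analysis(trace)
    db.add(KnownIssue(symptom, root_cause))
    return {"request": "fix", "symptom": symptom, "root_cause": root_cause}


def triage_all(traces, db, identify_problem, deep_analysis):
    # Every trace is visited in arrival order; nothing ranks traces by impact.
    return [triage_trace(t, db, identify_problem, deep_analysis)
            for t in traces]
```

The sketch makes both limitations visible: the cost grows with the number of traces because each one is handled individually, and the loop in `triage_all` has no notion of which problem is most disruptive.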