This invention addresses the problem of matching general ordered sequences of symbols, a problem that is important in many areas. For example, future enterprise search will accept large document queries instead of a few keywords, and will need to exploit the order of the query keywords to find more meaningful matches. Both the documents and the queries can then be modeled as ordered sequences. Similarly, ordered sequence matching is a common operation in biological sequence alignment in bioinformatics, particularly for multiple gene sequence alignment.
Diagnosing software problems that are reported to customer support is costly in terms of skilled labor. However, typically half, and sometimes as much as 90 percent, of all problems reported by users are actually re-occurrences or rediscoveries of known problems. While such statistics may seem encouraging, customer support staff spend a significant amount of time manually determining whether or not a reported problem is already known. Automated techniques for discovering similarities in reported problem descriptions can therefore significantly reduce support costs and are thus needed. Furthermore, automated matching does not require customer support staff to have substantial software engineering skills, which further reduces the required resource costs.
Inferring relationships from the natural language descriptions of problem reports, however, is a challenging problem that is beyond the scope of current natural language understanding technology. Fortunately, the problem reports often contain structured diagnostic information that is automatically generated by the troubled software components. Such information is more amenable to automated matching than semi-structured or unstructured symptom descriptions provided by humans.
Automated problem determination can be made possible by matching program failure signatures such as call stacks. Call stack traces are the most prevalent type of information collected by a software system in the case of a system hang or a crash. A call stack reconstructs the sequence of function calls leading up to the failure from the operating system's stack of addresses, onto which an address is pushed each time a function is called and from which it is popped when the function returns. These stack traces provide the function name and the line number or offset where the crash occurred (in the case of a crash), along with the function path through which that call was made. The line number and offset information is not reliable, however, as offsets tend to differ from platform to platform and line numbers tend to differ from one version to another. Also, the parameter values are not available on all platforms for all types of parameters.
Call stacks are good indicators of problem descriptions because a single problem is characterized by a small, if not unique, set of call stack appearances. Typically, if two execution problems have the same origin, they should have the same function name at the top of the stack. Conversely, if two call stacks have the same function name at the top of the stack, they are likely to be due to the same problem, although the function at the top of the stack may contain more than one error. While it is possible for the same problem to be manifested by very different call stacks, it is unlikely that two very similar call stacks correspond to widely different problem descriptions. Thus, finding matching call stacks can provide useful structured information for automatic problem determination.
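As an illustration of the two preceding paragraphs, the following minimal sketch represents a stack trace as a list of frames, discards the unreliable offset information, and compares the function names at the top of two stacks. The names `Frame`, `function_names`, and `same_top_function`, as well as the convention that index 0 is the top of the stack, are assumptions made for this example only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Frame:
    """One stack frame (hypothetical representation for this sketch)."""
    function: str
    offset: Optional[int] = None  # platform dependent, so ignored when matching


def function_names(stack: List[Frame]) -> List[str]:
    """Keep only the function names; index 0 is assumed to be the top of the stack."""
    return [frame.function for frame in stack]


def same_top_function(a: List[Frame], b: List[Frame]) -> bool:
    """Return True if both stacks are non-empty and share the same top function."""
    names_a, names_b = function_names(a), function_names(b)
    return bool(names_a) and bool(names_b) and names_a[0] == names_b[0]


stack_a = [Frame("strlen", 0x1C), Frame("format_msg", 0x88), Frame("main", 0x10)]
stack_b = [Frame("strlen", 0x24), Frame("log_error"), Frame("main")]
print(same_top_function(stack_a, stack_b))
```

A shared top function is only a coarse first indication, since the same function may fail for more than one reason; comparing the remaining frames as an ordered sequence refines it.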
The problem of call stack matching for identifying known problems was first addressed in [2]. This reference established the efficacy of call stack matching for known problem determination. It also introduced the need for recursion removal and for identifying uninformative function names, and proposed the longest common subsequence (LCS) algorithm which is evaluated herein. However, most of the discussion in that reference was qualitative, with few experimental results presented.
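The three ingredients named above can be sketched as follows. This is an illustrative sketch, not the algorithm of the cited reference: it assumes that recursion removal means collapsing consecutive repeats of a function name, that uninformative functions are given as a caller-supplied set, and that similarity is the LCS length divided by the longer stack length. All function names in the code are hypothetical.

```python
from typing import List, Set


def remove_recursion(stack: List[str]) -> List[str]:
    """Collapse consecutive repeats of a function name (assumed meaning of recursion removal)."""
    collapsed: List[str] = []
    for name in stack:
        if not collapsed or collapsed[-1] != name:
            collapsed.append(name)
    return collapsed


def drop_uninformative(stack: List[str], uninformative: Set[str]) -> List[str]:
    """Remove function names that the caller has marked as uninformative."""
    return [name for name in stack if name not in uninformative]


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def stack_similarity(a: List[str], b: List[str], uninformative: Set[str]) -> float:
    """LCS length normalized by the longer cleaned stack (normalization is an assumption)."""
    a = drop_uninformative(remove_recursion(a), uninformative)
    b = drop_uninformative(remove_recursion(b), uninformative)
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


s1 = ["memcpy", "parse", "parse", "parse", "read_cfg", "main"]
s2 = ["memcpy", "parse", "load", "read_cfg", "main"]
print(stack_similarity(s1, s2, {"main"}))  # 0.75
```

In this example the repeated `parse` frames collapse to one and `main` is dropped, leaving three and four frames respectively with a common subsequence of length three.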
Call stack matching for known problem determination was also reported in [4]. Its authors provide a matching algorithm that finds the single longest common subsequence of function names, but they do not provide means to identify uninformative functions or to account for recursive function calls. The authors also proposed a learning scheme for matching, which tries to determine a ‘signature’ of a problem by examining all the stacks corresponding to that problem and identifying the part that these stacks have in common. Unfortunately, their performance results are presented for only a very small data set.
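One way to picture such a signature is to fold a pairwise longest common subsequence over all stacks recorded for one problem, as in the sketch below. This is an assumption about how a common part could be computed, not the learning scheme of the cited reference, and the pairwise fold is a heuristic that is not guaranteed to give the longest subsequence common to all stacks. The names `lcs` and `problem_signature` are hypothetical.

```python
from functools import reduce
from typing import List


def lcs(a: List[str], b: List[str]) -> List[str]:
    """Return one longest common subsequence of a and b."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk forward through the table to recover the subsequence itself.
    out: List[str] = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def problem_signature(stacks: List[List[str]]) -> List[str]:
    """Fold the pairwise LCS over all stacks of one problem (heuristic common part)."""
    if not stacks:
        return []
    return reduce(lcs, stacks)


stacks = [
    ["free", "release", "close_conn", "handle", "main"],
    ["free", "release", "retry", "close_conn", "main"],
    ["free", "log", "release", "close_conn", "worker", "main"],
]
print(problem_signature(stacks))  # ['free', 'release', 'close_conn', 'main']
```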
Call stacks have been analyzed for other purposes. In [5], the call stack information is used to develop an anomaly detection algorithm for intrusion detection. The collection of Java stack traces in a distributed environment is described in [6] and various techniques for comparing the call stacks are discussed.
The general idea of solving problems by matching symptoms against a historical database is also well established, under the name case-based reasoning (CBR). It has been applied to customer support and help desk situations [7, 8], but these approaches try to find similarities in the problem report information supplied by users, not at the program execution level. They typically do not consider call stacks or the approach of looking for sequences in the data. An automated method of matching ordered sequences for call stack analysis applications is therefore needed.