As businesses increasingly rely on computers for their daily operations, managing the vast amount of business information generated and processed has become a significant challenge. Most large businesses have a wide variety of software application programs managing large volumes of data. Such data may be stored on many different types of storage devices across various types of networks and operating system platforms. These applications typically have different hardware resource requirements and business priorities, and one application may depend upon other applications.
Often coordination is needed between several different “layers” of software to provide all of the communication needed to support an application. An application therefore can be considered to operate in a “stack” of software layers that underlies communication with hardware devices. The bottom layer of the stack, typically an operating system operating in conjunction with device driver software, can communicate directly with hardware devices. Higher-level layers of the stack operate on a logical set of data, shielded from the details of communication with the actual hardware devices. These higher-level layers, such as file systems or database server software, use services of the underlying layers to perform functions on their behalf. Each software layer can be considered to operate within its own “context” based upon the logical representation of data available at that layer.
For example, an application program managing an employee database may involve various computer systems and software components, including an application component and a database component. Each of the components may run on a different type of computer system having one of several operating systems and/or file systems. As a result, a failure of the employee database application may be caused by a failure at one or more software layers and/or in the underlying hardware running a process for that software layer. Different failures are visible at different layers: a fault on one of the computer systems may be detected only by the operating system of that computer system, whereas corruption of a storage device for the database may be detectable by the database software, and insufficient processing resources on one of the computer systems may be detectable by the application software itself.
Adding to the complexity of application programs is a strategy known as clustering. In a clustered environment, computer systems and storage devices are interconnected, typically using a high-speed dedicated connection, to provide better performance, reliability, availability, and serviceability of applications. Redundant interconnections between the computer systems are typically included as well, and the collection of computer systems (also referred to as nodes), storage devices, and redundant interconnections is referred to herein as a cluster. The cluster may appear to perform as a single highly available system even though different software layers may be running on different nodes within the cluster. Furthermore, different types of clusters may be established to perform independent tasks, to manage diverse hardware architectures performing similar tasks, or to accommodate local and backup computer systems that are physically far apart.
In such a multi-layered environment, determining the cause of an application failure is difficult because the relevant evidence is scattered. Each layer of software typically produces diagnostic information (often referred to as a log) when something goes wrong; however, each layer typically records this diagnostic information in its own layer-specific location and layer-specific format. Furthermore, diagnostic information is often recorded on different machines by different layers, a problem that is exacerbated when processes running a given layer of software can move from one node in a cluster to another. The logistics of combining all of this diagnostic information into a single sequence of events and filtering out irrelevant data have heretofore been too difficult to handle in an automated fashion. Typically, whenever a problem occurs, someone who is familiar with all of the layers of software related to the application manually analyzes the various log files to determine what went wrong. Even if the expert can determine the source of the problem, no easy way exists for the expert to express the problem in a form that can be compared against events to detect and forestall similar failures.
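By way of illustration only, the task of combining layer-specific logs into one sequence of events can be sketched in Python as follows. The two log formats, the `Event` record, and the function names `parse`, `timeline`, and `window` are hypothetical and are not taken from any existing tool. The sketch further assumes that each log is already in chronological order and that the clocks of the nodes are synchronized, neither of which is guaranteed in practice.

```python
import heapq
import re
from datetime import datetime, timedelta
from typing import Iterable, Iterator, List, NamedTuple, Pattern


class Event(NamedTuple):
    """Common record into which every layer-specific log line is normalized."""
    time: datetime
    node: str
    layer: str
    message: str


# Hypothetical operating-system log line:
#   2004-03-01 10:15:02 node1 disk sda offline
OS_LINE = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\S+) (.*)")

# Hypothetical database log line (day/month/year):
#   [01/03/2004 10:15:05] node=node2 msg=tablespace unreadable
DB_LINE = re.compile(r"\[(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\] node=(\S+) msg=(.*)")


def parse(lines: Iterable[str], pattern: Pattern[str],
          time_format: str, layer: str) -> Iterator[Event]:
    """Yield one Event per line that matches the layer's format; skip the rest."""
    for line in lines:
        match = pattern.fullmatch(line.strip())
        if match:
            stamp, node, message = match.groups()
            yield Event(datetime.strptime(stamp, time_format), node, layer, message)


def timeline(os_lines: Iterable[str], db_lines: Iterable[str]) -> List[Event]:
    """Merge both logs into one list ordered by timestamp.

    heapq.merge requires each input to be sorted already, which is why the
    sketch assumes that each log file is chronological.
    """
    os_events = parse(os_lines, OS_LINE, "%Y-%m-%d %H:%M:%S", "os")
    db_events = parse(db_lines, DB_LINE, "%d/%m/%Y %H:%M:%S", "database")
    return list(heapq.merge(os_events, db_events, key=lambda event: event.time))


def window(events: Iterable[Event], failure_time: datetime,
           seconds: int) -> List[Event]:
    """Keep only events in the interval [failure_time - seconds, failure_time]."""
    start = failure_time - timedelta(seconds=seconds)
    return [event for event in events if start <= event.time <= failure_time]
```

Calling `timeline` with the lines of an operating-system log and a database log yields a single ordered list of `Event` records, and `window` discards entries that fall outside a chosen interval before a failure. Handling further layers, unsynchronized clocks, and processes that migrate between nodes would require considerably more machinery than this sketch provides.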
Some software vendors have provided tools to assist with problem analysis, such as OpenView® provided by Hewlett-Packard® and Netcool® from Micromuse®. However, most existing tools focus on root cause analysis for networking components in a given networking environment, and not on root cause analysis for applications involving several layers of components in a heterogeneous environment. Furthermore, most existing tools focus on “online” root cause analysis; that is, the tools try to identify the root cause of a problem as events happen, but do not analyze historical log files.
What is needed is a tool that can be used to analyze diagnostic information produced by several software layers supporting an application in a clustering environment. Preferably, the tool should enable the user to obtain information from various log files in different formats and on different machines in a cluster. In addition, the tool should help to identify patterns of events that lead to failure for use in further problem analysis.