Many computer systems and programs automatically create a record of the sequence or flow of events that are carried out with respect to their operations, internal state, and external messaging. These events may be time-stamped. The recorded sequence of events is commonly referred to as a processing, activity, event, or error log. Typically, the log is used for subsequent analysis and trouble-shooting. The trouble-shooting could involve noting operation or communication errors and taking appropriate action.
For example, a file server system, which might service a plurality of users that request access to copies of data files, typically records event flows in a log. Responding to the requests for copies is seen by the system as a sequence of computer events that can be recorded in an activity log for later analysis. Another activity log example is that of a database manager which fields search query requests concerning a database from multiple users and returns data fitting the query requests. The requests and responses comprise a sequence of events. A further example is that of a commercial order-filling or banking system that creates an activity log as it receives orders or account changes and operates on them in a transactional fashion.
In each of these examples, a log is kept because it might be necessary to trace the sequence of events performed by the computer system. As mentioned, this is useful in recovering from a system malfunction or equipment failure. These logs, however, are not typically recorded in an easily readable manner. Much of the information is coded and relatively cryptic. As a result, processing errors are not readily apparent from viewing the log and significant events can be easily missed. For example, in many computer systems, a system error log stores a record of all of the error reports that are received from system components. The error log is used to trace and understand faults that have occurred. The number of errors in the log can be very large, however, and with the exception of a few patterns that the analyst may recognize from experience, the error log generally provides no clue as to the source of an error or how to resolve it. In complex systems, the analyst may not even be able to determine whether the entries in the error log are due to a hardware fault or to a software problem.
These problems are exacerbated in computer or software systems that include multiple processors or software subsystems, each producing its own event flow or log. For example, in networked and distributed computing systems, capabilities of the system may be distributed among a plurality of modules, and the control, supervision, and administration capabilities of the system may be distributed among a plurality of computing facilities operating in cooperation. As in stand-alone systems, rapid recognition of the sources of new problems is critical to understanding the current state of the networked or distributed system so that prompt action can be taken to resolve such problems.
When a problem occurs, many possible logs or combinations of logs may be produced by the networked or distributed system. Analyzing the root cause of the problem requires identifying the “back-trace” or thread of events that lead up to or result from the problem. However, the event flows and logs can be enormous. Output of this volume, when produced on human-readable media, is difficult to use for problem identification. Accordingly, the logs from a computer are typically recorded electronically (e.g., stored in computer memory), rather than printed, and the logs from several computers are sometimes collected and recorded at a central location. Then, analysts needing to review the electronically stored logs typically use a log analysis tool which allows the user to search for and display logs of interest.
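By way of illustration only, the conventional approach described above (collecting time-stamped logs from several computers at a central location and searching them for entries of interest) might be sketched as follows. The line format (an ISO-8601 time stamp followed by a message), the host names, and the names `LogEvent`, `parse_line`, `merge_logs`, and `search` are assumptions introduced for this sketch and do not describe any particular log analysis tool.

```python
import heapq
from datetime import datetime
from typing import Dict, Iterable, Iterator, List, NamedTuple


class LogEvent(NamedTuple):
    """One time-stamped log entry (hypothetical structure for this sketch)."""
    timestamp: datetime
    host: str
    message: str


def parse_line(host: str, line: str) -> LogEvent:
    # Assumed line format: "<ISO-8601 time stamp> <free-text message>"
    stamp, _, message = line.partition(" ")
    return LogEvent(datetime.fromisoformat(stamp), host, message)


def merge_logs(logs: Dict[str, List[str]]) -> Iterator[LogEvent]:
    """Combine per-host logs into a single time-ordered event flow.

    Each host's log is assumed to be already sorted by time stamp.
    """
    streams = [
        [parse_line(host, line) for line in lines]
        for host, lines in logs.items()
    ]
    return heapq.merge(*streams, key=lambda event: event.timestamp)


def search(events: Iterable[LogEvent], keyword: str) -> List[LogEvent]:
    """Return the events whose message contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [event for event in events if needle in event.message.lower()]


if __name__ == "__main__":
    collected = {
        "server-a": [
            "2024-01-01T10:00:00 request received",
            "2024-01-01T10:00:05 ERROR disk timeout",
        ],
        "server-b": [
            "2024-01-01T10:00:02 ERROR connection reset",
        ],
    }
    for event in search(merge_logs(collected), "error"):
        print(event.timestamp.isoformat(), event.host, event.message)
```

A keyword search of this kind returns matching entries in time order but does not, by itself, indicate how the matching events relate to one another or which of them is significant.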
Although logs may be stored, searched, and displayed electronically, it nonetheless remains difficult for persons examining the output to understand the significance of a particular log event, or to identify or select those events which may be important from among the large quantity of data collected. Again, this is especially so in complex, networked, or distributed computing environments. Thus, with current methods, analysts must have extensive subject-matter expertise, often including the application of anecdotal knowledge regarding problems previously encountered. Moreover, existing analysis tools are particularly ineffective at establishing and displaying meaningful correlations or patterns among log events occurring in a single computer, events occurring in a group of related computers, events occurring in computers directed to the same application, and events occurring across the installed base of computers in a networked or distributed system.
A need therefore exists for an improved event flow or log analysis tool. Accordingly, a solution that addresses, at least in part, the above and other shortcomings is desired.