Computers have proliferated into all aspects of society and increasingly form the backbone of large organisations. When a computer system fails to respond as intended, businesses are invariably unable to provide even the most basic of services. Money, reputation or even lives may be lost, dependant upon the criticality of the service, the outage time etc.
In today's increasingly competitive marketplace, quality is of the utmost concern. Users' expectations are high and thus the reliability of hardware/software is especially important. Customers do not tolerate mistakes and the later a defect is discovered, the more costly it can prove to the manufacturer.
Typically therefore, software developers implement a rigorous testing regime in order to ensure that a product or system is fully operational and consistently performs according to its functional specification, before it is made available to the public.
Software development is however an increasingly complex process. Even using the most rigorous techniques, software developed by people will suffer failures. Most software developers understand this and therefore accept the need to be able to effectively diagnose the cause of such a software failure.
A most effective technique is known as “tracing”. In this technique a program records (“traces”) important information that can be used for further analysis after the occurrence of a software failure. FIG. 1 is a component diagram of a computer system performing tracing according to the prior art. A computer system 10, runs an operating system (o/s) 20, which could for example be Windows NT available from Microsoft Corporation. A program 30 executes on top of operating system 20 and may consist of a number of separate modules (not shown), each defining separate functionality. As program 30 executes, it calls various trace functions 40 which are used to record certain information in a trace log 50. Note, although the log is shown as present on the same computer system, computer system 10 may in fact be connected to another machine across a network (neither shown) and the trace log may instead reside on that machine.
Tracing typically occurs at three stages in the processing or executing of a program. These stages are known as the “entry”, “exit” and “exception” points of tracing.
The “entry” trace information is gathered when a module obtains control from another module. The information recorded is normally the input values provided to the module.
The “exit” trace information is gathered when a module returns control to a module from which it has previously received control. The information recorded is normally the output values returned to the module from which it obtained control in the first place.
The “exception” trace information is gathered whenever an extraordinary or interesting event occurs within a module after it has been entered but before it has returned control to the module from whom it originally received control. The information gathered is normally information that relates to the exceptional condition that has occurred.
As previously mentioned the entry, exit and exception information is outputted to a trace log. Upon the occurrence of a software failure, a software developer can analyse the trace information in this log. This information can then be used to determine which modules were called and what information was passed between them, including any interesting events that have occurred. This hopefully allows the developer to understand and locate the source of the error and fix it.
FIG. 2 shows entry, exit and exception tracing according to the prior art. A program 100 consists of two modules, “Main” 105 and “Function” 110. The program runs on an operating system 120 which initially passes control to Main. Main receives two parameters, p1 and p2 as input. These may be specified by the user from the command-line or be the result of output from another program/module (not shown). Main calls a trace_in function to record entry trace to trace log 130, and a trace_out function to record exit trace to trace log 130.
Trace_in is passed three parameters, “Main”, p1, and p2. “Main” is used as an identifier in the trace log such that the values of p1 and p2 can be connected with module Main. It can be seen from the log that parameters p1 and p2 have values of 1 and 3 respectively upon entry to Main. Control is then passed to module FUNCTION 110, which passes parameters p3 and p4 as parameters. FUNCTION performs a trace_in on both p3 and p4, which are identified by the heading “FUNCTION” in trace log 130. It can be seen that p3 has a value of 2 and p4 has a value of 4 upon entry to the module FUNCTION.
A trace_data function writes extraordinary events/events of interest to the trace log. It passes an identifier “FUNCTION” and the data to be read out DATA1, or a number of parameters. This is exception trace.
Module FUNCTION ends and a trace_out function is called passing it the “FUNCTION” identifier, along with parameters p3 and p4. Note, the dotted lines connecting the various trace functions represent calculations and processing performed by the program but not shown (i.e. additional program instructions). It can be seen from trace log 130 that such program instructions have manipulated p3 and p4 such that their values have changed from those initially recorded at trace_in. P3 now equals 5 and p4 equals 2.
As previously mentioned, exit information (via the trace_out function) is recorded prior to a module relinquishing control to the module which originally called it. In this example control is returned to Main 105, additional processing performed (dotted lines) and the trace_out function called. It is passed the “Main” identifier, along with p1 and p2. The trace log once again shows that additional program instructions have manipulated p1 and p2 such that their values are now 4 and 1 respectively. Control is then returned to operating system 120.
Tracing, typically occurring at entry, exit and exception, has a very coarse granularity. It provides very little information as to the path followed during program execution. It is possible to obtain a reasonable level of detail by calling the trace_data function on a regular basis (or whenever anything unusual happens (e.g. the value of a particular parameter is not as expected)), but this is expensive in terms of CPU time. Even on the OS/390 systems (available from IBM Corporation), in which trace information is dumped to fast volatile memory, performance is degraded by approximately 50%. For this reason customers often cannot afford to leave trace running continuously. Thus when a software failure occurs, the operating system often provides call hierarchy (i.e. which modules were called during program execution) but does not provide any more information than that. Time is lost because, trace then has to be turned on and the failure simulated before diagnostics can begin. Customers do not expect these sort of delays. In their eyes, software that they've paid good money for is not supposed to fail.
Not only is tracing expensive in terms of CPU performance, but hand-instrumenting a program with trace instructions is a time consuming job. Knowing the most appropriate points at which to place trace_data statements also requires experience. Trace_in and trace_out statements, on the other hand, always go at the beginning and end of a module respectively.
Even when all the appropriate trace information is captured in a trace log, this information can be difficult to extract. When a module completes, it is only possible to determine which processing occurred within the module by searching for the previous entry trace for that module and examining all the trace entries that this module may have called. This is exceptionally tedious, especially if the flow of control through modules is complex. The situation may be further complicated in a multi-user system where trace records for multiple activities are traced at the same time.
FIG. 3 shows tracing in a multi-user system according to the prior art. A computer system (not shown) runs a Process 1. This includes three program modules A, B and C, each with trace_in and trace_out function calls. Trace Log 150 shows the trace information outputted by process 1. It can be seen that trace_in (ti) is called for all modules. Trace_out (to) is then called for module C. Note, although not shown, the log will in reality contain parameter values for each trace function call (see FIG. 2).
Before trace_out can be called for modules B and A (in that order), process 2 is started. Process 2 also includes three modules, A, B and C, each calling the trace_in and trace_out functions. Thus process 2 dumps the trace_in parameter values for module A out to log 150. This entry fills up the log. New information is thus written to the beginning of the log (shown by the circular arrows). Trace Log 150′ shows what happens next. Process 2 runs faster than process 1 and thus it continues to write to the trace log 150′ from the top. Process 2 thus outputs trace_in information for its module B, and C. It also dumps trace_out information for its modules C and B. Obviously both processes have not finished outputting information to the trace log 150′, but it can be seen that at this point in time all the trace information above the dotted line in trace log 150 for process 1, has been overwritten by process 2. Thus in the event of a software failure, log 150′ provides very little help in diagnosing the execution flow taken by process 1. Note, Processes 1 and 2 may be on the same machine or on separate machines.