When a new computer software product is conceived, a typical cycle or process takes place in the course of bringing the product to market, to ensure the reliability and serviceability required by the customer. The programming cycle typically includes:
- the conception of the idea
- the design of the software to implement the idea
- the coding of the program based on the design
- the initial testing and debugging in the development environment
- the testing at the user site
- the final release of the software to the market
- the maintenance
- the update of the software with new releases
Normally, the release of a software product depends on meeting a development calendar. If defects or errors (known as bugs) appear in the code, the product deadlines may be missed. This is particularly likely if the bugs are complex, subtle or otherwise difficult to find. Such delays can cause a software product to fail in the marketplace. In the same way, the availability, quality and ease of maintenance of a software product are key factors of success in a competitive environment.
Historically, most software was designed under the assumption that it would never fail. Software had little or no error detection capability designed into it. When a software error or failure occurred, it was usually the computer operating system that detected the error, and the computer operator cancelled the execution of the program because the correct result was not achieved. To facilitate the development, test and maintenance of increasingly large and complex programs, it has become necessary to define debugging methods and tools to detect, isolate, report and recover from all software and hardware malfunctions.
Error handling and problem determination are based on the following principles:
- all programs may fail or produce erroneous data
- a program detecting an error may itself be in error
- all detected errors or failures, permanent or transient, must be reported with all the data needed to understand what is going on. A transient error is a temporary failure which is recovered by the program itself but is nevertheless reported for information purposes.
- all errors fall into one of the following categories:
  - Hardware error or failure
  - Functional error
  - Invalid input internal to the program
  - Invalid input external to the program
  - Time out
  - Exception conditions, such as:
    - divide error
    - invalid address
    - loop
    - invalid operation code
    - floating point error
    - . . .
    Exception conditions are errors detected by the processor itself in the course of executing instructions. They can be classified as Faults, Traps, or Aborts, depending on the usage of the various suppliers of data processors.
Upon a software error, the most commonly used method is to capture the entire storage area allocated to the program: this process is called a Global or Physical Dump. However:
- the error occurs before the program can detect it, and the key data required to determine the cause of the error or failure are often lost or overlaid
- the more complex the error is, the more data are generated
- the dispersion of the information in the system storage increases the difficulty of isolating complex errors
- the transfer of a large quantity of data consumes time and storage and can affect the customer's performance.
As frequently happens, so much output is generated that any significant information is buried in a mass of unimportant details. Thus the programmer must always guess whether the benefits of preserving and retrieving all the data in the processor storage outweigh the disadvantages of an endless and laborious analysis. Moreover, it is not always easy to follow the path of execution to the point where the error finally appears, and most program developers use a process called Trace to isolate a software error. According to this process, Trace points are placed within the failing program in order to sample data along the path of execution, the problem is recreated, and data from the Trace points are collected. Unfortunately, Traces have some undesirable side effects, including the following:
- Traces require a branch to a trace routine every time a Trace point is encountered, often resulting in a significant impact not only on the problem program's performance, but also on other programs executing on the same system
- Traces require large data sets to contain the volumes of data generated by Trace points
- the programmer who uses Traces to capture diagnostic data invariably finds himself sifting through large amounts of data, the majority of which does not reflect on the cause of the error
- the problem must be reproduced. If the software problem was caused by a timing problem between two programs (e.g., two networking programs communicating with each other), the Trace can slow the program execution down to the point where most timing problems cannot be recreated.
Solicited Dumps and Traces, as described previously, are triggered at the request of an external intervening party: console, host, operator . . . . They are based on a methodology which waits for the damage caused by a software error to surface. In both cases large amounts of data are collected, in the hope of catching the data that will determine what went wrong.
Immediate error detection and automatic diagnostic data collection can be achieved by means of error code placed directly within the program during development. When an error or failure occurs, it is detected by the program itself, which calls a process to capture and report only the data required to debug the error: this process is usually called Error Notification. The description of the data, such as layout and format, is generally stored in a predefined table whose entries are selected by the error detection code of the program. Typical of this state of the art is U.S. Pat. No. 5,119,377, disclosing a method and system for detecting and diagnosing errors in a computer program. The major advantages of this process are the following:
- The reported information can be processed, visualized and interpreted in real time.
- The data required to diagnose the error are captured the first time the error appears: the problem does not have to be recreated.
- The error can be isolated and its propagation stopped before permanent damage can occur.
- The data reported are limited to the error to be resolved, which facilitates data reporting as well as problem isolation and correction.
- The process is only called conditionally when the error is detected and remains completely idle until such a condition occurs. The impact on the computer resources and on the performance of the programs remains minimal.
- Selective Dumps, limited to the error context, can be automatically triggered and retrieved at the request of the program itself (Unsolicited Dump).
- Permanent Traces can be included in the captured and reported data. These Traces, also called internal Traces, are an integral part of the code. They are cyclically updated according to the program progress and thereby allow a dynamic view of the suspected code.
- The process can be extended to events, to report data at specific stages of the code progress or at particular occurrences.
The Error Notification process, previously described, implies that all pieces of code can detect and describe the errors in which they are involved, together with the actions to be taken to recover control or minimize the impact of the failing element. That means a systematic checking of all inputs (internal and external), the use of hardware checkers and the implementation of functional tests and timers at the key points of the code.
At this stage of the analysis, it is appropriate to classify errors into two different types:
- Minor Errors: the program itself detects the error or failure, and the associated pertinent information is collected by means of a dedicated error code.
- Major Errors: the program loses control of the operations and is no longer able to detect the error or failure itself. The error is detected by an external system (the operating system, control program, data processor . . . ) and the pertinent information is collected and reported independently of the faulty program. Exception conditions such as divide error, invalid operation code, loop, floating point error . . . belong to this category of major error.
Major Errors pose the problem of selective data capture by an independent system: the faulty program is no longer able to describe the useful data to report, and the Error Notification method previously detailed becomes, in this case, inoperative. For lack of specific guidelines, the external code is constrained to collect all the available data indiscriminately by means of a global Dump or external Traces. In addition to the disadvantages inherent in Dump or Trace usage, this situation prevents:
- the problem investigation in real time
- the automatic analysis of the data
- the triggering of specific recovery actions
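By way of illustration only, the Error Notification approach described earlier for Minor Errors (error code placed within the program, a predefined table whose entries select the data to capture, and a permanent internal Trace included in the report) might be sketched as follows in Python. Every name here (CAPTURE_TABLE, INTERNAL_TRACE, REPORTS, notify_error, handle_request, the error codes and the field names) is a hypothetical assumption made for this sketch and is not taken from U.S. Pat. No. 5,119,377 or from any actual implementation.

```python
from collections import deque

# Hypothetical predefined table: each error code selects the names of the
# data items worth reporting for that error, and nothing else.
CAPTURE_TABLE = {
    "E_INVALID_INPUT": ("request_id", "payload_length"),
    "E_TIMEOUT": ("request_id", "elapsed_ms"),
}

# Permanent (internal) Trace: a small cyclic buffer that is part of the code
# and is overwritten as the program progresses.
INTERNAL_TRACE = deque(maxlen=4)

# Stand-in for whatever channel would deliver reports to an operator or host.
REPORTS = []


def trace(step):
    """Record the latest step in the cyclic internal Trace."""
    INTERNAL_TRACE.append(step)


def notify_error(code, context):
    """Capture and report only the data the table lists for this error code."""
    wanted = CAPTURE_TABLE[code]
    report = {
        "code": code,
        "data": {name: context.get(name) for name in wanted},
        "trace": list(INTERNAL_TRACE),  # snapshot of the internal Trace
    }
    REPORTS.append(report)
    return report


def handle_request(request_id, payload, max_length=8):
    """Toy program with error detection code placed directly inside it."""
    trace("enter handle_request")
    context = {
        "request_id": request_id,
        "payload_length": len(payload),
        "payload": payload,  # available, but not selected by the table
    }
    trace("validate input")
    if len(payload) > max_length:
        # Minor Error: the program detects the failure itself and calls the
        # notification process; this path is idle until an error occurs.
        notify_error("E_INVALID_INPUT", context)
        return None
    trace("process payload")
    return payload.upper()
```

In this sketch, calling handle_request with an over-long payload returns None and appends a single report holding only the request identifier, the payload length and the current internal Trace; a valid call produces no report at all. A Major Error, where the program loses control, is deliberately not covered: notify_error can only run if the faulty program is still able to call it.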