1. Field of the Invention
This invention generally relates to the detection and identification of errors in programs which execute on multiple host processors, and more particularly to the collection of the data necessary for determining the source of errors in a multi-host data base management system.
2. General Discussion
In today's computing environments, application programs are sometimes distributed over multiple host processors to enhance performance. To the extent that distributed application programs share resources and data, they need to coordinate their activities to avoid deadlock situations and data corruption. The application programs accomplish this coordination by passing pertinent information amongst themselves. One area in which application programs are typically distributed is Data Base Management Systems (DBMS).
The XTC-UDS (EXtended Transaction Capacity-Universal Data System), which is commercially available from Unisys Corporation, is an example of a DBMS that is operable on multiple host processors. The host processor to which the XTC-UDS DBMS is native is the 2200 Series data processing system, which is also commercially available from Unisys Corporation. The XTC-UDS DBMS allows data base application programs executing on multiple host processors to share a common data base and to distribute their processing over several host processors. Applications which utilize a DBMS similar to XTC-UDS are typically transaction intensive; examples include airline reservation systems and bank transaction processing applications.
Along with the rise in processing power provided by deploying a DBMS on several host processors comes the added complexity of dealing with concurrent data base applications seeking access to a shared data base. Issues such as deadlock detection, cache coherency, and exclusive update must all be addressed for the DBMS to operate properly. With this added complexity, and the additional program code required to deal with it, program bugs may go undetected in the course of normal product development, even with the best of software engineering practices.
Once a product goes to market, program errors become even more difficult to isolate due to the performance requirements placed on a commercial product. If a program spends too much time processing trace data to assist in locating logic errors, the program performance may become unacceptable. The problem becomes even more acute in multi-host environments where thousands of transactions are processed each second.
Historically, when a program such as a DBMS detected a problem on a first host processor, it saved the necessary data for later analysis. In addition, the error would be reported to an operator who could take whatever further steps were necessary. Meanwhile, the other host processors in the multi-host environment would continue to process transactions. If the operator who became aware of the problem at the first host processor could not act quickly enough, the applications on the other host processors might destroy data critical to discovering the source of the problem. If the critical data was not available for analysis, the logic problem could go unsolved, only to resurface another day.
This invention provides a method for minimizing the risk of losing data which is critical to fault isolation and provides for collection of the necessary data at multiple host processors without operator intervention.
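The failure mode described above, and the general idea of capturing data at every host without waiting for an operator, can be illustrated with a small sketch. It is not the implementation of the invention: the names HostTrace, Cluster, and report_error are hypothetical, a bounded in-memory buffer stands in for whatever trace storage a real DBMS uses, and a simple loop stands in for the inter-host message that a real system would send.

```python
from collections import deque


class HostTrace:
    """Hypothetical bounded trace buffer for one host processor."""

    def __init__(self, name, capacity):
        self.name = name
        # Oldest entries are silently discarded once capacity is reached.
        self.entries = deque(maxlen=capacity)
        self.snapshot = None

    def record(self, event):
        self.entries.append(event)

    def freeze(self):
        # Keep only the first snapshot so later requests cannot overwrite it.
        if self.snapshot is None:
            self.snapshot = list(self.entries)


class Cluster:
    """Hypothetical stand-in for the set of hosts sharing a data base."""

    def __init__(self, hosts):
        self.hosts = hosts

    def report_error(self, detecting_host):
        # Assumption: the detecting host's report reaches every host at once;
        # a real system would pass a message between host processors.
        for host in self.hosts:
            host.freeze()


def demo():
    a = HostTrace("A", capacity=3)
    b = HostTrace("B", capacity=3)
    cluster = Cluster([a, b])

    for i in range(3):
        a.record(f"A-txn-{i}")
        b.record(f"B-txn-{i}")

    # Host A detects a problem; every host captures its trace immediately.
    cluster.report_error(a)

    # Host B keeps processing transactions and overwrites its live buffer.
    for i in range(3, 6):
        b.record(f"B-txn-{i}")

    return b.snapshot, list(b.entries)
```

In this sketch, the live buffer on host B no longer holds the entries that were present when host A detected the problem, while the snapshot taken at the time of the report still does. Without the automatic capture, recovery of those entries would depend on how quickly an operator could intervene.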