The invention relates generally to computer operating systems, and deals more particularly with a distributed computer operating system for a distributed application which operating system can automatically and efficiently resynchronize a two-phase commit procedure after a sync point failure.
This patent application is related to U.S. patent applications:
U.S. patent application Ser. No. 07/525,430, entitled "LOG NAME EXCHANGE FOR RECOVERY OF PROTECTED RESOURCES" filed May 16, 1990 by M. K. Ainsworth et al.;
U.S. patent application Ser. No. 07/526,471, entitled "OPTIMIZATION OF COMMIT PROCEDURES" filed May 16, 1990 by A. Coleman et al.;
U.S. patent application Ser. No. 07/525,938, entitled "RECOVERY FACILITY FOR INCOMPLETE SYNC POINTS FOR DISTRIBUTED APPLICATION" filed May 16, 1990 by M. K. Ainsworth et al.;
U.S. patent application Ser. No. 07/525,427, entitled "COORDINATED SYNC POINT MANAGEMENT OF PROTECTED RESOURCES" filed May 16, 1990 by A. Coleman;
U.S. patent application Ser. No. 07/525,939, entitled "REGISTRATION OF RESOURCES FOR COMMIT PROCEDURES" filed May 16, 1990 by A. Coleman;
U.S. patent application Ser. No. 07/526,472, entitled "COORDINATED HANDLING OF ERROR CODES AND INFORMATION DESCRIBING ERRORS, IN A COMMIT PROCEDURE" filed May 16, 1990 by E. A. Pruul et al.; and
U.S. patent application Ser. No. 07/525,426, entitled "LOCAL AND GLOBAL COMMIT SCOPES TAILORED TO WORK UNITS" filed May 16, 1990 by B. A. Maslak et al.
The operating system of the present invention can be used in a network of computer systems. Each such computer system can comprise a central host computer and a multiplicity of virtual machines or other types of execution environments. The host computer for the virtual machines includes a system control program that schedules access by each virtual machine to a data processor of the host and helps to manage the resources of the host, including a large memory, such that each virtual machine appears to be a separate computer. Each virtual machine can also converse with the other virtual machines to send messages or files via the host. Each virtual machine has its own CMS portion of the system control program to interact with (i.e., receive instructions from and provide prompts for) the user of the virtual machine. There may be resources, such as a shared file system (SFS) and shared SQL relational databases, which are accessible by any user virtual machine and the host.
Each such system is considered to be one real machine. It is common to interconnect two or more such real machines in a network, and transfer data via conversations between virtual machines of different real machines. Such a transfer is made via communication facilities such as AVS Gateway and VTAM facilities ("AVS Gateway" and "VTAM" are trademarks of IBM Corp. of Armonk, N.Y.).
An application can change a database or file resource by first making a work request defining the changes. In response, provisional changes according to the work request are made in shadow files while the original database or file is unchanged. At this time, the shadow files are not valid. Then, the application can request that the changes be committed to validate the shadow file changes, and thereby, substitute the shadow file changes for the original file. A one-phase commit procedure can be utilized. The one-phase commit procedure consists of a command to commit the change of the resource as contained in the shadow file. When resources such as SFS or SQL resources are changed, the commits to the resources can be completed in separate one-phase commit procedures. In the vast majority of cases, all resources will be committed in the separate procedures without error or interruption. However, if a problem arises during any one-phase commit procedure some of the separate commits may have completed while others have not, causing inconsistencies. The cost of rebuilding non-critical resources after the problem may be tolerable in view of the efficiency of the one-phase commit procedure.
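The shadow-file technique described above can be sketched as follows. This is a minimal illustration only, assuming that each resource is an ordinary file and that the shadow copy is kept next to it under a hypothetical ".shadow" suffix; the helper names are assumptions made for this example and are not taken from SFS or SQL.

```python
import os


def shadow_path(path: str) -> str:
    """Hypothetical naming rule: the shadow file sits beside the original."""
    return path + ".shadow"


def write_provisional(path: str, data: str) -> None:
    """Record the work request in the shadow file; the original is untouched."""
    with open(shadow_path(path), "w", encoding="utf-8") as shadow:
        shadow.write(data)


def commit(path: str) -> None:
    """One-phase commit: substitute the shadow file for the original."""
    os.replace(shadow_path(path), path)


def backout(path: str) -> None:
    """Discard the provisional changes, leaving the original as it was."""
    if os.path.exists(shadow_path(path)):
        os.remove(shadow_path(path))
```

If two such files are committed by two separate calls to commit and a failure occurs between the calls, one file holds its new contents and the other its old contents, which is the inconsistency noted above.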
However, a two-phase commit procedure is required to protect critical resources and critical conversations. For example, assume a first person's checking account is represented in a first database and a second person's savings account is represented in a second database. If the first person writes a check to the second person and the second person deposits the check in his/her savings account, the two-phase commit procedure ensures that if the first person's checking account is debited then the second person's savings account is credited or else neither account is changed. The checking and savings accounts are considered protected, critical resources because it is very important that data transfers involving the checking and savings accounts be handled reliably. An application program can initiate the two-phase commit procedure with a single command, which procedure consists of the following steps, or phases:
(1) During a prepare phase, each participant (debit and credit) resource is polled by the sync point manager to determine if the resource is ready to commit all changes. Each resource promises to complete the resource update if all resources successfully complete the prepare phase, i.e., are ready to be updated.
(2) During a commit phase, the sync point manager directs all resources to finalize the updates or back them out if any resource could not complete the prepare phase successfully.
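The two phases can be sketched for the checking and savings example as follows. This is a simplified, in-memory illustration under stated assumptions: the Resource class, its methods, and the sync_point function are hypothetical names for this example, and no logging or failure handling is shown.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Resource:
    """Hypothetical protected resource holding an account balance."""
    name: str
    balance: int
    pending: Optional[int] = None  # provisional value, not yet valid

    def prepare(self, delta: int) -> bool:
        """Vote ready only if the update can be completed."""
        new_balance = self.balance + delta
        if new_balance < 0:
            return False
        self.pending = new_balance
        return True

    def commit(self) -> None:
        """Finalize the provisional update."""
        if self.pending is not None:
            self.balance = self.pending
        self.pending = None

    def backout(self) -> None:
        """Discard the provisional update."""
        self.pending = None


def sync_point(updates: List[Tuple[Resource, int]]) -> bool:
    """Run both phases; return True if committed, False if backed out."""
    # Phase 1 (prepare): poll every participant resource.
    votes = [resource.prepare(delta) for resource, delta in updates]
    all_ready = all(votes)
    # Phase 2 (commit): finalize everywhere, or back out everywhere.
    for resource, _ in updates:
        if all_ready:
            resource.commit()
        else:
            resource.backout()
    return all_ready
```

For example, with a checking balance of 100 and a savings balance of 50, sync_point([(checking, -30), (savings, 30)]) commits both changes, whereas a debit of 500 fails the prepare phase and leaves both balances unchanged.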
An IBM Systems Network Architecture (SNA) LU6.2 architecture (reference SC31-6808, Chapter 5.3 "Presentation Services-Sync Point Verbs", published by IBM Corp.) was previously known to coordinate commits between two or more protected resources. This architecture previously addressed sync point facilities consisting of a sync point manager which performed both sync point and associated recovery processing running in a single application environment. Several adapters could run simultaneously in this environment. The LU6.2 architecture supports a sync point manager (SPM) which is responsible for resource coordination, sync point logging and recovery. The prior art CICS/VS (trademark of IBM Corp. of Armonk, N.Y.) environment supports such an architecture.
According to the IBM SNA LU6.2 architecture prior art, in phase one and in phase two, commit procedures are executed and the sync point manager logs the phase in the sync point log. Also, the sync point manager logs an identification number of a logical unit of work which is currently being processed. Such logging assists the sync point manager in resource recovery or resynchronization in the event that a problem arises during the two-phase commit procedure. If such a problem arises after the two-phase commit procedure has begun, the log is read and resource recovery processing is implemented to bring associated resources to a consistent state. The problems include failure of a communication path or failure in a resource manager.
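The role of the sync point log in recovery can be sketched as follows. This is an illustration only, not the LU6.2 log format: the state names (PHASE_ONE, PHASE_TWO_COMMIT, COMPLETE), the JSON record layout, and the rule that a logical unit of work still in phase one is backed out are all assumptions made for this example.

```python
import json
from typing import Dict, List


def log_state(log: List[str], luw_id: str, state: str) -> None:
    """Append one record naming the logical unit of work and its phase."""
    log.append(json.dumps({"luw_id": luw_id, "state": state}))


def recovery_actions(log: List[str]) -> Dict[str, str]:
    """Read the log and decide how to bring each unfinished LUW to a
    consistent state (decision rule assumed for illustration)."""
    last_state: Dict[str, str] = {}
    for line in log:
        record = json.loads(line)
        last_state[record["luw_id"]] = record["state"]

    actions: Dict[str, str] = {}
    for luw_id, state in last_state.items():
        if state == "PHASE_TWO_COMMIT":
            actions[luw_id] = "commit"   # decision was made; finish it
        elif state == "PHASE_ONE":
            actions[luw_id] = "backout"  # no decision was logged
        # "COMPLETE" needs no recovery action.
    return actions
```

Because only the last logged phase of each logical unit of work is consulted, a unit that reached its final state before the failure is ignored during recovery.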
The aforesaid SNA LU6.2 sync point architecture manages a commit failure in the following manner. The sync point manager that knows its second phase decision, based on the state in the log entry, invokes a complete resynchronization operation with any failed resources that it was coordinating before returning control to the application program that requested the commit. One of the failed resources can be a protected conversation. In the aforesaid SNA LU6.2 sync point architecture, the initiator's sync point manager must reestablish a session with the partner sync point manager or recovery facility in the system where the failure occurred. If such a session is not immediately available, the sync point manager continues to seek a session until one is available. For other protected resources which also need to be resynchronized, a session may also be needed with the resource manager that encountered the failure. The sync point manager cannot complete its processing until recovery takes place. The delay can be protracted, and the initiating application and possibly other participating applications are prevented from doing other useful work during the delay. The SNA LU6.2 sync point architecture permits a heuristic decision (manual or system default intervention) to force resynchronization. The intervention could be programmed or directly controlled by an operator to prevent indefinite interruption to the application program. However, the intervention may cause heuristic damage whereby some resources involved in the sync point are committed and some are backed out.
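The blocking behavior and the risk of heuristic damage can be sketched as follows. This is a hypothetical model rather than the LU6.2 verbs: the resynchronize function, its parameters, and the use of an attempt limit to stand in for operator or system default intervention are assumptions made for this example.

```python
from typing import Callable, Optional, Tuple


def resynchronize(
    session_available: Callable[[], bool],
    logged_decision: str,
    max_attempts: Optional[int] = None,
    heuristic_action: str = "backout",
) -> Tuple[str, bool]:
    """Return (action applied at the failed resource, heuristic damage flag).

    The caller is blocked until this function returns. With max_attempts
    set to None the loop seeks a session indefinitely.
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        # A real sync point manager would wait between attempts.
        if session_available():
            # Session reestablished: apply the decision found in the log.
            return logged_decision, False
    # Intervention: force an outcome without contacting the partner.
    # Damage results when the forced action differs from the decision
    # already applied to the other resources.
    return heuristic_action, heuristic_action != logged_decision
```

For instance, if the logged decision is "commit" and the intervention forces "backout" after three failed attempts, the function returns ("backout", True), meaning that some resources are committed while this one is backed out.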
It was also known from an article entitled "A Commit Protocol for Resilient Transactions" by Pui Ng from the University of Illinois at Urbana-Champaign, to provide an application program which is checkpointed at certain intervals in its processing. During each checkpoint, information about the state of a process is written onto a back-up node. If a failure occurs after a completed checkpoint and before the next checkpoint, all processing and updates occurring after the completed checkpoint must be backed out. This backout occurs asynchronously relative to the application program, and the application program can restart at the checkpoint without waiting for the backout to occur. When restarted, the application program can attempt a new instance of the same routine to process the same data under a new name. This new instance becomes the valid one, and the prior one under its original name becomes invalid. The article also describes a method for naming the instances to differentiate the valid one from the invalid one. However, this article is not concerned with asynchronous recovery of a failed commit procedure.
Accordingly, a general object of the present invention is to provide a process for resynchronizing a commit procedure for protected resources and conversations while avoiding extensive delays in the operation of an application program that initiated the commit procedure.
Another object of the present invention is to allow an application to make a local decision whether or not the sync point manager should wait for resynchronization to occur before returning to the application.