The invention relates generally to computer operating systems, and deals more particularly with a computer operating system which coordinates the handling of error flags and information describing errors in a commit procedure for heterogeneous resources.
This patent application is related to the following U.S. patent applications:
U.S. patent application Ser. No. 07/525,430, entitled "LOG NAME EXCHANGE FOR RECOVERY OF PROTECTED RESOURCES" filed May 16, 1990 by M. K. Ainsworth et al.;
U.S. patent application Ser. No. 07/526,471, entitled "OPTIMIZATION OF COMMIT PROCEDURES" filed May 16, 1990 by A. Coleman et al.;
U.S. patent application Ser. No. 07/525,938, entitled "RECOVERY FACILITY FOR INCOMPLETE SYNC POINTS FOR DISTRIBUTED APPLICATION" filed May 16, 1990 by M. K. Ainsworth et al.;
U.S. patent application Ser. No. 07/525,427, entitled "COORDINATED SYNC POINT MANAGEMENT OF PROTECTED RESOURCES" filed May 16, 1990 by M. K. Ainsworth;
U.S. patent application Ser. No. 07/525,939, entitled "REGISTRATION OF RESOURCES FOR COMMIT PROCEDURES" filed May 16, 1990 by A. Coleman; and
U.S. patent application Ser. No. 07/525,429, entitled "ASYNCHRONOUS RESYNCHRONIZATION OF A COMMIT PROCEDURE" filed May 16, 1990.
The operating system of the present invention can be used in a network of computer systems. Each such computer system can comprise a central, host computer and a multiplicity of virtual machines or other types of execution environments. The host computer for the virtual machines includes a system control program to schedule access by each virtual machine to a data processor of the host, and help to manage the resources of the host, including a large memory, such that each virtual machine appears to be a separate computer. Each virtual machine can also converse with the other virtual machines to send messages or files via the host. Each virtual machine has its own CMS portion of the system control program to interact with (i.e., receive instructions from and provide prompts for) the user of the virtual machine. There may be resources such as shared file system (SFS) and shared SQL relational databases which are accessible by any user virtual machine and the host.
Each such system is considered to be one real machine. It is common to interconnect two or more such real machines in a network, and transfer data via conversations between virtual machines of different real machines. Such a transfer is made via communication facilities such as AVS Gateway and VTAM facilities ("AVS Gateway" and "VTAM" are trademarks of IBM Corp. of Armonk, NY).
An application can change a database or file resource by first making a work request defining the changes. In response, provisional changes according to the work request are made in shadow files while the original database or file is unchanged. At this time, the shadow files are not valid. Then, the application can request that the changes be committed to validate the shadow file changes, and thereby substitute the shadow file changes for the original file. A one-phase commit procedure can be utilized. The one-phase commit procedure consists of a command to commit the change of the resource as contained in the shadow file. When resources such as SFS or SQL resources are changed, the commits to the resources can be completed in separate one-phase commit procedures. In the vast majority of cases, all resources will be committed in the separate procedures without error or interruption. However, if a problem arises during any one-phase commit procedure, some of the separate commits may have completed while others have not, causing inconsistencies. The cost of rebuilding non-critical resources after the problem may be tolerable in view of the efficiency of the one-phase commit procedure.
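The shadow-file approach described above can be illustrated with a minimal sketch. The class name `ShadowedResource` and its methods are hypothetical and are not taken from SFS or SQL/DS; the sketch only assumes that a resource holds its valid contents and a provisional copy.

```python
class ShadowedResource:
    """Hypothetical resource that stages changes in a shadow copy."""

    def __init__(self, data):
        self.data = dict(data)  # the original, valid contents
        self.shadow = None      # provisional changes, not yet valid

    def work_request(self, changes):
        # Provisional changes go to the shadow copy; the original is untouched.
        if self.shadow is None:
            self.shadow = dict(self.data)
        self.shadow.update(changes)

    def commit(self):
        # One-phase commit: a single command substitutes the shadow
        # contents for the original contents.
        if self.shadow is not None:
            self.data = self.shadow
            self.shadow = None

    def backout(self):
        # Discard the provisional changes; the original stays valid.
        self.shadow = None
```

If two such resources are committed in separate one-phase procedures and a failure occurs between the two `commit` calls, one resource holds the new contents and the other the old, which is the inconsistency noted above.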
However, a two-phase commit procedure is required to protect critical resources and critical conversations. For example, assume a first person's checking account is represented in a first database and a second person's savings account is represented in a second database. If the first person writes a check to the second person and the second person deposits the check in his/her savings account, the two-phase commit procedure ensures that if the first person's checking account is debited then the second person's savings account is credited or else neither account is changed. The checking and savings accounts are considered protected, critical resources because it is very important that data transfers involving the checking and savings accounts be handled reliably. An application program can initiate the two-phase commit procedure with a single command, which procedure consists of the following steps, or phases:
(1) During a prepare phase, each participant (debit and credit) resource is polled by the sync point manager to determine if the resource is ready to commit all changes. Each resource promises to complete the resource update if all resources successfully complete the prepare phase, i.e., are ready to be updated.
(2) During a commit phase, the sync point manager directs all resources to finalize the updates or back them out if any resource could not complete the prepare phase successfully.
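The two phases can be sketched as follows. This is an illustrative outline only, not the sync point manager of any particular system; the names `Resource` and `sync_point` and the `can_prepare` flag are assumptions introduced for the example.

```python
class Resource:
    """Hypothetical participant in a two-phase commit."""

    def __init__(self, name, can_prepare=True):
        self.name = name
        self.can_prepare = can_prepare  # stands in for real readiness checks
        self.state = "active"

    def prepare(self):
        # Phase 1: promise to complete the update if asked to commit.
        self.state = "prepared" if self.can_prepare else "failed"
        return self.can_prepare

    def commit(self):
        self.state = "committed"

    def backout(self):
        self.state = "backed_out"


def sync_point(resources):
    """Return True if all resources committed, False if all were backed out."""
    # Phase 1 (prepare): poll every participant for readiness.
    votes = [r.prepare() for r in resources]
    # Phase 2 (commit): finalize everywhere, or back out everywhere.
    if all(votes):
        for r in resources:
            r.commit()
        return True
    for r in resources:
        r.backout()
    return False
```

In the checking and savings example, if the savings database cannot prepare, `sync_point` backs out the debit as well, so neither account is changed. This sketch polls every participant even after one has failed; a real manager might stop polling at the first failure.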
If there is an error or failure during a two-phase commit procedure, it is important to advise the application of the nature of the problem so that it can assist in correcting the problem or taking other action. For example, if a synchronization point cannot be obtained because a participating file is open, then it is preferable to advise the application of the state of the file so the application can proceed with another operation and request a commit for this file later. Also, if a synchronization point is requested for a protected conversation, and the protected conversation is in an improper state to commit, then the application can endeavor to change the state of the protected conversation and subsequently request a synchronization point. Thus, it is important that the application know which of the participating resources failed and have detailed information describing the nature of the error.
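As a rough illustration of why the application needs both the identity of the failing resource and a description of the error, the sketch below maps error records to follow-up actions. The record layout `SyncPointError`, the reason codes `FILE_OPEN` and `IMPROPER_STATE`, and the action labels are all hypothetical and are chosen only to mirror the two examples given above.

```python
from dataclasses import dataclass


@dataclass
class SyncPointError:
    resource_type: str  # hypothetical labels: "file" or "conversation"
    resource_name: str  # identifies which participant failed
    reason: str         # hypothetical reason code


def plan_recovery(errors):
    """Choose a follow-up action for each failing resource."""
    actions = {}
    for err in errors:
        if err.resource_type == "file" and err.reason == "FILE_OPEN":
            # Proceed with other work and request a commit for this file later.
            actions[err.resource_name] = "retry_commit_later"
        elif err.resource_type == "conversation" and err.reason == "IMPROPER_STATE":
            # Change the conversation state, then request a new sync point.
            actions[err.resource_name] = "fix_state_then_retry"
        else:
            actions[err.resource_name] = "report"
    return actions
```

Without the resource name and type in each record, the application could not tell which of several participants to act on.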
As noted above, different types of resources can be accessed by an application. Different types of managers of the different resources can have different protocols for responding to failures occurring during a synchronization point. In the case of the prior art VM Shared File System, the application can provide the address of a location in the application execution environment to store a copy of the error information. If an error arises, then the error information is automatically transmitted from the resource manager to this location. The information includes one or more error descriptions and identifies the resource which failed. In this example, the application is familiar with the format of the information furnished by the resource.
In the prior art SQL/DS Relational Data Base System, when an error occurs during a work request involving the SQL/DS system, a manager within the SQL/DS system detects the error and transmits detailed error information to a memory space known to the distributed application's environment. Next, the application can read and analyze the error information from the memory space referenced above. Other resources and resource managers exist with other, different protocols, and new resource managers will have their own protocols optimized for their own purposes.
Also, in the prior art, if an application initiates a protected conversation to a communications partner, and the protected conversation subsequently fails due, for example, to a loss of communication, the VTAM communications facility detects the failure, and transmits an error return code to the application. The error return code indicates the existence of a failure and the cause of failure. The application program knows which partner failed because this prior art system supported commands to a single partner only.
Also according to the prior art, the resources and protected conversations are treated independently with respect to error return codes and detailed error information.
Accordingly, a general object of the present invention is to provide an operating system which coordinates the collection of information from heterogeneous resources describing errors in a synchronization point.
Another object of the present invention is to provide an operating system of the foregoing type which coordinates the distribution of the detailed information, especially the resource type and the name, or identification, of any failing resource, to an initiating distributed application.
Another object of the present invention is to provide an operating system of the foregoing type which does not affect system performance if no errors occur.
Another object of the present invention is to provide operating systems of the foregoing types which are compatible with the architecture and design of existing resource managers.
Still another object of the present invention is to provide an operating system of the foregoing type which permits prior art applications that access the VM Shared File System and SQL/DS system described above, and other existing resource managers, to run unchanged on the operating system defined by the present invention.