This invention relates to computer operating systems. More specifically, this invention relates to techniques for handling IMS messages in the same execution environment or different execution environments. The invention provides a method for retrying a CQS PUT command, which puts messages into the coupling facility, for messages for which the PUT command failed.
The invention is embodied in a method for retrying the PUT command for a failed message using retry logic that operates on the unit of work element containing information about the failed message.
This patent application is related to the following U.S. patent documents:
U.S. Pat. No. 5,410,684, entitled "LOG NAME EXCHANGE FOR RECOVERY OF PROTECTED RESOURCES" filed Sep. 20, 1993 by M. K. Ainsworth et al, continuation of Ser. No. 525,430 filed May 16, 1990;
U.S. Pat. No. 5,363,505, entitled "LOCAL AND GLOBAL COMMIT SCOPES TAILORED TO WORK UNITS" filed Jun. 9, 1993 by B. A. M. Maslak et al, continuation of Ser. No. 525,426 filed May 16, 1990;
U.S. Pat. No. 5,319,774, entitled "RECOVERY FACILITY FOR INCOMPLETE SYNC POINTS FOR DISTRIBUTED APPLICATION" filed May 16, 1990 by M. K. Ainsworth et al;
U.S. Pat. No. 5,436,736, entitled "COUPLING FACILITY FOR RECEIVING, COMMANDS FROM PLURALITY OF HOSTS FOR ACTIVATING SELECTED CONNECTION PATHS TO I/O DEVICES AND MAINTAINING STATUS THEREOF" filed Oct. 18, 1994 by D. A. Elko et al;
U.S. Pat. No. 5,706,432, entitled "MECHANISM FOR RECEIVING MESSAGES AT A COUPLING FACILITY" filed Jun. 7, 1995 by D. A. Elko et al;
U.S. Pat. No. 5,561,809, entitled "IN A MULTIPROCESSING SYSTEM HAVING A COUPLING FACILITY, COMMUNICATING MESSAGES BETWEEN THE PROCESSORS AND THE COUPLING FACILITY IN EITHER A SYNCHRONOUS OPERATION OR AN ASYNCHRONOUS OPERATION" filed Apr. 11, 1995 by D. A. Elko et al.
The above listed patents and the present application are owned by one and the same assignee, International Business Machines Corporation of Armonk, N.Y. and are incorporated herein by reference.
The present invention can be used in a network of computers that form part of a distributed computer system. Such a distributed computer system typically includes a central host computer and a plurality of virtual machines or other types of execution environments. A real machine includes a central processor and associated virtual machines. Within each such real machine a central computer, that includes the central processor, manages central resources of the real machine including a large memory and communication facilities. The central processor controls the access between the virtual machines and the resources so that each virtual machine appears to be a separate computer. The real machines may in turn be interconnected through a network into a global network to enable communications between applications running in execution environments belonging to different real machines. Each virtual machine is provided with its own conversation monitor system (CMS) to interact with (i.e., receive instructions from and provide prompts for) users of the virtual machine. CMS is a portion of the system control program. Certain resources such as shared file system (SFS) and shared structured query language (SQL) relational databases may be accessed by any user of the virtual machine and the host.
Each such system is a real machine. Two or more real machines can be connected to form a network, and data can be transferred using communications between virtual machines belonging to different real machines. Such a transfer is made via communication facilities such as AVS Gateway and VTAM facilities ("AVS Gateway and VTAM" are trademarks of IBM Corp. of Armonk, N.Y.).
Applications running on any of the virtual machines may communicate with the coupling facility as well as with other applications running on the same or different virtual machines. Applications communicate by sending a message to the coupling facility. Like files and databases, communications are also protected resources.
An application can make changes to a database, file resource, or state of communication by first making a work request defining the changes. In response to a request for a change, provisional changes are made in shadow files while the original database or file is unchanged. When changes are made to shadow files, they are not committed. The application has the option of requesting that the changes be committed to validate the shadow file changes. Thereby, the changes made to the shadow file are transferred to the original file.
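The shadow-file behavior described above can be illustrated with a minimal Python sketch. The class name `ShadowedResource` and its methods are hypothetical, chosen only for illustration; they do not correspond to any actual SFS, SQL, or IMS interface.

```python
class ShadowedResource:
    """Hypothetical resource whose changes are staged in a shadow copy."""

    def __init__(self, data):
        self.original = dict(data)   # committed state
        self.shadow = None           # provisional state, if any

    def request_change(self, key, value):
        # Provisional changes go to the shadow copy; the original is untouched.
        if self.shadow is None:
            self.shadow = dict(self.original)
        self.shadow[key] = value

    def commit(self):
        # Committing transfers the shadow changes to the original.
        if self.shadow is not None:
            self.original = self.shadow
            self.shadow = None

    def back_out(self):
        # Backing out discards the provisional changes.
        self.shadow = None
```

Until `commit` is called, readers of `original` see only the previously committed state.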
A one-phase commit procedure is often utilized to commit changes to the original file. The one-phase commit procedure consists of a command to commit changes to the resource as contained in the shadow file. When resources such as SFS or SQL resources are changed, the commits to the resources can be completed in separate one-phase commit procedures. In the vast majority of cases, all resources will be committed using separate procedures without error or interruption. However, if a problem arises during a one-phase commit procedure, some of the separate commits may have already been completed while others may not, causing inconsistencies. Such a problem can be solved only by rebuilding resources. However, the cost of rebuilding non-critical resources is more than compensated by the improved efficiency of the one-phase commit procedure.
A two-phase commit procedure is required to protect critical resources and critical communications. For example, assume that a first person's checking account is represented in a first database and a second person's savings account is represented in a second database. If the first person writes a check to the second person and the second person deposits the check in his/her savings account, the two-phase commit procedure ensures that if the first person's checking account is debited then the second person's savings account is credited or else neither account is changed. The checking and savings accounts are considered protected, critical resources because it is very important that data transfers involving the checking and savings accounts be handled reliably.
An application program can perform the two-phase commit procedure using a single command. Such a procedure consists of the following steps, or phases:
(1) During a prepare phase, each participant (debit and credit) resource is polled by the sync point manager to determine if the resource is ready to commit all changes. Each resource promises to complete the resource update if all resources successfully complete the prepare phase, i.e., are ready to be updated.
(2) During a commit phase, the sync point manager directs all resources to finalize the updates or back them out if any resource could not complete the prepare phase successfully.
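The two phases above can be sketched as follows. This is an illustrative Python model only; `Participant` and `two_phase_commit` are hypothetical names, and the function plays the role of the sync point manager in a greatly simplified form.

```python
class Participant:
    """Hypothetical protected resource taking part in a two-phase commit."""

    def __init__(self, name, can_prepare=True):
        self.name = name
        self.can_prepare = can_prepare
        self.state = "idle"

    def prepare(self):
        # Phase 1: promise to complete the update if asked to commit.
        self.state = "prepared" if self.can_prepare else "failed"
        return self.can_prepare

    def commit(self):
        self.state = "committed"

    def back_out(self):
        self.state = "backed_out"


def two_phase_commit(participants):
    """Return True if all participants committed, False if all backed out."""
    # Phase 1: poll every participant for readiness.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit everywhere or back out everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.back_out()
    return False
```

In the checking/savings example, the debit and the credit would each be a participant, so either both are committed or both are backed out.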
The above-described two-phase commit procedure ensures consistency of modification of critical resources in most cases. It is possible, however, that a message sent by the application to the coupling facility (by executing the common queues system (CQS) PUT command) fails during the last stage of the commit procedure, when all the other participants of the protected conversation have already committed their changes. In such a case, the changes that have already been made cannot be backed out because the protected resources are polled for readiness only during the first phase of the commit procedure. This problem can be solved by retrying the CQS PUT command for the failed message. If this retry succeeds, the consistency of the protected resources will be restored.
However, the conventional techniques fail to provide a method for retrying the CQS PUT procedure to restore consistency in the state of protected system resources.
It is therefore an object of this invention to provide a method for retrying the CQS PUT command for messages for which a prior CQS PUT command failed during the last phase of the commit procedure.
Specifically, it is an objective of the present invention to provide a method for handling failed common queues system (CQS) PUT requests using unit of work elements.
It is another objective of the present invention to provide a system for handling failed common queues system PUT requests.
To achieve the objectives and the advantages of the present invention there is provided a distributed computer system comprising a plurality of execution environments and a coupling facility, wherein: each of said plurality of execution environments comprises a private storage memory for storing unit of work elements; a system log data set and an online log data set for logging data related to the activity of each of said plurality of execution environments; and said coupling facility further comprises a message facility for facilitating message exchanges between an application running in one of said plurality of execution environments and another application running in one of: (i) said coupling facility; (ii) an execution environment different from said one of said plurality of execution environments; and (iii) said one of said plurality of execution environments.
Further improvements include the above distributed computer system wherein each of said stored unit of work elements has a log token pointing to a log record containing data relevant to said each of said stored unit of work elements.
Still further improvements include the above distributed computer system wherein each of said stored unit of work elements comprises a disk relative record number pointing to a disk record containing data relevant to each of said stored unit of work elements.
Still further improvements include the above distributed computer system wherein each of said plurality of execution environments further has a retry logic for handling said stored unit of work elements and recommitting failed messages to said coupling facility, wherein said failed messages correspond to each of said stored unit of work elements.
Another aspect of this invention is a distributed computer system wherein a PUT request corresponding to a common queues server for a committed message fails in a first attempt, said system comprising a means for determining if a unit of work element corresponds to the committed message, a means for flagging said unit of work element for "retry", a means for accumulating said flagged unit of work elements in a private storage memory, a means for analyzing each of said flagged unit of work elements, a means for extracting a log token from each of said flagged unit of work elements, a means for using said log token to read a specific record from an IMS log data set, and a means for executing a second common queues server PUT command to send corresponding committed message to the coupling facility using said log record.
Yet another aspect of the present invention is a distributed computer system wherein a common queues server PUT request is handled, comprising a means for determining if said common queues server PUT request failed; and a means for setting a common queues server PUT retry indicator and a log token in a corresponding unit of work element if said common queues server PUT request failed.
Still another aspect of the present invention is a distributed computer system wherein a failed common queues server PUT request is retried, said system comprising a means for reading an initial record, a means for determining if said initial record is a system log data set record or on-line log data set record, a means for performing the following sub-steps if said record is a system log data set record: (i) reading log records from system log data set using log tokens; (ii) building a data object and setting a disk relative record number in a corresponding unit of work element; a means for performing the following sub-steps if said record is an on-line data set: (iii) reading log tokens from on-line data set; (iv) building a prefix update and setting a disk relative record number in a corresponding unit of work element; (v) building request list of all unit of work elements having common queues server PUT retry indicator; a means for performing the following sub-steps if each of said unit of work elements is determined to have a log token: (vi) reading log record from on-line log dataset using said log token; (vii) processing prefix update of the disk relative record number; a means for determining if a data object disk record number exists in said each of said unit of work elements if each of said unit of work elements is determined not to have a log token; and a means for executing common queues server PUT request for a data object to a coupling facility, if said log token exists and said data object disk record number exists.
Another aspect of the present invention is a method for handling a message corresponding to failed PUT request associated with a common queues server during a second phase of a two-phase commit procedure, said method comprising a step of retrying said PUT request.
Yet another aspect of the present invention is a method for retrying a PUT request corresponding to a common queues server for a committed message, wherein said PUT request failed in a first attempt, said method comprising: determining if a unit of work element corresponds to the committed message; flagging said unit of work element for "retry"; accumulating said flagged unit of work elements in a private storage memory; analyzing each of said flagged unit of work elements; extracting a log token from each of said flagged unit of work elements; using said log token to read a specific record from an IMS log data set; and executing a second common queues server PUT command to send corresponding committed message to the coupling facility using said log record.
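The retry method just described can be sketched in Python under several stated assumptions: unit of work elements are modeled as plain dictionaries, the IMS log data set is modeled as a dictionary keyed by log token, and `cqs_put` is a caller-supplied callable standing in for the CQS PUT command. None of these names or layouts are taken from an actual IMS or CQS interface.

```python
def retry_failed_puts(uowes, log, cqs_put):
    """Retry the PUT for every flagged unit of work element.

    uowes   -- list of dicts with keys "put_retry" and "log_token"
               (hypothetical layout)
    log     -- dict mapping log token -> logged message
               (stands in for the IMS log data set)
    cqs_put -- callable(message) -> bool
               (stands in for the CQS PUT command)
    Returns the list of elements still flagged after this pass.
    """
    # Accumulate the elements that were flagged for retry.
    flagged = [u for u in uowes if u["put_retry"]]
    still_failed = []
    for u in flagged:
        token = u["log_token"]        # extract the log token
        message = log.get(token)      # read the specific log record
        if message is not None and cqs_put(message):
            u["put_retry"] = False    # the second PUT succeeded
        else:
            still_failed.append(u)    # leave flagged for a later pass
    return still_failed
```

Elements whose second PUT also fails stay flagged, so a later pass can pick them up again.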
Yet another aspect of the present invention is a method for handling a common queues server PUT request comprising steps of: determining if said common queues server PUT request failed; setting a common queues server PUT retry indicator and a log token in a corresponding unit of work element if said common queues server PUT request failed.
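As a rough illustration of this handling step, the following Python sketch marks a unit of work element when the PUT fails. The `UnitOfWorkElement` fields and the `handle_put` function are hypothetical; the actual layout of a unit of work element is not specified here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UnitOfWorkElement:
    """Hypothetical stand-in for a unit of work element."""
    message_id: str
    put_retry: bool = False          # CQS PUT retry indicator
    log_token: Optional[int] = None  # points at the log record for the message


def handle_put(uowe, message, log_token, cqs_put):
    """Issue the PUT; on failure, mark the element for retry.

    `cqs_put` is a caller-supplied callable returning True on success;
    it stands in for the CQS PUT command.
    """
    if cqs_put(message):
        return True
    # The PUT failed: set the retry indicator and remember the log token.
    uowe.put_retry = True
    uowe.log_token = log_token
    return False
```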
Yet another aspect of the present invention is a method for retrying a failed common queues server PUT request comprising steps of:
(a) reading an initial record;
(b) determining if said read initial record is a system log data set record or on-line log data set record;
(c) reading log records from system log data set using log tokens if said record in said step (b) is determined to be the system log data set record;
(d) building a data object and setting a disk relative record number in a corresponding unit of work element if said record in said step (b) is determined to be the system log data set record;
(e) reading log tokens from an on-line data set if said record in said step (b) is determined to be the on-line log data set record;
(f) building a prefix update and setting a disk relative record number in a corresponding unit of work element if said record in said step (b) is determined to be the on-line log data set record;
(g) building request list of all unit of work elements having common queues server PUT retry indicator;
(i) determining if each of said unit of work elements has a log token;
(j) reading log record from online log data set using said log token if said log token exists in step (i);
(k) processing prefix update of the disk relative record number if said log token exists in step (i);
(l) determining if a data object disk record number exists in said each of said unit of work elements, if said log token does not exist in step (i);
(m) executing common queues server PUT request for a data object to a coupling facility, if said log token exists in step (i) and said data object disk record number exists in said step (l).
Yet another aspect of the present invention is the above method, further comprising steps of: determining the number of times the common queues server PUT request has failed; comparing said number with a predetermined number of times; and aborting further retry attempts if the common queues server PUT command has failed more than the predetermined number of times.
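A bounded retry of this kind might look like the following Python sketch. The function name `put_with_retry_limit` and the `max_failures` parameter are illustrative assumptions, and `cqs_put` is again a caller-supplied stand-in for the CQS PUT command.

```python
def put_with_retry_limit(message, cqs_put, max_failures):
    """Retry `cqs_put(message)` until it succeeds or has failed more than
    `max_failures` times. Returns (succeeded, failure_count)."""
    failures = 0
    while True:
        if cqs_put(message):
            return True, failures
        failures += 1
        # Compare the failure count with the predetermined limit.
        if failures > max_failures:
            # Abort further retry attempts.
            return False, failures
```

The failure count is returned alongside the outcome so that a caller can log or report how many attempts were needed.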
Another aspect of the present invention is a computer program product for a distributed computer system wherein a PUT request corresponding to a common queues server for a committed message fails in a first attempt, said program product including a computer readable medium comprising: a computer readable code for determining if a unit of work element corresponds to the committed message; a computer readable code for flagging said unit of work element for "retry"; a computer readable code for accumulating said flagged unit of work elements in a private storage memory; a computer readable code for analyzing each of said flagged unit of work elements; a computer readable code for extracting a log token from each of said flagged unit of work elements; a computer readable code for using said log token to read a specific record from an IMS log data set; and a computer readable code for executing a second common queues server PUT command to send corresponding committed message to the coupling facility using said log record.
Yet another aspect of the present invention is a computer program product for a distributed computer system, said program product including a computer readable medium comprising: a computer readable code for determining if said common queues server PUT request failed; and a computer readable code for setting a common queues server PUT retry indicator and a log token in a corresponding unit of work element if said common queues server PUT request failed.
Yet another aspect of the present invention is a computer program product for a distributed computer system wherein a failed common queues server PUT request is retried, said program product including a computer readable medium comprising:
a computer readable code for reading an initial record;
a computer readable code for determining if said initial record is a system log data set record or on-line log data set record;
a computer readable code for performing the following sub-steps if said record is a system log data set record:
(i) reading log records from system log data set using log tokens;
(ii) building a data object and setting a disk relative record number in a corresponding unit of work element;
a computer readable code for performing the following sub-steps if said record is an on-line data set:
(iii) reading log tokens from on-line data set;
(iv) building a prefix update and setting a disk relative record number in a corresponding unit of work element;
(v) building request list of all unit of work elements having common queues server PUT retry indicator;
a computer readable code for performing the following sub-steps if each of said unit of work elements is determined to have a log token:
(vi) reading log record from on-line log dataset using said log token;
(vii) processing prefix update of the disk relative record number;
a computer readable code for determining if a data object disk record number exists in said each of said unit of work elements if each of said unit of work elements is determined not to have a log token; and
a computer readable code for executing common queues server PUT request for a data object to a coupling facility, if said log token exists and said data object disk record number exists.