The invention relates to the field of data processing and, more particularly, to a data processing system and method to allow a restart following a system failure.
In the operation of a data processing system, for example one running IBM's OS/390™ operating system available from International Business Machines Corporation, one or more resource managers are provided to manage the resources of the data processing system. The resources may include, for example, both volatile and non-volatile storage, such as online memory and direct access storage device (DASD) storage, as well as resource managers such as, for example, queue managers and database managers, which perform insert, delete, increment and decrement operations. Conventionally, such resource managers or systems are provided with a recovery log to store information needed to facilitate a restart of a resource manager in the event of a failure relating to the computer system. It will be appreciated that such a failure may relate to a loss of power or the failure of a hardware device such as on-board memory or a DASD holding a database.
U.S. Pat. No. 4,648,031 illustrates that it is known to write, at specific operating points, a recovery log that is stored in non-volatile storage. Conventionally, the recovery log comprises a chronological record of processing events that have occurred within the data processing system and, typically, identifies the units of work that have been undertaken by the data processing system. A queue manager contains a recovery manager which is provided to co-ordinate a number of recovery operations, including the recovery of log records from the recovery log which are required for effecting a restart.
Conventionally, a restart comprises a series of phases, which include a first phase commonly referred to as a status rebuild phase. During the status rebuild phase, the status of incomplete units of work is established, together with a forward log range of the recovery log that must be traversed, a backward log range of the recovery log, and a starting point for media recovery.
During a second phase, commonly known as a forward recovery phase, the recovery log is traversed forward from the starting point established during the status rebuild phase to the tail end of the recovery log. During a third phase, conventionally known as a backward recovery phase, the recovery log is traversed backward from the tail end of the log to the starting point established in the status rebuild phase.
During the forward and backward traversals, appropriate action is taken to render, for example, queues in a transaction consistent status, that is, the queues are recovered to a known condition. Any such action for a unit of work is known as a recovery process.
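The three conventional phases described above can be sketched as follows. This is a minimal illustration only, not the implementation of any particular resource manager: the record layout (`lsn`, `uow_id`, `kind`), the status names and the rule that the forward range starts at the oldest indoubt unit of work while the backward range ends at the oldest inflight unit of work are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    lsn: int      # hypothetical position of the record in the log
    uow_id: str   # unit of work the record belongs to
    kind: str     # "begin", "update", "prepare", "commit" or "abort"


def status_rebuild(log):
    """Phase 1: establish incomplete units of work and the log ranges."""
    state = {}  # uow_id -> "inflight" or "indoubt"
    first = {}  # uow_id -> lsn of its first record
    for rec in log:
        if rec.kind == "begin":
            state[rec.uow_id] = "inflight"
            first[rec.uow_id] = rec.lsn
        elif rec.kind == "prepare":
            state[rec.uow_id] = "indoubt"
        elif rec.kind in ("commit", "abort"):
            state.pop(rec.uow_id, None)
            first.pop(rec.uow_id, None)
    forward_start = min(
        (first[u] for u, s in state.items() if s == "indoubt"), default=None)
    backward_stop = min(
        (first[u] for u, s in state.items() if s == "inflight"), default=None)
    return state, forward_start, backward_stop


def forward_recovery(log, state, start):
    """Phase 2: read forward to the tail, locking indoubt updates."""
    if start is None:
        return []
    return [rec.lsn for rec in log
            if rec.lsn >= start and rec.kind == "update"
            and state.get(rec.uow_id) == "indoubt"]


def backward_recovery(log, state, stop):
    """Phase 3: read backward from the tail, backing out inflight updates."""
    undone = []
    if stop is None:
        return undone
    for rec in reversed(log):
        if rec.lsn < stop:
            break
        if rec.kind == "update" and state.get(rec.uow_id) == "inflight":
            undone.append(rec.lsn)
    return undone
```

In this sketch the cost of each traversal grows with the distance between the tail of the log and the first record of the oldest incomplete unit of work, which is why a single long-running unit of work can dominate restart time.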
It will be appreciated that the elapsed time taken to effect a restart, and hence the speed of restart processing, is important to any business. For example, if the restart of a database takes one hour, then that resource, which may be an insurance database, is not available for that hour and business cannot be conducted using the unavailable database.
In some circumstances the most significant restart variable in a transaction processing system is the time spent processing log information to provide transaction consistency and data integrity after a restart has been completed. Furthermore it will be appreciated that the introduction of old data files into a resource manager for a restart will require that these data files undergo media recovery operations, and incomplete units of work will need to be recovered or completed as part of the restart operation.
It will be appreciated that if a restart operation encounters one or more units of work that have been in progress for a relatively long period of time, for example at least a day or two or, to take an even worse example, at least a week, the forward and backward recovery times can be considerable.
For example, if it is discovered during a restart that there is a single incomplete unit of work that has been indoubt for two weeks, it can be appreciated that the restart process will take a considerable period of time or, in the worst case, a restart using that pending unit of work may not be possible as the required log data may not be available. Conventionally, during the restart process, all log records relating to the indoubt unit of work would have to be read during forward recovery to lock the incomplete updates defined by the unit of work, which prevents access to the data until the unit of work has been committed. If a unit of work is, as in this example, a number of weeks old, then prior log records for that unit of work may have been archived in off-line storage. The need to re-load and access such archived log records will further increase restart time. Even once the archived log records have been loaded, the restart may still take several hours since the log records are typically stored on tape and must be read in a serial fashion.
If a single unit of work has been incomplete for two weeks and has a status of Inflight, again restart may take a considerable period of time, that is, restart may involve an extended backward recovery phase, or a restart may not be possible. During the restart process, all log records relating to the Inflight unit of work will have to be read during backward recovery to back out all of the updates defined by that unit of work. Again, as described above in relation to extended forward recovery times, there may be a need to retrieve old log records from an archive that is stored on magnetic tape.
It is an object of the present invention to mitigate at least some of the problems of the prior art.
Accordingly, a first aspect of the present invention provides a data processing method for a data processing system having a recovery log storing log records that can be used during recovery from a failure of the data processing system, the method comprising the steps of:
retrieving a unit of work from the recovery log;
determining whether or not the unit of work meets at least one predetermined criterion; and
removing the unit of work from the recovery log if the unit of work met the predetermined criterion.
Preferably, an embodiment is provided in which the predetermined criterion relates to the age of the unit of work.
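The retrieve, determine and remove steps above, with age as the predetermined criterion, might look like the following sketch. The `UnitOfWork` type, its `started` field and the function name are hypothetical; the text does not specify how the age of a unit of work is recorded.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class UnitOfWork:
    uow_id: str
    started: datetime  # assumed to be recorded when the unit of work began


def remove_old_units_of_work(recovery_log, now, max_age):
    """Split the log into units of work to keep and units of work removed.

    A unit of work meets the criterion, and is removed, when its age
    (now - started) exceeds max_age.
    """
    kept, removed = [], []
    for uow in recovery_log:
        if now - uow.started > max_age:
            removed.append(uow)
        else:
            kept.append(uow)
    return kept, removed
```

Returning the removed units of work as well as the retained ones is a design choice of the sketch, so that a caller could report or archive them rather than discard them silently.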
Whether or not a unit of work is removed from a recovery log may depend upon that unit of work meeting a further criterion. Suitably, an embodiment provides a method further comprising the step of outputting a message relating to the unit of work requesting an indication of any preferred course of action for that unit of work; and receiving an input identifying the preferred course of action in relation to that unit of work.
It will be appreciated that the above step of outputting may output the message to a display device and solicit input from a user, or the message may be output to a message queue to solicit a response from an application.
Accordingly, a first aspect of the present invention provides a data processing method for facilitating a restart within a data processing system following a failure, the data processing system comprising, within persistent storage, a recovery log containing recovery log records which can be used during recovery from the failure of the data processing system, the log records relating to units of work undertaken by the data processing system, the method comprising the steps of:
retrieving, from the recovery log, a recovery log record relating to a unit of work;
determining whether or not the unit of work meets at least one predetermined criterion; and
performing a recovery process if the unit of work meets the predetermined criterion.
As recognised above, a significant problem associated with restart, that is, recovery from a failure, is posed by units of work that have remained incomplete, or have been performing update activities, over a significant period of time. Suitably, an embodiment provides a method in which the step of determining whether or not the unit of work meets the at least one predetermined criterion comprises the step of comparing the age of the unit of work with a threshold value.
Preferably, an embodiment provides a method in which the step of determining comprises the step of concluding that the unit of work meets the predetermined criterion if the age of the unit of work does not exceed the threshold value.
Alternatively or additionally, embodiments may comprise a method in which the step of determining comprises the step of concluding that the unit of work meets the predetermined criterion if the age of the unit of work exceeds the threshold value.
Once a unit of work has been identified as being problematical, action should be taken in relation to that unit of work to mitigate any potential adverse effects that unit of work may have on the recovery process.
Suitably, embodiments provide a method in which the step of determining comprises the steps of outputting a message comprising data relating to the unit of work; and receiving a response to the message which provides an indication of further processing to be undertaken in relation to the unit of work.
Preferably, embodiments provide a method in which the step of outputting a message comprises the step of outputting the message in a human-readable form and soliciting input of a preferred action to be performed in relation to the unit of work during the recovery process.
Alternatively or additionally, embodiments may comprise a method in which the step of outputting a message comprises the step of communicating data relating to the unit of work to an application for assessing at least one metric associated with the unit of work; and receiving a response from the application which provides an indication of a preferred action to be performed in relation to the unit of work during the recovery process.
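The exchange with an application described above could be modelled with a pair of message queues, as in the sketch below. The action names (`"commit"`, `"backout"`, `"recover"`), the message fields and the example policy are assumptions for illustration; Python's standard `queue.Queue` stands in for whatever message queue the resource manager would actually use.

```python
import queue

# Hypothetical set of actions an operator or application may request.
VALID_ACTIONS = {"commit", "backout", "recover"}


def request_action(uow, requests):
    """Output a message carrying data relating to the unit of work."""
    requests.put({
        "uow_id": uow["uow_id"],
        "age_days": uow["age_days"],
        "status": uow["status"],
    })


def receive_action(replies, default="recover"):
    """Receive the response; fall back to normal recovery if none is valid."""
    try:
        reply = replies.get_nowait()
    except queue.Empty:
        return default
    action = reply.get("action")
    return action if action in VALID_ACTIONS else default


def example_application(requests, replies, max_age_days=7):
    """Stand-in application: assesses one metric (age) and replies."""
    msg = requests.get_nowait()
    action = "commit" if msg["age_days"] > max_age_days else "recover"
    replies.put({"uow_id": msg["uow_id"], "action": action})
```

For the human-readable variant, the same `request_action` payload could instead be formatted as text on a display device and the reply read from an operator; only the transport changes, not the decision that is returned.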
Preferably, an embodiment provides a method in which the step of performing the recovery process comprises the step of effecting a predetermined action in relation to the unit of work. A preferred embodiment provides a method in which the step of effecting a predetermined action in relation to the unit of work comprises the step of forcing a commit operation in relation to the unit of work.
An alternative to creating a separate restart recovery log is afforded by embodiments that provide a method in which the predetermined action comprises removing the unit of work from the recovery log and in which the step of performing the predetermined recovery action comprises the step of performing a recovery action in relation to the recovery log having had at least the unit of work removed.
Embodiments provide a method in which the step of determining whether or not the unit of work meets a predetermined criterion comprises the step of determining whether the unit of work was pending at the time of the failure.
Preferably, embodiments may provide a method in which the step of performing the recovery process comprises the step of completing the unit of work. A preferred embodiment provides a method in which the step of performing the predetermined recovery process comprises the step of effecting a commit for the unit of work.
Advantageously, the removal of selected units of work from the recovery log allows the restart time to be significantly reduced. Preferably, the unit of work that meets the predetermined criterion undergoes a forced commit operation, that is, the unit of work is deemed to have been committed even though the unit of work may comprise updates that have yet to be completed.
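The effect of a forced commit on the amount of log that restart must read can be illustrated with a small sketch. It assumes, hypothetically, that each incomplete unit of work is tracked by the log position of its first record, so that the oldest incomplete unit of work determines how far back the log must be traversed; the function names are illustrative only.

```python
def log_start_needed(incomplete, tail_lsn):
    """Lowest log position restart must read; the tail if nothing is pending.

    `incomplete` maps a unit-of-work id to the position of its first record.
    """
    return min(incomplete.values(), default=tail_lsn)


def force_commit(incomplete, committed, uow_id):
    """Deem the unit of work committed so it no longer pins old log records.

    Any updates the unit of work had not yet completed are not applied;
    the unit of work is simply treated as committed from here on.
    """
    first_lsn = incomplete.pop(uow_id)
    committed.add(uow_id)
    return first_lsn
```

For example, with one incomplete unit of work starting at position 100 and another at position 90000 in a log whose tail is at 100000, forcing a commit of the older one moves the earliest position restart must read from 100 to 90000.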
A second aspect of the present invention provides a data processing system for facilitating a restart following a failure, the data processing system comprising, within persistent storage, a recovery log containing recovery log records which can be used during recovery from the failure of the data processing system, the log records relating to units of work undertaken by the data processing system;
means for retrieving, from the recovery log, a recovery log record relating to a unit of work;
means for determining whether or not the unit of work meets at least one predetermined criterion; and
means for performing a recovery process if the unit of work meets the predetermined criterion.
A third aspect of the present invention provides a computer program product for facilitating a restart following a failure within a data processing system, the data processing system comprising, within persistent storage, a recovery log containing recovery log records which can be used during recovery from the failure of the data processing system, the log records relating to units of work undertaken by the data processing system; the computer program product comprising a computer readable storage medium having embodied thereon:
means for retrieving, from the recovery log, a recovery log record relating to a unit of work;
means for determining whether or not the unit of work meets at least one predetermined criterion; and
means for performing a recovery process if the unit of work meets the predetermined criterion.
Other inventive aspects of the embodiments of the present invention are defined in the appended claims.
A further aspect of the present invention provides a data processing method for a data processing system comprising a recovery log containing recovery log records relating to a plurality of units of work which have influenced a system resource of the data processing system, the method comprising the steps of:
retrieving a recovery log record from the recovery log; assessing the unit of work associated with the recovery log record to determine whether or not a recovery process corresponding to the unit of work should be performed in relation to the system resource; and
performing the recovery process in relation to the system resource in accordance with the unit of work if the assessment does not indicate that the recovery process should not be performed; or
omitting to perform the recovery process in relation to the system resource if the assessment indicates that the recovery process should not be performed.
Preferably, an embodiment further provides a method in which the step of assessing comprises the step of comparing at least one metric of the unit of work with at least one threshold value.
Still further embodiments provide a method in which the step of assessing further comprises the step of concluding that the recovery process should be performed if the metric of the unit of work does not exceed the threshold value. Alternatively, embodiments provide a method in which the step of assessing further comprises the step of concluding that the recovery process should not be performed if the metric of the unit of work exceeds the threshold value.
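The assess-then-perform-or-omit steps of this further aspect can be sketched as below. The example assumes, purely for illustration, that the system resource is a set of counters, that each log record describes an increment made by an incomplete unit of work, that the recovery process is a backout of that increment, and that the metric is the age of the unit of work in days.

```python
def recover_resource(resource, records, max_age_days):
    """Back out logged increments, omitting units of work over the threshold.

    resource: dict of counter name -> value (stand-in system resource)
    records:  list of dicts with "uow_id", "key", "delta" and "age_days"
    Returns the unit-of-work ids whose recovery was performed and omitted.
    """
    performed, omitted = [], []
    for rec in reversed(records):  # backward traversal, newest record first
        if rec["age_days"] > max_age_days:
            # Metric exceeds the threshold: recovery is not performed.
            omitted.append(rec["uow_id"])
            continue
        resource[rec["key"]] -= rec["delta"]  # undo the logged increment
        performed.append(rec["uow_id"])
    return performed, omitted
```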