1. Field of the Invention
The invention relates to a data processing environment consisting of one or more application servers processing requests, particularly time-critical requests, on behalf of at least one application and, more specifically, to a method and a system for controlling restart processing after termination of one or more resource managers that are involved in managing resources needed by an application server to fulfill application requests.
2. Background of the Invention
A typical data processing environment consisting of an application server for processing time-critical transactions is shown in FIG. 1. An application server implements a collection of related services which are requested from applications via application clients. Applications, application servers, and application clients may be implemented by computer programs of any nature and are not limited to a specific type of implementation. Application, application server, and application client are only logical structures; for example, an application server could be the application that requests services from another application server or even requests services from itself.
Application servers typically run on the second tier of a three-tier system structure. In such a three-tier system structure the application client is on the first tier, requesting services from an application server on the second tier, which in turn requests services from back-end applications that are located on the third tier. The three-tier system structure is a conceptual structure. That means, even if it is typical to deploy the different tiers onto different computers, there is no need to do so; the deployment of the complete structure onto a single computer is a completely valid approach. It should be noted that a three-tier structure is just a special case of an n-tier structure. For example, if users are interacting via a Web browser, then the Web browser would be considered running on tier 0 and the associated web server which runs the application requesting services from the application server is considered running on tier 1.
Application servers are typically stateless, that is, they store any necessary state information in persistent data storage. State information is data that the application server maintains between two subsequent requests from a client so that the individual requests from the application requesting services can be correlated.
Application servers, in general, run the processing of requests as transactions in the sense of ACID transactions, which means that each request is either processed completely or not at all. A thorough presentation of transaction technology is given by Jim Gray and Andreas Reuter, “Transaction Processing: Concepts and Techniques”, Morgan Kaufmann Publishers, Inc., 1993. In the case of an error, the transaction is aborted and all resources are reset to the state they were in before the transaction was started. For example, if the access to the relational database that maintains the application server state fails, all changes made to any of the involved resources are backed out; for instance, the message in a message queue that represents the request is put back into the queue from which it was read. This requires that all resources that are used by the application server for persistent data are managed by resource managers that can participate in transactions.
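The all-or-nothing behavior described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the in-memory queue, the dictionary standing in for the persistent state, the `StateStoreError` exception, and the `fail` switch are all hypothetical and do not represent any particular resource manager.

```python
from collections import deque


class StateStoreError(Exception):
    """Raised when the (hypothetical) state store cannot be updated."""


def process_request(queue, state, fail=False):
    """Process one request as an all-or-nothing unit of work.

    `queue` is a deque of request identifiers and `state` is a dict that
    stands in for the persistent application server state. `fail` simulates
    a failing access to the state store.
    """
    message = queue.popleft()      # read the request from the queue
    snapshot = dict(state)         # remember the state before the transaction
    try:
        state[message] = "processed"
        if fail:
            raise StateStoreError("state store unavailable")
        return True                # commit: both changes become permanent
    except StateStoreError:
        state.clear()
        state.update(snapshot)     # back out the state change
        queue.appendleft(message)  # put the request back where it was read
        return False
```

After an aborted call, the queue and the state are exactly as they were before the call, so the same request can be processed again later.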
If any one of the involved software components fails, all components must be brought back to the state they were in when the failure occurred so that processing can continue. The process of bringing a failed component back is called a restart process. The time it takes to restart a component depends on how much work needs to be undone and redone to re-establish the state of the component at the time of failure.
Resource managers maintain a log that contains entries about all changes that were applied to their managed resource. The log is typically maintained on highly available disks that, for example, exploit RAID technology. In addition, the resource managers periodically take snapshots of their current state and write this information as a checkpoint into the log. This information is used for recovery from resource manager failures, such as the abnormal termination of the resource manager or the loss of the media on which the resource manager maintains the resource. Recovering from an abnormal termination of the resource manager is called crash recovery, while recovering from a media failure is called media recovery. It should be noted that media recovery is no longer considered an issue in resource manager recovery due to the high availability of the disks on which the data is maintained and is therefore not considered in the present invention. The component of resource managers that is responsible for recovering from failures is typically called the restart manager. When the resource manager is restarted after the crash, the restart manager uses the checkpoint information and the individual log entries to reconstruct the latest state; that is, the restart manager needs to determine which transactions need to be aborted, such as those executing when the crash occurred, and which committed updates need to be redone. This is done by locating the checkpoint record and processing all log records that follow the checkpoint record. Thus, the frequency with which checkpoint records are taken determines the amount of processing that is required to restart the failing component after a failure.
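The restart procedure described above can be sketched in Python as follows. The sketch rests on simplifying assumptions that are not taken from any real resource manager: the log is a list of tuples, the record layouts (`"checkpoint"`, `"update"`, `"commit"`) are hypothetical, and a checkpoint is assumed to hold only committed state with no transaction in flight when it is taken.

```python
def recover(log):
    """Rebuild the latest committed state from a log.

    Assumed (hypothetical) record layouts:
      ("checkpoint", snapshot_dict)
      ("update", txn_id, key, value)
      ("commit", txn_id)
    Returns (state, aborted_txns, records_processed).
    """
    # Locate the most recent checkpoint record and start from its snapshot.
    start, state = 0, {}
    for i, record in enumerate(log):
        if record[0] == "checkpoint":
            start, state = i + 1, dict(record[1])

    # Only the records following the checkpoint have to be processed.
    tail = log[start:]
    committed = {rec[1] for rec in tail if rec[0] == "commit"}

    aborted = set()
    for rec in tail:
        if rec[0] == "update":
            _, txn, key, value = rec
            if txn in committed:
                state[key] = value   # redo a committed update
            else:
                aborted.add(txn)     # in flight at the crash: abort it
    return state, aborted, len(tail)
```

For a log that ends with an uncommitted update, for example `recover([("checkpoint", {"a": 1}), ("update", "t2", "b", 2), ("commit", "t2"), ("update", "t3", "a", 99)])`, the update of `t2` is redone and `t3` is aborted. The third return value shows how the number of records after the last checkpoint drives the restart effort.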
In the prior art approaches, different resource managers have different policies for when to write a checkpoint record. These policies are typically specified via some global settings, such as the number of processed messages in a message queuing system or the time period between two subsequent checkpoints in a relational database management system. These settings may be fixed so that they cannot be changed by the user, as for example in the message queuing system MQSeries, where the number of processed messages is set to 1000; or these settings may be variable, as for example in the relational database management system DB2, where the time between checkpoints can be set by the user.
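The two kinds of checkpoint policy mentioned above can be contrasted with a short Python sketch. The class names, method names, and the explicit `now` argument are hypothetical and chosen for illustration only; they are not the configuration interface of MQSeries or DB2. The default of 1000 messages is the figure given above.

```python
class CountPolicy:
    """Checkpoint after a fixed number of processed messages."""

    def __init__(self, every_n=1000):
        self.every_n = every_n
        self.count = 0

    def should_checkpoint(self, now):
        # `now` is unused here; it keeps both policies interchangeable.
        self.count += 1
        if self.count >= self.every_n:
            self.count = 0
            return True
        return False


class IntervalPolicy:
    """Checkpoint when a user-defined time interval has elapsed."""

    def __init__(self, interval_seconds, start=0.0):
        self.interval = interval_seconds
        self.last = start

    def should_checkpoint(self, now):
        if now - self.last >= self.interval:
            self.last = now
            return True
        return False
```

In both cases the decision depends only on a global setting of the resource manager, not on what the application server is currently doing.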
A resource manager typically serves multiple applications such as multiple different application servers. In this case, the log is used for all changes caused by all applications. However, some resource managers, such as DB2, allow multiple instances to run. An instance consists of all the information needed to support users; it has its own catalog for managing metadata, own log, and appropriate user data. A particular application server can be associated with a particular instance. In this case, the log is only used for the operations of the particular application server.
The checkpoint record frequency settings for each of the involved resource managers determine how long it will take to recover from the crash of one or more of the resource managers. As each participating resource manager takes checkpoints independently, the maximum restart time of the application server can be calculated as the restart time of the resource manager with the longest restart time plus the restart time of the application server itself.
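The calculation stated above can be written as a small Python function. The function name, the resource manager names, and the numeric values in the usage note are illustrative assumptions, not measured data.

```python
def max_restart_time(resource_manager_times, app_server_time):
    """Return the worst-case restart time of the application server.

    `resource_manager_times` maps each resource manager name to its
    restart time; `app_server_time` is the restart time of the
    application server itself. All times use the same unit.
    """
    return max(resource_manager_times.values()) + app_server_time
```

For example, with hypothetical restart times of 12 seconds for a message queuing system and 30 seconds for a database, and 5 seconds for the application server itself, `max_restart_time({"queue": 12.0, "database": 30.0}, 5.0)` yields 35.0 seconds.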
Typically, the restart time is obtained by running simulations with the application server. Since simulation never matches real-life situations, the estimated restart time is only an approximate value.
Taking checkpoints is not only a time-consuming operation; it also slows down the processing of requests, as the resource manager must dedicate processing and I/O resources to taking the checkpoint. Since there is no correlation between the processing of the application server and the frequency of checkpoint taking by the involved resource managers, checkpoints may be taken even if they are not required.