The present invention relates in general to recovery scope systems and procedures used in large computer systems, such as large database or application servers, having dynamic multiple address spaces engaged by concurrent processes accessing multiple protected resources such as mass storage devices (disk and/or tape drives), and in particular to recovery scope management systems and methods for recovering protected resources in such an environment, during server restarts and in response to partial server failures, through the use of nested recovery techniques using stateless recovery agents.
Modern computer systems have tremendous computing power, which is needed for large e-commerce applications and other large data handling tasks involving large numbers of transactions. Such advanced computer systems are frequently involved in the reading and writing of large volumes of data to storage devices, such as large-capacity disk drives and tape drives, which are known as protected resources. Such disk drives and tape drives may contain multiple databases and/or large files that need to be accessed and/or updated regularly in a highly reliable fashion, such as by the familiar two-phase commit process. This well-documented standard process is designed to satisfy the well-known ACID test for reliable data processing and storage. ACID stands for atomicity, consistency, isolation, and durability, the four primary attributes that a transaction processing system always attempts to ensure for any transaction. Often, a single business transaction may involve the updating of more than one protected resource, for example a database record. Modern operating system and application management software, with its multiprocessor, multitasking and multithreading capabilities, is able to supervise the processing of thousands of transactions concurrently. The reliability of data storage in these transactions is assured in part by the various roles played by transaction managers and resource managers in the overall process, including meeting all of the requirements of the two-phase commit process.
In such large systems, the management and application software may achieve scalability and performance by distributing work requests to multiple address spaces or server regions. The number of server regions required can be managed or otherwise adjusted by workload management systems or monitors to ensure that installation performance goals are met. Server regions are therefore dynamically started and stopped based on workload.
In such large systems, the management and application software stops and restarts various server regions associated with transactions being written to a single storage device or to multiple storage devices. As is explained further below, advanced computer systems dynamically monitor work flow and processing loads, and allocate different controllers and servers to different transactions and operations by the use of multiple concurrent processes. All of this is needed to efficiently handle large data processing loads, including but not limited to those of e-commerce and web server environments, where the volume and type of work being performed may change or fluctuate during any given hour or from hour to hour and day to day.
The present invention is concerned with issues which arise in the restarting of server regions, and in the handling of recoveries from abnormal terminations, such as partial or complete server failures, and from shutdowns or other disturbances of the equipment and/or processes that result in locked or in-doubt transactions which must then be recovered. Such protected resource usages are normally marked in a recovery log as locked or in-doubt, or are otherwise assigned a failure status or a suspect status. In such situations, a recovery manager using conventional procedures normally attempts to recover by either completing or rolling back the affected transactions, or by otherwise restoring the data to some well-defined consistent state.
In practice, it can be difficult to determine, in all situations, when the recovery of a protected resource must occur. “Recovery during restart” methods are generally known in the art. In a typical method of this type, a recoverable component will read some hardened data (e.g., a recovery log) at server initialization to determine what recovery (if any) needs to take place. Once the recovery actions are determined, they take place during server initialization or after initialization completes. Recoveries are often made at the level of individual transactions.
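The recovery during restart method described above can be illustrated with a minimal sketch in Python. It is offered only as an illustration under stated assumptions and is not taken from any particular server product: the record layout, the `commit_decided` flag, and the function names `plan_recovery` and `recover_on_restart` are hypothetical. The sketch reads a recovery log at initialization, determines an action for each in-doubt transaction, and then applies those actions at the level of individual transactions.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    COMMITTED = "committed"
    IN_DOUBT = "in-doubt"
    ROLLED_BACK = "rolled-back"


@dataclass
class LogRecord:
    txn_id: str
    status: Status
    # Hypothetical flag: True if a commit decision was hardened before the failure.
    commit_decided: bool


def plan_recovery(log):
    """Determine what recovery (if any) is needed: one action per in-doubt record."""
    actions = {}
    for rec in log:
        if rec.status is Status.IN_DOUBT:
            actions[rec.txn_id] = "commit" if rec.commit_decided else "rollback"
    return actions


def recover_on_restart(log):
    """Apply the planned actions, completing or rolling back each transaction."""
    actions = plan_recovery(log)
    for rec in log:
        action = actions.get(rec.txn_id)
        if action == "commit":
            rec.status = Status.COMMITTED
        elif action == "rollback":
            rec.status = Status.ROLLED_BACK
    return actions


# Example log as it might be read at server initialization (illustrative data).
log = [
    LogRecord("t1", Status.COMMITTED, True),
    LogRecord("t2", Status.IN_DOUBT, True),
    LogRecord("t3", Status.IN_DOUBT, False),
]
actions = recover_on_restart(log)
```

In this sketch the already committed transaction is left untouched, while each in-doubt transaction is either completed or rolled back, which corresponds to the single address space case in which this approach works well.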
While the recovery during restart method works for simple environments where the server consists of a single address space, the same conventional recovery methods have problems when the server model is extended to multiple processes, such as are found in multiple controller cluster arrangements where each controller typically has one or more server regions. Two significant problems arise during recovery efforts in such environments. First, recoverable components typically attempt to perform recovery during the initialization of each address space. If a new address space is created on behalf of a particular server to handle an increase in workload, for example, the recovery action can adversely affect work that is in-flight and executing in a different address space of the server. In the worst case, this action can produce data integrity problems that might need to be addressed and corrected manually, which is time-consuming, error-prone and expensive.
The second significant problem arises when components that perform recovery at server restart are not able to perform recovery in the event of a partial server failure. In particular, if a single address space or thread pool contained within a server fails, recovery actions required to put protected resources into a consistent state may not execute until the entire server is restarted. In the case of transactional resources, it sometimes happens that application-related data will be locked by a resource manager indefinitely. In other words, in a “recovery on restart” approach, the initialization of the Nth servant region (SR) attempts to perform recovery of protected resources, in accordance with the common practice in the server industry. Such a recovery normally involves examining the server's recovery log and resolving work contained within that log, again in accordance with conventional recovery practices used in single address space environments. It has been found that at times, in this kind of multiple server region cluster arrangement, work that the initializing process was attempting to resolve was also currently executing in another process of the server, and thus recovery was adversely affected.
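The two problems described above can be made concrete with a small Python sketch. It is a deliberately naive model written only for illustration: the class `NaiveServer`, its method names, and the region names `SR1` and `SR2` are hypothetical, and rolling back every unresolved log entry is an assumed simplification of conventional recovery. Each region start runs recovery over the entire shared recovery log, and a region failure triggers no recovery at all.

```python
class NaiveServer:
    """Toy model of a multi-region server using recovery on restart only."""

    def __init__(self):
        self.unresolved = {}      # txn_id -> owning region, as held in the recovery log
        self.live_regions = set()
        self.rolled_back = []     # transactions resolved by initialization-time recovery

    def start_region(self, region):
        # Conventional practice: every initializing region resolves all logged work,
        # without checking whether another live region still owns that work.
        self.rolled_back.extend(self.unresolved)
        self.unresolved.clear()
        self.live_regions.add(region)

    def begin(self, region, txn_id):
        self.unresolved[txn_id] = region

    def fail_region(self, region):
        # Partial failure: no recovery runs until the whole server is restarted.
        self.live_regions.discard(region)


server = NaiveServer()
server.start_region("SR1")
server.begin("SR1", "t1")           # t1 is in-flight in SR1

# First problem: SR2 starts to absorb workload and rolls back SR1's in-flight work.
server.start_region("SR2")
problem_one = list(server.rolled_back)

# Second problem: SR2 fails, and its transaction remains unresolved (locked).
server.begin("SR2", "t2")
server.fail_region("SR2")
problem_two = dict(server.unresolved)
```

In this model, `t1` is rolled back even though `SR1` is still running, and `t2` stays unresolved after `SR2` fails because nothing triggers recovery short of a full server restart.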
Accordingly, there is a need to overcome these two problems in a multiple process computer environment having multiple server regions, so that work can more often be successfully recovered in a rapid and preferably fully automatic way to bring the affected protected resources back online more quickly, while at the same time reliably recovering all transactions and/or data that can be restored to a consistent state.