1. Field of the Invention
The present invention relates generally to computer systems, and in particular to multiprocessing systems. Even more specifically, the invention relates to state tracking and recovery in multiprocessing computing systems.
2. Background Art
Multiprocessor computer systems are becoming increasingly important in modern computing because combining multiple processors increases processing bandwidth and generally improves throughput, reliability and serviceability. Multiprocessing computing systems perform individual tasks using a plurality of processing elements, which may comprise multiple individual processors linked in a network, or a plurality of software processes or threads operating concurrently in a coordinated environment.
Many early multiprocessor systems consisted of multiple, individual computer systems, referred to as partitioned systems. More recently, multiprocessor systems have been formed from one or more computer systems that are logically partitioned to behave as multiple independent computer systems. For example, a single system having eight processors might be configured to treat each of the eight processors (or multiple groups of one or more processors) as a separate system for processing purposes. Each of these “virtual” systems would have its own copy of an operating system, and may then be independently assigned tasks, or may operate together as a processing cluster, which provides for both high speed processing and improved reliability.
The International Business Machines Corporation zSeries servers have achieved widespread commercial success in multiprocessing computer systems. These servers provide the performance, scalability, and reliability required in “mission critical environments.” These servers run corporate applications, such as enterprise resource planning (ERP), business intelligence (BI), and high performance e-business infrastructures. Proper operation of these systems can be critical to the operation of an organization. It is therefore of the highest importance that they operate efficiently and as error-free as possible, and that problem analysis and recovery from system errors be rapid.
In normal operation, a partitioned system operates in parallel, that is, the operations being performed by the partitions can occur simultaneously as the partitions share the operational resources of the server. With everything functioning properly, the various partitions, which may be operating using different operating systems, perform their functions simultaneously.
There are certain critical functions, however, that require serialization of the system for a short period of time. Serialization is the forcing of operations to occur in a serial, rather than parallel, fashion, even when the operations could be performed in parallel. Serialization is typically mandatory when the correctness of the computation depends upon or might depend upon the exact order of computation, or when an operation requires uninterrupted use of otherwise shared hardware resources (e.g., I/O resources) for a brief time period.
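As a purely illustrative sketch of the serialization just described, and not a depiction of any zSeries mechanism, the following Python fragment forces several concurrently running threads to update a shared resource one at a time. The names `shared`, `serialize` and `worker` are hypothetical and are introduced only for this example.

```python
import threading

# Hypothetical shared resource and the lock that serializes access to it.
shared = {"total": 0}
serialize = threading.Lock()


def worker(increments: int) -> None:
    """Update the shared resource; each update runs serially under the lock."""
    for _ in range(increments):
        with serialize:  # only one thread at a time may execute this section
            shared["total"] += 1


# Four threads run in parallel, but their updates occur in a serial fashion.
threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The threads could in principle perform their work in parallel, but because the correctness of the total depends on each update completing without interference, the lock forces the updates to occur in sequence.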
The IBM zSeries server product line provides Enterprise Level Computing solutions which place great importance on maintaining a very high level of system availability and thus on recovering from system errors. The zSeries Channel Subsystem (CSS) has matured to support large I/O configurations, but because of this, increased time may be needed to recover the I/O Subsystem when the system encounters an error.
This CSS maintains a logical representation of the system's I/O Configuration state via internal data structures or control blocks. These control blocks are used to serialize Processing Unit (PU) operations in a Multi-Processing (MP) environment and contain state information for the various operations and tasks that the CSS executes.
A PU executing an I/O operation will acquire and release locks on control blocks as part of I/O processing. If a PU fails during an I/O operation, it is necessary to locate and recover the control blocks held by the failing PU. The current CSS recovery design employs a “scan” recovery method that examines all I/O control blocks in the system configuration, looking for control blocks that were in use by the failing PU. This method is time consuming because all I/O control blocks must be scanned and evaluated to locate the few that actually require recovery.
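The cost of such a scan can be illustrated with a minimal Python sketch. It is not the CSS implementation; the `ControlBlock` type, its `lock_owner` field and the `scan_recover` function are assumptions made solely for illustration, and the "recovery" performed is reduced to releasing the lock.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ControlBlock:
    """Hypothetical stand-in for one I/O control block."""
    block_id: int
    lock_owner: Optional[int] = None  # ID of the PU holding the lock, if any


def scan_recover(blocks: List[ControlBlock], failed_pu: int) -> List[int]:
    """Visit every control block and release those locked by failed_pu.

    Returns the IDs of the recovered blocks. The work done is proportional
    to the total number of blocks, not to the number needing recovery.
    """
    recovered = []
    for block in blocks:  # full scan of the configuration
        if block.lock_owner == failed_pu:
            block.lock_owner = None  # placeholder for real state recovery
            recovered.append(block.block_id)
    return recovered


# 100,000 blocks, of which only three are held by the failing PU (ID 7).
blocks = [ControlBlock(i) for i in range(100_000)]
for i in (10, 5_000, 99_999):
    blocks[i].lock_owner = 7
blocks[42].lock_owner = 3  # held by a healthy PU; must be left untouched

recovered_ids = scan_recover(blocks, failed_pu=7)
```

In this sketch all 100,000 blocks are examined in order to find the three that were held by the failing PU, which is the inefficiency noted above.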
The resultant recovery times can also affect the overall system operation:
Because recovery has the highest priority, other normal operations requiring the processor doing recovery will be delayed, sometimes long enough to require additional recovery;
Other processes that require one or more of the control blocks being recovered may have to wait excessive amounts of time for the control block to be freed by recovery, again sometimes long enough to require additional recovery.
These recovery times are increasing because the number of I/O control blocks allocated per channel on the zSeries servers has increased. Specifically, the number of control blocks on the zSeries servers has increased from 512 K per system to over 7000 K per system.