1. Field of the Invention
The present invention relates generally to computer systems and, in particular, to detecting errors in computer systems by using state tracking. More specifically, the invention relates to methods and systems that are well suited for detecting such errors in multiprocessing computer systems.
2. Background Art
Multiprocessor computer systems are becoming increasingly important in modern computing because combining multiple processors increases processing bandwidth and generally improves throughput, reliability and serviceability. Multiprocessing computing systems perform individual tasks using a plurality of processing elements, which may comprise multiple individual processors linked in a network, or a plurality of software processes or threads operating concurrently in a coordinated environment.
Many early multiprocessor systems consisted of multiple individual computer systems, referred to as partitioned systems. More recently, multiprocessor systems have been formed from one or more computer systems that are logically partitioned to behave as multiple independent computer systems. For example, a single system having eight processors might be configured to treat each of the eight processors (or multiple groups of one or more processors) as a separate system for processing purposes. Each of these “virtual” systems would have its own copy of an operating system, and could then be independently assigned tasks, or could operate together with the others as a processing cluster, which provides for both high speed processing and improved reliability.
The International Business Machines Corporation zSeries servers have achieved widespread commercial success in multiprocessing computer systems. These servers provide the performance, scalability, and reliability required in “mission critical environments.” These servers run corporate applications, such as enterprise resource planning (ERP), business intelligence (BI), and high performance e-business infrastructures. Proper operation of these systems can be critical to the operation of an organization. It is therefore of the highest importance that they operate efficiently and as error-free as possible, and that problem analysis and recovery from system errors be rapid.
The IBM zSeries server product line provides Enterprise Level Computing solutions, which place great importance on maintaining a very high level of system availability and thus on recovering from system errors. The zSeries Channel Subsystem (CSS) has matured to support large I/O configurations. As a consequence, however, more time may be needed to recover the I/O Subsystem when the system encounters an error.
The CSS maintains a logical representation of the system's I/O Configuration state via internal data structures, or control blocks. These control blocks contain state information for the various operations and tasks that the CSS executes, and they also serve to serialize Processing Unit (PU) operations in a Multi-Processing (MP) environment.
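The dual role of a control block described above, holding state and serializing access by several PUs, can be illustrated with a minimal sketch. This is not the zSeries implementation: the names `ControlBlock`, `increment`, and `run_pus` are hypothetical, Python threads stand in for PUs, and an ordinary mutex stands in for whatever serialization mechanism the CSS actually uses.

```python
import threading
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ControlBlock:
    """Hypothetical control block: state for one I/O resource plus a lock."""
    name: str
    state: dict = field(default_factory=dict)
    # The lock serializes PUs that want to modify this block (assumption:
    # a plain mutex stands in for the real serialization mechanism).
    lock: threading.Lock = field(
        default_factory=threading.Lock, repr=False, compare=False
    )
    owner: Optional[int] = None  # id of the PU currently modifying the block


def increment(block: ControlBlock, pu_id: int, key: str) -> None:
    """Add one to a counter in the block while holding the block's lock."""
    with block.lock:
        block.owner = pu_id
        block.state[key] = block.state.get(key, 0) + 1
        block.owner = None


def run_pus(block: ControlBlock, pu_count: int, updates_per_pu: int) -> None:
    """Simulate several PUs updating the same control block concurrently."""
    def work(pu_id: int) -> None:
        for _ in range(updates_per_pu):
            increment(block, pu_id, "ops")

    threads = [threading.Thread(target=work, args=(i,)) for i in range(pu_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because every update runs under the block's lock, calling `run_pus(block, 4, 1000)` leaves the `"ops"` counter at 4000 regardless of how the threads interleave.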
A large multiprocessor computer system, such as an IBM zSeries server, maintains a large state space in data structures (control blocks). Each task in this system modifies a (small) portion of this state. If a task, due to a hardware failure or a code bug, makes an erroneous or incomplete modification to that state, the error may go unnoticed for an indefinite amount of time (until the state is inspected again by a subsequent task). The affected item of the state space may concern one or more components of the system (devices, etc.).
In the past, there was no way of quickly determining which portions of the large state space were currently active (in the process of being modified). When an error occurred, the entire state space had to be assumed to be inconsistent. As a result, this entire state space had to be scanned for activity in order to bring it back to a consistent state.
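The cost of this prior approach can be seen in a small sketch. It is an illustration only, under the assumption that each control block carries a hypothetical `busy` flag that is set while a task is modifying it. The names `ControlBlock` and `full_scan_recover` are invented for this example and do not come from the zSeries code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ControlBlock:
    """Hypothetical control block with a marker for in-progress modification."""
    name: str
    busy: bool = False  # assumption: set while a task is modifying the block


def full_scan_recover(blocks: List[ControlBlock]) -> Tuple[int, List[str]]:
    """Inspect every block and reset those left busy by a failed task.

    Returns the number of blocks inspected and the names of those repaired.
    Without knowledge of which blocks were active, every block is visited.
    """
    inspected = 0
    repaired = []
    for block in blocks:
        inspected += 1
        if block.busy:
            block.busy = False  # stand-in for restoring a consistent state
            repaired.append(block.name)
    return inspected, repaired
```

With 10,000 blocks of which only one was being modified when the error occurred, the scan still inspects all 10,000 to find and repair that single block, so recovery time grows with the size of the configuration rather than with the amount of work actually in flight.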