1. Field of the Invention
The present invention relates in general to computer systems, and in particular to multiprocessing computer systems. Even more specifically, the invention relates to methods and systems for executing recovery in non-homogeneous multiprocessor environments.
2. Background Art
Multiprocessor computer systems are becoming increasingly important in modern computing because combining multiple processors increases processing bandwidth and generally improves throughput, reliability and serviceability. Multiprocessing computing systems perform individual tasks using a plurality of processing elements, which may comprise multiple individual processors linked in a network, or a plurality of software processes or threads operating concurrently in a coordinated environment.
Many early multiprocessor systems consisted of multiple individual computer systems, referred to as partitioned systems. More recently, multiprocessor systems have been formed from one or more computer systems that are logically partitioned to behave as multiple independent computer systems. For example, a single system having eight processors might be configured to treat each of the eight processors (or multiple groups of one or more processors) as a separate system for processing purposes. Each of these “virtual” systems would have its own copy of an operating system, and may then be independently assigned tasks, or may operate together with the others as a processing cluster, which provides for both high-speed processing and improved reliability.
The International Business Machines Corporation zSeries servers have achieved widespread commercial success in multiprocessing computer systems. These servers provide the performance, scalability, and reliability required in “mission critical environments.” They run corporate applications, such as enterprise resource planning (ERP), business intelligence (BI), and high-performance e-business infrastructures. Proper operation of these systems can be critical to the operation of an organization. It is therefore of the highest importance that they operate efficiently and as error-free as possible, and that problem analysis and recovery from system errors be rapid. It may be noted that logical partitioning on an IBM zSeries server means that the physical processors are virtualized. This means that the system can be configured to treat each of the virtual processors (or multiple groups of one or more virtual processors) as a separate system for processing purposes.
A large multiprocessor system, such as the IBM zSeries servers, maintains a large state space in data structures, many of which are usually shared. Each task in the system modifies a (small) portion of the overall state. Due to a hardware or code error, such a task may make an erroneous or incomplete modification of the state. The affected item of the state space may in turn affect a single component or multiple components of the system. In any case, an effective recovery action is required to restore consistency.
The traditional approach is to first collect a system wide overview of the pending recovery actions to be performed. A single processor then executes the recovery, while the other affected ones are kept in a secure state. While this approach is suitable for small and homogeneous systems, it usually cannot be applied to large, non-homogeneous systems. There are two reasons for that:
A single processor that is technically able to perform all recovery actions would be required. However, in large systems, usually not all processors have the same capabilities, so a single processor capable of performing all possible kinds of recovery actions often does not exist.
Overall recovery execution time is a problem in large systems, since all processors affected by the error are unresponsive to outside requests while the recovery is performed. Therefore, parallel execution of recovery for the affected processors is required in order to keep the recovery execution time to a minimum.
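The two limitations above can be illustrated with a minimal sketch. It is not taken from any actual server firmware: the processor names, the capability labels ("general", "io", "crypto"), the durations, and the helper functions are all hypothetical, and the parallel estimate assumes that each affected processor performs its own pending recovery actions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryAction:
    processor: str  # processor affected by the error (hypothetical name)
    kind: str       # capability needed to perform the action (hypothetical label)
    duration: int   # illustrative cost in arbitrary time units


def find_single_executor(capabilities, actions):
    """Return a processor able to perform every pending action, or None."""
    needed = {action.kind for action in actions}
    for name in sorted(capabilities):
        if needed <= capabilities[name]:
            return name
    return None


def serial_time(actions):
    """Time needed when one processor performs all actions one after another."""
    return sum(action.duration for action in actions)


def parallel_time(actions):
    """Time needed when each affected processor handles its own actions."""
    per_processor = {}
    for action in actions:
        per_processor[action.processor] = (
            per_processor.get(action.processor, 0) + action.duration
        )
    return max(per_processor.values(), default=0)


# Assumed non-homogeneous configuration: no processor has every capability.
capabilities = {
    "cp0": {"general"},
    "io0": {"io"},
    "cr0": {"crypto"},
}
pending = [
    RecoveryAction("cp0", "general", 4),
    RecoveryAction("io0", "io", 3),
    RecoveryAction("cr0", "crypto", 5),
]

print(find_single_executor(capabilities, pending))  # None
print(serial_time(pending), parallel_time(pending))  # 12 5
```

In this assumed configuration no single processor can act as the sole recovery executor, and even if one could, the serial approach would keep the affected processors unresponsive for 12 time units rather than 5.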