The invention relates generally to computer systems, and deals more particularly with error recovery in a hierarchical storage system.
Hierarchical storage systems are known for storing information in computer systems. Typically, a hierarchical storage system includes a number of levels in which, for any adjacent levels, one level is subordinate to the other.
For example, the incorporated patent applications describe a computer system with multiple processors, main memory and direct access storage, and a cache system interposed between the multiple processors on the one hand and the main memory and direct access storage on the other hand. Each processor is served by a respective one of a plurality of first level (L1) cache subsystems for storing data or instructions. All L1 cache subsystems are coupled to a higher level (L2) cache subsystem containing data or instructions for the plurality of L1 cache subsystems. Main memory (level 3, or L3) and direct access storage are coupled to the L2 cache subsystem through a storage controller (SC).
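By way of illustration only, the arrangement of levels described above can be sketched in software. The sketch below is not taken from the incorporated applications. The names `Level`, `read` and `build_hierarchy` are hypothetical, the storage controller and direct access storage are omitted, and the miss handling (fetch from the next level and retain a copy) is an assumed, simplified policy.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Level:
    """One level of the hierarchy; `backing` is the next level toward main memory."""
    name: str
    backing: Optional["Level"] = None
    lines: Dict[int, str] = field(default_factory=dict)

    def read(self, address: int) -> str:
        # Hit: this level already holds the requested line.
        if address in self.lines:
            return self.lines[address]
        # Miss with no further level to consult.
        if self.backing is None:
            raise KeyError(f"address {address:#x} not present in {self.name}")
        # Miss: fetch from the backing level and retain a copy (assumed policy).
        value = self.backing.read(address)
        self.lines[address] = value
        return value


def build_hierarchy(num_processors: int) -> List[Level]:
    """Return one L1 per processor, all sharing a single L2 backed by L3."""
    l3 = Level("L3 main memory")
    l2 = Level("L2 shared cache", backing=l3)  # storage controller path is implied
    return [Level(f"L1 cache for processor {i}", backing=l2)
            for i in range(num_processors)]
```

In this sketch, a read issued through one processor's L1 leaves a copy in the shared L2 but not in any other processor's L1, which reflects why the L2 level is used in common by all processors.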
The trend toward multi-processing in modern computer systems and the need for reliability and availability of parallel processors have placed substantial demands on hierarchical storage systems. In order to enhance reliability and availability, many multi-processor designs include instruction-level retry to recover from sporadic, intermittent hardware failures. With the unremitting evolution of modern computer technology driving more and more circuits into smaller and smaller configurations, processor designs are becoming increasingly complex. In addition, pipelining and parallel operations are provided to improve processor performance, at the cost of increasing the complexity of normal instruction execution sequences. This increased functional complexity makes instruction retry extraordinarily difficult, particularly in a hierarchical storage system where storage subsystem levels are used in common by multiple independent processors or by multiple concurrent operations, or by both.
One proposed technique for identifying and recovering from hardware errors in pipelined processing computer systems is taught in U.S. Pat. No. 4,924,466, commonly assigned with this application, and incorporated herein by reference. In the '466 patent, a multi-processing, pipelined computer system with a hierarchical multi-level storage system is partitioned into retry domains. Each retry domain comprises hardware devices and a trace array. The trace array is a record of the execution of a sequence of events that provides a history of an operation occurring in a retry domain. When an error is detected, the storage system is quiesced. In this regard, "quiescing" refers to the process of bringing processing to a halt by rejecting new requests for command execution. Following quiescence of the storage system, recovery is conducted by a service processor (SP). In the incorporated '466 patent, the trace arrays form a hierarchical structure with entries that are linked by an event trace ID. Such linking underpins recovery of the linked retry domains by cooperative operation of the domains.
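The concepts of a retry domain, a trace array, and quiescing can be illustrated with a minimal sketch. The sketch is not the implementation of the '466 patent. The names `TraceEntry`, `RetryDomain`, `submit`, `quiesce` and `linked_history` are hypothetical, and hardware devices and service processor recovery are not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TraceEntry:
    event_trace_id: int  # links entries for one operation across retry domains
    event: str


@dataclass
class RetryDomain:
    name: str
    quiesced: bool = False
    trace_array: List[TraceEntry] = field(default_factory=list)

    def submit(self, event_trace_id: int, command: str) -> bool:
        """Accept and trace a command, or reject it if the domain is quiesced."""
        if self.quiesced:
            return False  # quiescing: new requests for execution are rejected
        self.trace_array.append(TraceEntry(event_trace_id, command))
        return True


def quiesce(domains: List[RetryDomain]) -> None:
    """Bring processing to a halt by making every domain reject new requests."""
    for domain in domains:
        domain.quiesced = True


def linked_history(domains: List[RetryDomain],
                   event_trace_id: int) -> Dict[str, List[str]]:
    """Collect, per domain, the traced events that share one event trace ID."""
    return {
        domain.name: [entry.event for entry in domain.trace_array
                      if entry.event_trace_id == event_trace_id]
        for domain in domains
    }
```

In this sketch, quiescing is applied to all domains together, so it does not show separate quiescing of individual levels. A recovery routine could call `linked_history` after `quiesce` to gather the events recorded for a failing operation in each domain.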
The prior art does not provide the ability to separately and independently quiesce the operations of respective levels of a hierarchical storage system at respective checkpoints at which information about the status of operations in each of the respective levels is available to recover and restart each level. It would be advantageous to restart all quiesced levels in response to a single, system-wide restart command that ensures synchronous restart of all components within a level and synchronous restart of all levels.