1. Field of the Invention
This invention relates generally to an improved system and method for performing fault recovery within a Symmetrical Multi-Processor (SMP) system having multiple processing partitions; and more particularly, relates to a system and method for isolating and handling faults within a failing partition in a manner that prevents the fault from creating a failure in a second, non-failing partition that shares at least one main memory segment with the failing partition.
2. Description of the Prior Art
Data processing systems are becoming increasingly complex. Some systems, such as Symmetric Multi-Processor (SMP) computer systems, couple two or more Instruction Processors (IPs) and multiple Input/Output (I/O) Modules to shared memory. This allows the multiple IPs to operate simultaneously on the same task, and also allows multiple tasks to be performed at the same time to increase system throughput.
As the number of units coupled to a shared memory increases, more demands are placed on the memory and memory latency increases. To address this problem, high-speed cache memory systems are often coupled to one or more of the IPs for storing data signals that are copied from main memory. These cache memories are generally capable of processing requests faster than the main memory while also serving to reduce the number of requests that the main memory must handle. This increases system throughput.
Problems result where one or more of the system's processors, whether instruction processors or I/O processors (hereafter referred to as processors and I/Os, or processor units and I/O units), has an error, and that error is capable of corrupting an area of the main memory or any other memory that is or may be shared with other still-operating processors or I/Os. Losing the entire shared memory area for all the processors when only one or a small number are failing or involved with a failure of some kind is problematic for the steady state performance and overall throughput of the computer system. Accordingly, addressing this concern is a priority in computer systems where continuous operation or maximum throughput is a requirement.
The system for which the invention was developed, and which is used in the preferred embodiment, is a Symmetrical Multi-Processor (SMP) System (sometimes called a Cellular Multi-Processing (CMP) system) that is capable of being partitioned into multiple, independent data processing systems. That is, the hardware of the System may be sub-divided into multiple processing partitions. Each of the partitions comprises predetermined processors, processor caches, peripheral devices, and portions of the main memory associated with or dedicated to the partition. A dedicated Operating System (OS) controls the hardware associated with the partition. Hardware interfaces are configured appropriately within the system to ensure that messages and data are only passed between the processors and peripheral devices within the same partition. Processing occurs within a partition relatively independently of processing that is being performed in any other partitions. Communication between partitions may occur using shared address ranges within the main memory. The specific mechanisms used to accomplish this communication are described in detail in the U.S. Patent Application entitled "Computer System and Method for Operating Multiple Operating Systems in Different Partitions of the Computer System and for Allowing the Different Partitions to Communicate with One Another Through Shared Memory", referenced above.
By assigning a shared address range to multiple partitions of a data processing system, processors within different partitions may communicate efficiently. This is desirable when multiple partitions are performing related tasks. Alternative mechanisms of communication involve messages sent through input/output devices, and do not provide the throughput that a shared-memory scheme offers. However, utilizing shared memory presents unique problems related to error recovery. If a unit within a first partition fails such that main memory data that is shared between the first partition and a second partition is corrupted, the second (non-failing) partition may also experience a fault. This makes the entire data processing system less robust.
Another complication associated with the system of the preferred embodiment involves the use of write-back, versus store-through, caches. When write-back caches are employed, a copy of any data that is updated within a processor cache is not immediately stored back to main memory. The only copy of the updated data resides within the cache until the processor flushes the cached memory segment back to the main memory. Therefore, a failure within a partition may cause the only copy of valid memory data to be lost. To minimize this risk, it is important to allow all memory operations initiated by a partition prior to the occurrence of a fault to complete, even though subsequent operations will be abandoned to prevent corruption of system data.
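The difference between the two cache policies can be illustrated with a minimal sketch. The class names and the use of a dictionary as a stand-in for main memory are assumptions made purely for illustration; they are not taken from the preferred embodiment.

```python
class StoreThroughCache:
    """Every write is stored through to main memory immediately."""

    def __init__(self, memory):
        self.memory = memory      # dict: line index -> value (stand-in for main memory)
        self.lines = {}           # cached copies

    def write(self, line, value):
        self.lines[line] = value
        self.memory[line] = value  # main memory always holds the latest data


class WriteBackCache:
    """Updated data stays in the cache until it is flushed."""

    def __init__(self, memory):
        self.memory = memory
        self.dirty = {}           # updated lines not yet stored back to main memory

    def write(self, line, value):
        self.dirty[line] = value  # the only valid copy now lives in this cache

    def flush(self):
        # Store every updated line back to main memory.
        self.memory.update(self.dirty)
        self.dirty.clear()

    def fail(self):
        # Models a fault in the owning unit: unflushed updates are lost.
        lost = sorted(self.dirty)
        self.dirty.clear()
        return lost
```

In this model, a unit that fails before calling `flush` loses every line still held in `dirty`, which is the risk the paragraph above describes.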
One way to handle errors that affect memory data residing within a range of main memory shared between multiple partitions involves designating all shared data as unusable by both partitions. Although this recovery mechanism is relatively straight-forward to implement, it may result in the loss of a memory range that is critical to applications running on the non-failing partition. This approach does not provide a resilient error recovery mechanism.
Another mechanism for handling this problem involves allowing main memory to process memory requests following the issuance of a fault notification. According to this method, main memory determines, based on the receipt of an error indication, which memory requests should be serviced and which should be discarded. Because of latency between the detection of errors within the various units of the partition and the receipt of an error indication at the main memory, it may be difficult for the memory logic to determine which memory requests to process and which to discard. This may ultimately result in corruption of memory data. Moreover, by the time requests have been received by the memory, requests from the failing unit have already entered resources such as memory queues that are shared between the failing and non-failing partitions. This makes the process of determining which requests to process and which to discard more complex.
What is needed, therefore, is a system and method for recovering from an error within a first partition without affecting a second partition that shares main memory segments with the failing partition. The system and method should isolate errors as close to the failure as possible so that requests that are unaffected by the fault may be processed while requests made after the failure indication is received may be discarded.
In general, this invention provides an improved Symmetrical Multi-Processor (SMP) data processing system and is particularly related to SMP systems having improved fault-handling capabilities. The invention is particularly geared toward providing a fault handling system for a multi-partition data processing system having multiple partitions that communicate via a shared main memory. Different forms of fault can call for variation in the process of fault handling and recovery in such systems. Elements of the invention provide for variable recovery with the goals of reducing or eliminating corruption of memory data and providing resilient error recovery. The kinds of errors or faults tracked by this system can be thought of as critical errors because they indicate unreliability of the system having the fault.
The present invention is particularly applicable to a hierarchical, multi-level memory system that keeps track of all cache lines of data in a main memory, whether or not the owner of a cache line is in a local processor's cache away from the main memory, and whether or not the main memory is distributed across multiple Main Storage Units, each subdivided into "memory clusters", as in the preferred embodiment.
(Main Storage Units are also called MSUs, and each MSU in the preferred embodiment may be populated with up to 4 "memory clusters", and as is shown these are organized into a main memory system in the preferred embodiment SMP system. A "cache line" is a unit in the preferred embodiment representing 64 bytes, although any organizational unit size into which a computer system's main memory is organized could be employed. In our case, because the memory is organized into 64-byte chunks, i.e., cache lines, each of these has a directory entry, and 64 bytes is the size of a typical unit in which information is moved in our preferred embodiment system.)
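As a rough illustration of the bookkeeping this implies, the sketch below maps a byte address to its cache line index, assuming the 64-byte cache line of the preferred embodiment and one directory entry per line. The function names and the flat address space are assumptions for illustration only; how addresses are distributed across MSUs and memory clusters is not modeled.

```python
CACHE_LINE_BYTES = 64  # cache line size in the preferred embodiment


def cache_line_index(byte_address: int) -> int:
    """Return the index of the cache line (and directory entry) holding byte_address."""
    if byte_address < 0:
        raise ValueError("address must be non-negative")
    return byte_address // CACHE_LINE_BYTES


def directory_entries_for_range(start: int, length: int) -> int:
    """Return how many directory entries cover the byte range [start, start + length)."""
    if length <= 0:
        return 0
    first = cache_line_index(start)
    last = cache_line_index(start + length - 1)
    return last - first + 1
```

Under these assumptions, a 4096-byte range that starts on a cache line boundary is covered by 64 directory entries, while an 8-byte range that straddles a boundary touches two.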
The system should have an ability to mark the ownership state for each cache line through the tracking system (preferably a memory directory structure). In addition, the system needs to have the ability to mark each cache line as valid or invalid. The memory that keeps track of this is called a directory, and is described in U.S. Patent Application entitled "A Directory-Based Cache Coherency System", referenced above. The directory of the preferred embodiment is stored in the main memory. This record keeping allows for a more satisfactory decommissioning of bad processing units and I/O units, and allows for some continuing use of shared memory where some system processors that share the memory have not failed.
More specifically, with a system for tracking all the memory units (preferably cache lines) and where copies may reside and be valid throughout the SMP architecture, it becomes possible to isolate the errors as close in time to a failure as possible so that requests which are not affected by the fault may be processed, while requests made after the failure indication is received may be safely discarded. Also, by tracking the validity of every cache line in the system, shared memory partitions need not be entirely discarded, and failure of a single processor or I/O which may share a partition in memory need not cause other processors which may share that partition to go down.
A support processor preferably monitors the error condition of the system, and can assist in the replacement of downed processing and I/O units while other processing units and I/O units that may have shared a memory partition with the downed elements continue to operate normally without interruption so long as they have no need for cache lines owned by the downed elements, and possibly even in some instances where they do.
A process for "poisoning" the cache lines owned by elements that need to be downed because of faults is described, and the system to implement it detailed. Errors detected by the elements themselves, or by the interfacing logic connecting the processing elements to the main memory system, are reported through a reporting system to the main memory system, which poisons (that is, marks as invalid) all cache lines that are owned by the failing elements of the computer system and for which requests are currently pending in the request path to main memory. The main memory system continues to poison cache lines as required when new requests for cache lines owned by failed system processors are issued by operational system processors. Errors are detailed in a register readable by a support processor that initiates further actions to ensure all cache lines owned by the failed elements are poisoned (because the operational requesters may not access all possible lines for an indeterminate time). The support processor may provide further assistance in recovery for the non-failing elements that share the memory partition with the failing elements.
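The poisoning behavior described above can be sketched as a small software model. This is only an illustration under stated assumptions: the class names, the string unit identifiers, the "MSU" owner tag, and the string response codes are all hypothetical, and the actual mechanism is implemented in hardware.

```python
from dataclasses import dataclass, field

MEMORY_OWNER = "MSU"  # hypothetical tag meaning main memory itself owns the line


@dataclass
class DirectoryEntry:
    owner: str = MEMORY_OWNER
    poisoned: bool = False


@dataclass
class MainMemoryModel:
    directory: dict = field(default_factory=dict)  # line index -> DirectoryEntry
    failed: set = field(default_factory=set)       # units reported as failed
    pending: list = field(default_factory=list)    # (requester, line) in the request path

    def _poison_if_owner_failed(self, line):
        entry = self.directory.setdefault(line, DirectoryEntry())
        if entry.owner in self.failed:
            entry.poisoned = True
        return entry

    def report_fault(self, unit):
        """Record the fault, then poison lines the unit owns that have pending requests."""
        self.failed.add(unit)
        for _requester, line in self.pending:
            self._poison_if_owner_failed(line)

    def fetch(self, requester, line):
        """Service a new fetch request, poisoning on demand."""
        if requester in self.failed:
            return "discarded"  # requests made after the failure indication are dropped
        entry = self._poison_if_owner_failed(line)
        return "poisoned" if entry.poisoned else "ok"
```

Note that in this model a line owned by a failed unit stays unmarked until some request touches it, which is why a later sweep of the shared range is still needed.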
In the preferred embodiments the computer system processing elements are grouped into PODs (Processing Modules) with 2 Sub-POD processor units, each of which can contain 4 processors, and 2 I/O modules, each of which contains 3 PCI Bus interfaces for connection to PCI devices. In this configuration a set of 4 error indicators is maintained for each of the POD requester ports (2 Sub-PODs and 2 I/O modules), within the POD's "TCM" system. The TCM acts as a crossbar interconnect, to communicate across the 4 requester ports with 4 MSUs. An additional error indicator is kept for the TCM.
Faults that are critical are detected and reported via hardware initiated functioning. The hardware notifies the support processor of the event with a fault report. Hardware initiated functioning performs cache line poisoning for currently pending requests in the system, to cache lines owned by failed processing elements. The hardware continues to poison more cache lines as required by new requests that are received. Support processor initiated functioning forces fetch requests to the entire memory range shared by failed and operational requesters. This ensures that the hardware will see a fetch request for every possible cache line owned by a failed requester, within the entire shared memory range.
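The support processor's sweep of the shared range can be sketched as follows. The sketch assumes 64-byte cache lines and a `fetch` callable that returns the string "poisoned" or "ok"; the helper `make_memory_fetch`, the unit names, and the response strings are hypothetical stand-ins for the hardware behavior, not part of the described system.

```python
CACHE_LINE_BYTES = 64  # cache line size in the preferred embodiment


def make_memory_fetch(owners, failed_units, poisoned):
    """Build a stand-in for the memory hardware's fetch handling.

    owners: dict mapping line index -> owning unit
    failed_units: set of units reported as failed
    poisoned: set that accumulates poisoned line indices
    """
    def fetch(line):
        if owners.get(line) in failed_units:
            poisoned.add(line)  # hardware poisons the line when it sees the request
        return "poisoned" if line in poisoned else "ok"
    return fetch


def sweep_shared_range(fetch, start, length):
    """Issue one fetch per cache line in [start, start + length).

    Returns the sorted indices of lines reported as poisoned.
    """
    if length <= 0:
        return []
    first = start // CACHE_LINE_BYTES
    last = (start + length - 1) // CACHE_LINE_BYTES
    return [line for line in range(first, last + 1) if fetch(line) == "poisoned"]
```

Because every line in the range is requested exactly once, any line owned by a failed requester is seen by the poisoning logic regardless of whether an operational requester would ever have touched it.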
Failure of a subunit of the POD will cause only those cache lines owned by the failing subunit to be poisoned. The cache lines will be marked as "poisoned" in the directory (preferably maintained by the main memory system). Failure of a TCM (POD) unit causes all ports from that POD to be considered failed and all cache lines owned by the POD's processors and I/O to be marked as poisoned. In either event all functional parts of the SMP computer system continue to function while the fault handling is active. Operational processing elements that request a fetch of a poisoned cache line are notified of the poisoned state via an indication in the fetch response from the memory system. Appropriate recovery actions on a request basis may therefore be possible, but are beyond the scope of this invention.
Depending on the severity of the fault, the support processor may have to stop the failed partition, or may initiate actions to down (drop) a failing processing element from a partition that continues to function. As long as the fault is not associated with a particular MSU within the memory system itself, the remaining partitions continue to function. The support processor also provides diagnostic information to allow efficient repair of the downed elements and for their expeditious replacement.
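The difference in scope between a subunit failure and a TCM failure can be sketched as below. The port names, the `"tcm"` identifier, and the `(pod, port)` tuples used to identify owners are hypothetical labels chosen for illustration.

```python
# Hypothetical names for the 4 requester ports of a POD (2 Sub-PODs, 2 I/O modules).
PORTS = ("subpod0", "subpod1", "io0", "io1")


def failed_requesters(pod_id, failing_unit):
    """Return the set of (pod, port) requesters to treat as failed."""
    if failing_unit == "tcm":
        # A TCM failure takes down every port of that POD.
        return {(pod_id, port) for port in PORTS}
    if failing_unit in PORTS:
        # A subunit failure affects only that subunit's port.
        return {(pod_id, failing_unit)}
    raise ValueError(f"unknown unit: {failing_unit!r}")


def lines_to_poison(owners, failed):
    """Return sorted indices of cache lines whose owner is a failed requester.

    owners: dict mapping line index -> (pod, port) owner
    """
    return sorted(line for line, owner in owners.items() if owner in failed)
```

With owners spread across two PODs, a failure of one I/O module selects only that module's lines, while a TCM failure selects every line owned by any port of the affected POD and leaves the other POD's lines alone.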
When the replacement hardware is installed, and/or a failed partition is restarted, the support processor is used to initiate actions to introduce replacement hardware into a partition and to restart partition(s) stopped due to the failure. If the same MSU hardware remains in the system, the support processor may also initiate actions to reclaim the poisoned memory range at this time. The memory range may be reclaimed for the new partition or be made available to other partitions. Specific support processor methods and any alternatives are beyond the scope of this invention.
The foregoing system provides a mechanism for recovering ranges of memory that are shared between one or more failing units executing within a first processing partition and one or more other operational units executing within a second processing partition. The recovery is performed in a manner that allows the units within the second partition to continue operating despite the fault. The recovery mechanism is designed to render operational as much of the shared memory range as possible.