This invention relates generally to processing within a computing environment, and more particularly to memory error isolation and recovery in a multi-processor computer system.
In a computing system within a Peripheral Component Interconnect Express (PCIe) environment, ordering rules provide memory consistency. For example, when an input/output (I/O) adapter writes into system memory using PCIe defined Posted Memory Write Requests, the updates in memory appear in order to the system software or device driver. In a typical I/O operation, an adapter writes a block of data followed by status into system memory. This operation usually requires several Posted Memory Write Requests, and these requests must appear to the system software to be written in strict order in system memory. Therefore, if the system software polls the status while waiting for a completion, it knows that any associated data previously written in system memory is valid. Interrupts from I/O adapters are called MSIs (Message Signaled Interrupts) and appear as Posted Memory Write Requests on the PCI interface. Because interrupts are Posted Memory Write Requests, they are also ordered with respect to other Posted Memory Write Requests and are subject to the other ordering rules described below. When the program receives an interrupt from an I/O adapter, it knows that all data and status information have been written into memory and are valid.
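The ordering guarantee for Posted Memory Write Requests can be illustrated with a minimal sketch. It is a toy software model, not a description of any PCIe implementation: the class `PostedWriteChannel`, the addresses, and the status value are hypothetical names chosen for illustration. The adapter posts three data blocks followed by a status word, and the software polls the status before reading the data.

```python
from collections import deque

# Hypothetical addresses and status encoding, chosen only for illustration.
DATA_ADDRS = [0x1000, 0x1008, 0x1010]
STATUS_ADDR = 0x2000
STATUS_DONE = 1


class PostedWriteChannel:
    """Toy model: posted writes become visible strictly in issue order."""

    def __init__(self):
        self.pending = deque()  # writes issued by the adapter, not yet visible
        self.memory = {}        # system memory as observed by software

    def post_write(self, address, value):
        """Adapter side: queue a posted write (no completion is returned)."""
        self.pending.append((address, value))

    def deliver_one(self):
        """Make the oldest pending write visible; return False if none remain."""
        if not self.pending:
            return False
        address, value = self.pending.popleft()
        self.memory[address] = value
        return True


def run_io_operation():
    channel = PostedWriteChannel()

    # Adapter: write the data blocks first, then the status.
    for i, addr in enumerate(DATA_ADDRS):
        channel.post_write(addr, f"block-{i}")
    channel.post_write(STATUS_ADDR, STATUS_DONE)

    # Software: poll the status; each poll lets one more write land.
    polls = 0
    while channel.memory.get(STATUS_ADDR) != STATUS_DONE:
        channel.deliver_one()
        polls += 1

    # Because delivery is in order, every data block is already visible here.
    data = [channel.memory[addr] for addr in DATA_ADDRS]
    return polls, data
```

In this model the status can only become visible after all three data writes, so the list comprehension at the end never encounters a missing block.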
Another ordering rule guarantees that when an adapter writes data into system memory and then fetches data from the same system memory address, it observes the new data just written. Still another ordering rule guarantees that when software reads data from an adapter, perhaps just a single register, then by the time the software receives the read response, any previous Posted Memory Write data is visible in system memory. This rule is useful in synchronizing operation between the adapter and software.
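Both rules can be modeled, as a rough sketch only, by draining all earlier posted writes before a read is allowed to complete. The `Adapter` class, its `STATUS` register, and the method names are hypothetical and stand in for hardware behavior.

```python
from collections import deque


class Adapter:
    """Toy model in which reads never pass earlier posted writes."""

    def __init__(self, memory):
        self.memory = memory             # shared dict standing in for system memory
        self.pending_writes = deque()    # posted writes not yet visible
        self.registers = {"STATUS": 0}   # hypothetical adapter register

    def post_write(self, address, value):
        self.pending_writes.append((address, value))

    def _drain(self):
        # Make every earlier posted write visible, oldest first.
        while self.pending_writes:
            address, value = self.pending_writes.popleft()
            self.memory[address] = value

    def fetch(self, address):
        """Adapter reads system memory: it must observe its own earlier writes."""
        self._drain()
        return self.memory[address]

    def read_register(self, name):
        """Software reads an adapter register: the response follows all earlier
        posted writes, so they are visible once the value is returned."""
        self._drain()
        return self.registers[name]
```

A driver can therefore use a register read as a synchronization point: once `read_register` returns, every write posted before it is visible in `memory`.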
However, errors may occur as data is written into system memory. In many computer systems, these errors either stop the entire computer or leave an indication, or footprint, that the data is corrupted. This indication is often accomplished by storing the data with a bad ECC (error correcting code) in system memory, which is often referred to as a “special uncorrectable error” (special UE). As a result, even if software views a good status, when it reads the data, it will see a special UE, know that the data is corrupted, and be able to perform the appropriate recovery. If the data were not marked, the software would observe seemingly good data even though it was corrupted, which results in data integrity problems.
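The marking scheme can be sketched as follows. This is an illustrative model under stated assumptions: real hardware encodes the marker in the ECC check bits, whereas here a boolean flag per address stands in for it, and the names `Memory` and `UncorrectableError` are hypothetical.

```python
class UncorrectableError(Exception):
    """Raised when software reads a location marked with a special UE."""


class Memory:
    def __init__(self):
        # address -> (value, poisoned); the flag stands in for a bad ECC pattern.
        self.cells = {}

    def write(self, address, value, corrupted=False):
        # On a detected error, store the marker instead of the suspect data.
        if corrupted:
            self.cells[address] = (None, True)
        else:
            self.cells[address] = (value, False)

    def read(self, address):
        value, poisoned = self.cells[address]
        if poisoned:
            raise UncorrectableError(f"special UE at {address:#x}")
        return value


def read_with_recovery(memory, address):
    """Software side: a good status is not enough; the read itself reports errors."""
    try:
        return ("ok", memory.read(address))
    except UncorrectableError:
        return ("recover", None)
```

The point of the marker is that the error travels with the data: whoever reads the location later learns of the corruption, regardless of what the status word says.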
In some computer systems, including System z® servers offered by International Business Machines Corporation, certain errors in the memory subsystem neither mark the data as bad nor update the memory at all. One example is a partial memory write with an uncorrectable error in a cache. In this case, the data in the cache remains in error; however, the copy in memory is not changed and therefore contains stale data. Another example is an uncorrectable error in the storage key, reference, and change information for the corresponding data in system memory. As in the partial write case, the data in memory is not changed and therefore contains stale data.
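The hazard created by such unmarked failures can be shown with a small sketch. The addresses and the helper `write_with_unmarked_failure` are hypothetical, and the `fails` flag simply stands in for either of the error cases above.

```python
DATA_ADDR = 0x1000    # hypothetical data location
STATUS_ADDR = 0x2000  # hypothetical status location


def write_with_unmarked_failure(memory, address, value, fails):
    """Attempt a write; on failure, memory is left untouched and unmarked.

    Returns True if memory was updated.
    """
    if fails:
        return False  # no update and no special UE marker
    memory[address] = value
    return True


def stale_data_scenario():
    # Memory still holds data from an earlier operation.
    memory = {DATA_ADDR: "old", STATUS_ADDR: 0}

    # The data write fails silently; the following status write succeeds.
    write_with_unmarked_failure(memory, DATA_ADDR, "new", fails=True)
    write_with_unmarked_failure(memory, STATUS_ADDR, 1, fails=False)

    # Software sees a good status and reads data that looks valid but is stale.
    return memory[STATUS_ADDR], memory[DATA_ADDR]
```

Here the software observes a good status together with the old data, and nothing in memory indicates that the data write never took effect.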