1. Technical Field
The present invention is directed to an apparatus and method of repairing a processor array for a failure detected at runtime.
2. Description of Related Art
The IBM pSeries computing systems contain several advanced features intended to enhance system availability. One such feature is persistent deallocation of system components, such as processors and memory. Persistent deallocation provides a mechanism for marking system components as unavailable and preventing them from being configured into systems during system boot. The service processor firmware marks a component as unavailable if the component failed a test at system boot, had an unrecoverable error during run time, or exceeded a threshold of recoverable errors during run time, suggesting that it might be more susceptible to an uncorrectable error later on.
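For illustration only, the three marking conditions described above can be sketched as follows. This is a minimal sketch and not the actual service processor firmware; the `Component` record, its field names, the function names, and the threshold value are all hypothetical and chosen only to mirror the prose.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical threshold; the text does not state an actual value.
RECOVERABLE_ERROR_THRESHOLD = 3


@dataclass
class Component:
    """Hypothetical record of a system component's error history."""
    name: str
    failed_boot_test: bool = False
    unrecoverable_error: bool = False
    recoverable_errors: int = 0
    available: bool = True  # the persistent "unavailable" mark is available == False


def should_deallocate(c: Component,
                      threshold: int = RECOVERABLE_ERROR_THRESHOLD) -> bool:
    """Return True if any of the three conditions named in the text holds."""
    return (c.failed_boot_test
            or c.unrecoverable_error
            or c.recoverable_errors > threshold)


def configure_at_boot(components: List[Component]) -> List[str]:
    """Mark failing components unavailable and return the names configured."""
    configured = []
    for c in components:
        if should_deallocate(c):
            c.available = False  # the mark stays set on later boots
        if c.available:
            configured.append(c.name)
    return configured
```

In this sketch, a component with five recoverable errors would be left out of the configuration, while one with two recoverable errors would still be configured.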
Another such feature of the IBM pSeries computing systems is called dynamic deallocation for system components, such as processors and memory. This feature allows a component to be removed from use during run time should the component exceed a threshold of recoverable errors.
Processors shipped in many of the pSeries systems have internal arrays, such as L1 or L2 caches. An advanced feature of these arrays is the incorporation of extra memory capacity that can be configured on a bit-by-bit basis to replace failed array elements. Configuring this extra memory capacity allows the hardware to repair around damaged array elements and continue to function without replacement or degradation.
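The idea of steering a failed bit to spare capacity can be modeled in software as a small remapping table. The following is a simplified sketch under assumed names (`RepairableArray`, `repair`, `physical_column`); real processor arrays implement this in hardware, and the actual repair mechanism is not described at this level in the text.

```python
class RepairableArray:
    """Toy model of an array with spare bit columns (hypothetical design)."""

    def __init__(self, width: int, spares: int):
        self.width = width
        # Physical columns beyond the logical width are reserved as spares.
        self.free_spares = list(range(width, width + spares))
        # Maps a failed logical bit position to the spare column replacing it.
        self.remap = {}

    def physical_column(self, logical_bit: int) -> int:
        """Return the physical column currently backing a logical bit."""
        return self.remap.get(logical_bit, logical_bit)

    def repair(self, failed_bit: int) -> bool:
        """Steer a failed bit to a spare; return False if no spare remains."""
        if not 0 <= failed_bit < self.width:
            raise ValueError("bit position outside the array width")
        if not self.free_spares:
            return False  # array cannot be repaired
        self.remap[failed_bit] = self.free_spares.pop(0)
        return True
```

Once every spare has been consumed, `repair` reports failure, which corresponds to the case in which the array cannot be repaired around.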
Originally, these spare bits were used only when an error in an array was detected during system boot. This made the extra memory capacity feature useful for repairing processor arrays during the manufacturing process. However, for systems already shipped to the end user, the function could not be effectively utilized because, in functioning systems, array bits that go bad tend to be detected during run time rather than at system boot. As a result, the previously mentioned persistent deallocation mechanism marks the processor component as bad without ever invoking the mechanism that determines whether the array could be repaired.
Thus, it would be beneficial to have an apparatus and method for invoking the mechanism to determine if an array can be repaired and to repair the array if possible, before the processor component is marked as bad by the persistent deallocation mechanism.
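The ordering sought here, in which repair is attempted before the component is marked as bad, can be expressed as a short sketch. The function name, the dictionary-based processor record, and the `try_repair` callback are hypothetical placeholders and do not represent the claimed apparatus or method.

```python
from typing import Callable, Dict


def handle_runtime_array_error(processor: Dict,
                               try_repair: Callable[[Dict], bool]) -> str:
    """Attempt array repair first; mark the processor bad only on failure.

    `try_repair` is a hypothetical callback that returns True when the
    failed array could be repaired.
    """
    if try_repair(processor):
        processor["status"] = "repaired"     # processor remains in use
    else:
        processor["status"] = "deallocated"  # persistent deallocation applies
    return processor["status"]
```

Without the repair attempt, every processor in this sketch would go directly to the deallocated state, which is the behavior the related art describes.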