1. Technical Field
The present invention relates generally to an improved data processing system and in particular to a method and apparatus for facilitating redundancy in a data processing system. Still more particularly, the present invention relates to a method, apparatus, and computer instructions for identifying a spare processing unit in response to a failure of a processor in the data processing system.
2. Description of Related Art
As data processing systems become more advanced, their processing power increases with each new release. One increase in processing power is provided through faster and better processors. Another increase in processing power results from using multiple processors within a data processing system. One type of multi-processor system uses a multi-chip module (MCM). An MCM is a module or unit that contains multiple processor dies or chips on a single chip carrier. A chip carrier is a platform on which chips, passive components, device encapsulants, and thermal enhancement hardware are attached. These MCMs may include different numbers of chips, such as four or eight processing chips within a single MCM.
As an added feature, a data processing system often includes a spare MCM in addition to its other MCMs. This spare MCM is employed to facilitate hot sparing of processors. In some cases, a number of processors within an MCM may instead be reserved for hot sparing. In other words, these additional MCMs or processors are employed as replacements in case of a processor failure within the data processing system. The replacement processor replaces the failed one without requiring the data processing system to be restarted or reinitialized. One problem associated with this type of replacement of a failed processor is a reduction in processing efficiency. If a failed processor on one MCM is replaced with a spare processor on another MCM, the resulting scattering of the workload across MCMs may reduce the throughput or performance of applications.
The present invention recognizes that this problem occurs because of memory latency or cache affinity problems. A cache is an associative memory that serves a processor chip. Many data processing systems use L1, L2, and L3 caches to increase performance. An L1 cache is located in a processor. An L2 cache is located on a die and may be shared by all processors on the same die. An L3 cache is shared by all processors within an MCM. If the replacement processor for a failed processor is located on a different MCM, then work moved to the replacement processor cannot use the L3 cache shared by the remaining processors on the original MCM. In this manner, performance and throughput may be reduced because of this affinity problem with respect to the cache system.
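By way of illustration only, the cache affinity consideration described above may be sketched as follows. This is a minimal sketch under stated assumptions and is not the method of the present invention: the names Processor, shared_cache_rank, and pick_spare, and the fields cpu_id, mcm_id, die_id, and is_spare, are hypothetical and do not appear in this description. The sketch assumes only what is stated above, namely that processors on the same die may share an L2 cache and that processors within the same MCM share an L3 cache.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Processor:
    """Hypothetical record for one processor; all field names are assumptions."""
    cpu_id: int
    mcm_id: int            # MCM holding this processor (processors here share an L3)
    die_id: int            # die within the MCM (processors here may share an L2)
    is_spare: bool = False


def shared_cache_rank(failed: Processor, candidate: Processor) -> int:
    """Rank a candidate by the caches it shares with the failed processor.

    0 = same MCM and same die (shares L2 and L3)
    1 = same MCM, different die (shares L3 only)
    2 = different MCM (shares neither)
    """
    if candidate.mcm_id != failed.mcm_id:
        return 2
    if candidate.die_id != failed.die_id:
        return 1
    return 0


def pick_spare(failed: Processor, processors: List[Processor]) -> Optional[Processor]:
    """Return the spare with the best cache affinity, or None if no spare exists."""
    spares = [p for p in processors if p.is_spare and p.cpu_id != failed.cpu_id]
    if not spares:
        return None
    # Ties are broken by the lowest cpu_id so the choice is deterministic.
    return min(spares, key=lambda p: (shared_cache_rank(failed, p), p.cpu_id))


if __name__ == "__main__":
    failed = Processor(cpu_id=0, mcm_id=0, die_id=0)
    system = [
        failed,
        Processor(cpu_id=5, mcm_id=1, die_id=0, is_spare=True),  # spare on another MCM
        Processor(cpu_id=3, mcm_id=0, die_id=1, is_spare=True),  # spare on the same MCM
    ]
    print(pick_spare(failed, system))
```

In this sketch the spare on the same MCM is preferred over the spare on a different MCM because only the former can continue to use the L3 cache associated with the failed processor.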
Therefore, it would be advantageous to have an improved method, apparatus, and computer instructions for marking and selecting spare processors.