1. Field
The disclosure relates generally to data processing systems, such as multi-processor computer systems, and more specifically to systems and methods for repairing or updating the hardware in such systems.
2. Description of the Related Art
A multi-processor computer system includes multiple processing units. Each processing unit may include one or more processor cores. The processor cores carry out program instructions in order to operate the computer. Each processing unit may comprise one or more integrated circuit microprocessors having various execution units, buffers, memories, and other functional units, which are all formed by integrated circuitry. To facilitate repair and replacement of defective processing unit components, each processing unit may be constructed in the form of a replaceable circuit board, pluggable module, or similar field replaceable unit, which can be easily installed in or swapped out of the system in a modular fashion.
Each processor core may include one or more on-board caches implemented using high speed memory devices. Caches are commonly used to temporarily store values that might be repeatedly accessed by a processor. Use of a cache thus speeds up processing by avoiding the more time-consuming process of loading the values from system memory. A processing unit can include a second level cache that supports the lower level caches that are part of the processor cores. Additional cache levels also may be provided.
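By way of illustration only, the effect of a cache on repeated accesses can be sketched as follows. The sketch assumes a simple direct-mapped organization; the class name `DirectMappedCache`, its fields, and the use of a Python list to stand in for system memory are hypothetical and are not taken from the disclosure.

```python
class DirectMappedCache:
    """Hypothetical direct-mapped cache placed in front of a slower memory."""

    def __init__(self, memory, num_lines):
        self.memory = memory              # stands in for system memory
        self.num_lines = num_lines
        self.tags = [None] * num_lines    # which address each line holds
        self.data = [None] * num_lines    # cached value for each line
        self.hits = 0
        self.misses = 0

    def load(self, address):
        index = address % self.num_lines  # line selected by the low part
        tag = address // self.num_lines   # remainder identifies the address
        if self.tags[index] == tag:
            # Hit: the value is returned without touching system memory.
            self.hits += 1
            return self.data[index]
        # Miss: pay the cost of a system memory load, then keep the value.
        self.misses += 1
        value = self.memory[address]
        self.tags[index] = tag
        self.data[index] = value
        return value


memory = list(range(100, 164))            # 64 illustrative memory words
cache = DirectMappedCache(memory, num_lines=8)
for _ in range(3):
    for address in (0, 1, 2):
        cache.load(address)
```

In this sketch, only the first access to each address reaches system memory; the later accesses to the same addresses are served from the cache.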
Transistors forming the integrated circuits on silicon chips degrade over time and are susceptible to various errors. Transistors within arrays, such as transistor arrays forming cache memory, are particularly susceptible to such errors. Transistor errors within such an array may result in data corruption.
The control logic for a cache memory may include error correction code circuits to handle errors that arise in a cache line. A bit in a given cache block may contain an incorrect value due either to a soft error, such as a bit flip caused by stray radiation or electrostatic discharge, or to a hard error, such as a defective cell. Error correction codes can be used to reconstruct the proper data stream in the face of such errors. Some error correction codes can be used to detect and correct only single-bit errors. In this case, if two or more bits in a particular block are invalid, the error correction code might not be able to determine what the proper data stream should be, although the failure can at least be detected. Other error correction codes are more sophisticated and allow detection or correction of multi-bit errors.
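By way of illustration only, the single-bit-correct, double-bit-detect behavior described above can be sketched with an extended Hamming code over four data bits. The choice of this particular code, the eight-bit codeword layout, and the function names `encode` and `decode` are assumptions made for the example; the disclosure does not specify a particular error correction code.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Hypothetical layout: index 0 holds an overall parity bit, indices
    1, 2 and 4 hold Hamming parity bits, and indices 3, 5, 6 and 7 hold
    the data bits (least significant data bit at index 3).
    """
    code = [0] * 8
    code[3] = nibble & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    for p in (1, 2, 4):
        # Parity bit p covers every position whose index has bit p set.
        for i in range(1, 8):
            if i != p and (i & p):
                code[p] ^= code[i]
    code[0] = sum(code[1:]) % 2           # makes total parity even
    return code


def decode(code):
    """Return (nibble, status); nibble is None if uncorrectable."""
    code = list(code)
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i                 # nonzero syndrome locates a flip
    overall = sum(code) % 2               # 1 if an odd number of bits flipped
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Single-bit error: the syndrome is the failing index
        # (0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity but nonzero syndrome: two bits are wrong, so the
        # error is detected but the proper data cannot be determined.
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
single = list(word)
single[5] ^= 1                            # simulate one flipped bit
double = list(word)
double[2] ^= 1                            # simulate two flipped bits
double[6] ^= 1
```

Under these assumptions, `decode(single)` recovers the original data, whereas `decode(double)` reports only that an error occurred.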
Error correction code circuits are one way to deal with soft errors arising in memory cells. Another approach, used for dealing with hard errors, is to provide redundancy within the arrays. If an array is found to be defective, a fuse can be used to indicate its defective nature. A hard fuse can be permanently blown, or a soft fuse can be programmably set. A comparison is then made inside the array for each accessed address to see whether it matches a defective address. If so, appropriate logic re-routes the address to one of many extra row and column lines formed on the chip from redundant bit lines and word lines.
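By way of illustration only, the address comparison and re-routing described above can be sketched in software as follows. The sketch models only spare rows and represents the set soft fuses as a mapping from defective addresses to spare rows; the class name `RedundantArray` and its methods are hypothetical and do not describe any particular hardware implementation.

```python
class RedundantArray:
    """Hypothetical memory array with spare rows selected by soft fuses."""

    def __init__(self, num_rows, num_spares):
        self.rows = [0] * num_rows
        self.spares = [0] * num_spares
        self.fuses = {}                   # defective address -> spare index

    def set_fuse(self, address):
        """Mark a row address as defective and assign it a spare row."""
        if address in self.fuses:
            return self.fuses[address]
        if len(self.fuses) >= len(self.spares):
            raise RuntimeError("no spare rows left")
        self.fuses[address] = len(self.fuses)
        return self.fuses[address]

    def _resolve(self, address):
        # Compare the accessed address against the known defective
        # addresses and re-route to the spare row on a match.
        spare = self.fuses.get(address)
        if spare is not None:
            return self.spares, spare
        return self.rows, address

    def write(self, address, value):
        store, index = self._resolve(address)
        store[index] = value

    def read(self, address):
        store, index = self._resolve(address)
        return store[index]


array = RedundantArray(num_rows=8, num_spares=2)
array.set_fuse(3)                         # row 3 found to be defective
array.write(3, 0xAB)                      # lands in a spare row
array.write(4, 0xCD)                      # healthy row, used directly
```

In this sketch, the defective row is never written or read after its fuse is set; every access to address 3 is served by the assigned spare row.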
Redundancy thus provides for error correction by logically removing a defective array entry from use to avoid data corruption or a system outage. In the case of error correction code circuitry, a correctable error is the result of a single-bit failure. Ideally, the corresponding array entry is also logically removed from use to prevent the correctable error from turning into an uncorrectable error in the presence of future soft errors.