1. Technical Field
The present invention relates generally to an improved data processing system, and in particular to a method and apparatus for handling multiple bit errors in a data processing system. The present invention also provides a set of computer-implemented instructions for handling multiple bit errors.
2. Description of Related Art
In a large memory system, errors from a memory device may cause a catastrophic system error. A hard error means that a cell within the memory device is permanently defective. A soft error, on the other hand, is a temporary fault, such as the loss of a data bit. With soft errors, the memory device still functions correctly after the data is rewritten into the memory cell. There are many causes of soft errors, such as alpha particles, noise on power or control signals, temperature extremes, marginal timing, or the like.
Today, computer systems with high availability requirements use error detection logic and parity to ensure data integrity and system reliability. For computer hardware with high failure rates (e.g., system memory and cache), error correction code (ECC) logic is used to correct single-bit errors. Such ECC logic helps to prevent an immediate failure of the system and improves overall system availability.
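By way of illustration only, single-bit error correction may be sketched with a Hamming(7,4) code. This sketch is an assumption chosen for brevity and is not the ECC logic of any particular system; the function names `encode` and `decode` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8                             # index 0 unused; positions 1..7
    code[3], code[5], code[6], code[7] = d     # data bits
    code[1] = code[3] ^ code[5] ^ code[7]      # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]      # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]      # parity over positions 4, 5, 6, 7
    return code[1:]


def decode(bits):
    """Return (data nibble, syndrome); a nonzero syndrome is the corrected position."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                    # XOR of the positions holding a 1
    if syndrome:
        code[syndrome] ^= 1                    # flip the bit the syndrome points to
    d = (code[3], code[5], code[6], code[7])
    return sum(bit << i for i, bit in enumerate(d)), syndrome


word = encode(0b1011)
word[4] ^= 1                                   # inject a single-bit error at position 5
data, position = decode(word)
print(data == 0b1011, position)
```

A code of this kind repairs any one flipped bit per codeword. If two bits flip in the same codeword, the syndrome points to a third, unrelated position, and the decoder returns wrong data.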
A system memory is the central storage in the computer where programs and data reside while waiting to be processed by the processor. A cache is a temporary storage area close to or internal to the processor that allows speedy access to programs or data. The term array generally refers to smaller arrangements of temporary memory storage, including cache. A cache or memory address is a reference to a physical location within the cache or memory storage that stores one or several bytes of computer instructions or data. A cache line is a block of addresses or physical locations within the cache, usually a group of 128, 256 or 512 bytes of data. Such a line addressing architecture may also apply to any memory system.
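As a minimal sketch of line addressing, assuming a 128-byte line and byte-granular addresses, an address may be split into a line number and an offset within that line. The names `LINE_SIZE`, `split_address`, and `same_line` are hypothetical.

```python
LINE_SIZE = 128  # bytes per line; assumed here, 256 or 512 work the same way


def split_address(addr, line_size=LINE_SIZE):
    """Return (line number, byte offset within the line) for a byte address."""
    return addr // line_size, addr % line_size


def same_line(addr_a, addr_b, line_size=LINE_SIZE):
    """True when both byte addresses fall inside the same line."""
    return addr_a // line_size == addr_b // line_size


print(split_address(0x1234))      # line 36, offset 52
print(same_line(0x1234, 0x1240))  # both addresses lie in line 36
```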
A cache or memory address with repeating single-bit errors indicates a hard error condition, which requires continuous error correction by the ECC logic. A cache or memory with a single hard error, if left in the system for an extended period of time, may lead to an uncorrectable error condition and a system outage due to the occurrence of a second hard error within the same or an adjacent physical address location. Typical ECC logic can only correct single-bit errors. To prevent potential system failure in a computer system with high availability requirements, it is general practice to replace a cache or memory that has a single-bit hard error. However, frequent replacement of parts can lead to high service costs for the computer manufacturer and a perception of poor system reliability by the customer.
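The inference from repeating single-bit errors to a hard error condition may be sketched as a per-address counter of corrected errors. This is an illustrative assumption only: the class name `CorrectedErrorLog`, the threshold of three, and the status strings are hypothetical and do not come from any particular system.

```python
from collections import Counter


class CorrectedErrorLog:
    """Counts ECC-corrected single-bit errors per address (hypothetical sketch)."""

    def __init__(self, threshold=3):
        self.threshold = threshold   # assumed number of repeats that suggests a hard error
        self.counts = Counter()

    def record(self, address):
        """Log one corrected error at the address and return its current status."""
        self.counts[address] += 1
        return self.status(address)

    def status(self, address):
        seen = self.counts[address]
        if seen == 0:
            return "clean"
        if seen >= self.threshold:
            return "suspected hard error"
        return "possible soft error"


log = CorrectedErrorLog()
for _ in range(3):
    state = log.record(0x1234)
print(state)
```

A single corrected error at an address is consistent with a soft error, whereas corrections that keep recurring at the same address suggest a permanently defective cell.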
Therefore, it would be advantageous to have an improved system that minimizes service cost and improves system reliability by allowing the system to continue running without replacement of a part that has a single-bit hard error, even when a second hard error occurs. It would further be beneficial to have an apparatus and method that allow maintenance to be scheduled after a second hard error occurs but before a catastrophic error or system downtime.