1. Field of the Invention
The present invention generally relates to data processing systems, and more particularly to a method of transmitting data using error correction codes.
2. Description of the Related Art
The basic structure of a conventional computer system includes one or more processing units connected to a memory hierarchy and various peripheral devices such as a display monitor, keyboard, network interface, and permanent storage device. The processing units communicate with memory and the peripheral devices by various means, including a generalized interconnect or bus. In a symmetric multi-processor (SMP) computer, all of the processing units are generally identical, that is, they all use a common set or subset of instructions and protocols to operate, and generally have the same architecture. An exemplary processing unit is the POWER processor marketed by International Business Machines Corp. The processing units can also have one or more caches, such as an instruction cache and a data cache, which are implemented using high speed memory devices. Caches are commonly used to temporarily store values that might be repeatedly accessed by a processor, in order to speed up processing by avoiding the longer step of loading the values from system memory (i.e., random-access memory, or RAM). These caches are referred to as “on-board” when they are integrally packaged with the processor core on a single integrated chip. Each cache is associated with a cache controller (not shown) that manages the transfer of data between the processor core and the cache memory. The memory hierarchy can include additional caches such as a level 2 (L2) cache which supports the on-board (level 1) caches. The L2 cache acts as an intermediary between system memory and the on-board caches, and can store a much larger amount of information (instructions and data) than the on-board caches can, but at a longer access penalty. Multi-level cache hierarchies can be provided where there are many levels of interconnected caches.
When providing memory values (instructions or operand data), the memory controller or cache controller can use an error correction code (ECC) circuit to detect and correct certain errors in the values received from the memory array before they are transmitted to the requesting unit (i.e., the processor). A bit in a value may be incorrect due either to a soft error (such as one caused by stray radiation or electrostatic discharge) or to a hard error (a defective cell). ECCs can be used to reconstruct the proper data stream. Many error control codes provide information about the specific location of the erroneous bit(s). Some ECCs can only be used to detect and correct single-bit errors; if two or more bits in a particular block are invalid, then the ECC might not be able to determine what the proper data stream should actually be, but at least the failure can be detected. Other ECCs are more sophisticated and allow detection or correction of double errors, and some ECCs further allow the memory word to be broken up into clusters of bits, or “symbols,” which can then be analyzed for errors in even more detail. Such multi-bit errors are costly to correct, but the alternative in the design tradeoff is to halt the machine when double-bit (or higher-order) errors occur. Error-correcting memory controllers traditionally use Hamming codes, although some use triple modular redundancy. The cache or system memory may include a “mark store” array which contains error information for each memory block or cache line. Whenever an error is encountered, the bit locations affected by the error can be stored in the mark store array for a particular rank in main memory. A rank in main memory refers to a specific memory module that accesses the cache line. Multiple memory modules can use a single cache, but only one module can access the cache line at a time.
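By way of illustration only, the single-error-correcting, double-error-detecting behavior described above can be sketched with an extended Hamming code over a 4-bit value. The sketch is not taken from any particular memory controller; the 8-bit codeword layout and the function names encode and decode are assumptions made for this example, and practical ECC circuits typically protect much wider words in hardware.

```python
# Illustrative sketch (assumed layout, hypothetical names): extended Hamming(8,4).
# Index 0 holds an overall parity bit; indices 1, 2 and 4 hold Hamming check bits;
# indices 3, 5, 6 and 7 hold the four data bits.

def encode(nibble):
    """Encode a 4-bit value into an 8-bit codeword, returned as a list of bits."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    word = [0] * 8
    word[3], word[5], word[6], word[7] = d
    word[1] = word[3] ^ word[5] ^ word[7]  # covers positions 1, 3, 5, 7
    word[2] = word[3] ^ word[6] ^ word[7]  # covers positions 2, 3, 6, 7
    word[4] = word[5] ^ word[6] ^ word[7]  # covers positions 4, 5, 6, 7
    word[0] = sum(word[1:]) % 2            # makes the total parity even
    return word

def decode(word):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos      # XOR of the positions of all set bits
    overall = sum(word) % 2      # 0 when the total parity is still even
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd parity: treated as a single-bit error. The syndrome gives its
        # position (0 means the overall parity bit itself was flipped).
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: a double-bit error was detected,
        # but its location cannot be determined.
        return "uncorrectable", None
    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return status, nibble
```

In this sketch, flipping any one bit of encode(n) still decodes to n with status 'corrected', whereas flipping any two bits yields 'uncorrectable', which corresponds to the case in which the failure can be detected but the proper data cannot be reconstructed.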