Most general purpose computer systems are built around a general-purpose processor, which is typically an integrated circuit operable to perform a wide variety of operations useful for executing a wide variety of software. The processor is able to perform a fixed set of instructions, which collectively are known as the instruction set for the processor. A typical instruction set includes a variety of types of instructions, including arithmetic, logic, and data instructions.
In more sophisticated computer systems, multiple processors are used, and one or more processors runs software that is operable to assign tasks to other processors or to split up a task so that it can be worked on by multiple processors at the same time. In such systems, the data being worked on is typically stored in memory that is either centralized, or is split up among the different processors working on a task.
Volatile memory, such as the dynamic random access memory (DRAM) most commonly found in computers, is able to store data such that it can be read or written much more quickly than the same data could be accessed using nonvolatile storage such as a hard disk drive or flash nonvolatile memory. Volatile memory loses its content when power is cut off, so while it is generally not useful for long-term storage it is typically used for temporary storage of data while a computer is running.
A typical random-access memory consists of an array of transistors or switches coupled to capacitors, where the transistors are used to switch a capacitor into or out of a circuit for reading or writing a value stored in the capacitive element. These storage bits are typically arranged in an array of rows and columns, and are accessed by specifying a memory address that contains or is decoded to find the row and column of the memory bit to be accessed.
The memory in a computer usually takes the form of a network of such circuit elements formed on an integrated circuit, or chip. Several integrated circuits are typically mounted to a single small printed circuit board to form a memory module, such as single inline memory modules (SIMMs) having a 32-bit memory channel for reading and writing data, or dual inline memory modules (DIMMs) having a 64-bit memory channel. Some more sophisticated types of memory modules include synchronous dynamic random access memory, or SDRAM, which runs in synchronization with the computer's bus, and double data rate (DDR) SDRAM or DDR2 SDRAM, which transfer data on both the rising and falling edges of the clock and have memory channel widths up to 64 bits of data and 8 bits of error management information per memory transfer. The memory chips are sometimes located centrally in multiprocessor computer systems, but are more typically divided among the various processors or system boards that make up the multiprocessor computer system.
Improvements in semiconductor technology over time include making memory chips and other digital logic chips smaller, faster, and operable to consume less power and therefore to generate less heat. But, the constant push to improve semiconductor performance and the imperfect nature of manufactured goods in general suggest that occasional flaws or imperfections will occur. Individual memory bit locations occasionally go bad, and sometimes even whole memory chips fail. It is also known that various electrical phenomena can regularly cause memory read or write errors or digital logic errors, such as electromagnetic noise causing a signal level to change or a cosmic ray changing the state of one or more bits of memory. Reductions in the size of digital logic and memory elements, and reductions in the voltage used to operate the memory, make such problems increasingly important to consider when designing computer systems.
Error management is therefore implemented in many computer systems, and is most typically embodied in a error checking bits such as a single parity bit per data byte that is operable to indicate when a single bit has changed state, or error correction codes (ECC) that can detect and often correct single-bit errors in memory systems. More advanced systems use multiple error correction bits to provide Single Error Correction/Multiple Error Detection capability, or SECDED, anticipating that multi-bit memory errors may continue to become more common as memory performance continues to push forward. Even though the reliability of individual digital logic chips and memory components is very high, the number of semiconductor devices in large multiprocessor computer systems and the cost involved with producing the digital logic and memory devices needed make error detection and correction an important consideration in large multiprocessor computer system design.