1. Description of the Related Art
Developments in personal computers have included faster clock speeds for the processor, for the buses, and for the devices connected to the buses or attached to the computer system through various interfaces. In addition to the developments in clock speed, various other developments have enhanced the processing ability of personal computers, including, but not limited to, larger main memory sizes, internal and external cache subsystems, larger and faster hard drives, faster CD-ROM drives, and faster modems and networking connections.
Memory modules have long been used in arrays of several modules to provide the main memory for personal computer systems. The use of memory modules has permitted computer makers and users to scale the size of any particular computer's main memory to the desired size. Combinations of memory modules having different sizes installed in the same memory array permit many ranges of scalability. Recently, the size of the memory modules has increased into the gigabyte range.
Two of the most commonly known memory module types are the single in-line memory module (SIMM) and the dual in-line memory module (DIMM). Generally, a SIMM has a line of memory chips on a single printed circuit board (PCB) with a single edge connection. A DIMM, on the other hand, uses a very similar construction, but utilizes both sides of the printed circuit board to provide almost double the memory capacity in almost the same amount of physical space.
Memory accesses, such as from a bus, may be to a single byte of data, or digital information, stored at a single address space, or to a large chunk of data stored in contiguous address spaces. Access to a large number of contiguous address spaces permits the memory subsystem to perform the data transfer as a direct memory access (DMA), whereby each byte, word, double-word, etc. of data in the contiguous address space is quickly read, written or otherwise accessed without help from the processor.
Commonly, a memory access even to a single address space will cause the memory controller to access a larger number of contiguous address spaces that includes the desired address. By doing so, the memory controller accommodates the cache functions of the computer system. A cache is a small, intermediate, fast memory subsystem between a fast processor and a slower memory subsystem. A memory cache subsystem operates on the assumption that a memory access to a particular address space will usually be followed by a memory access to the next contiguous address space, and so on for several memory accesses. The cache subsystem quickly accesses a larger number of address spaces, referred to as a cache line, surrounding the requested memory address space. The cache line is stored in the cache memory, a memory device with a faster response time than the main memory. Subsequent memory accesses to addresses in the same cache line may be responded to by the cache subsystem much more quickly than by the main memory, so the processor, or other device requesting the memory access, does not have a long waiting period for the access to complete. To provide a cache line, the memory modules may be accessed in memory blocks containing about 16, 32, 64 or 128 bytes, or another size, depending on the type of processor in the computer system.
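The relationship between a requested address and its surrounding cache line can be illustrated with a minimal sketch. It assumes that the line size is a power of two and that lines are aligned to their own size; the function name `cache_line_range` is hypothetical and is not taken from any particular memory controller.

```python
def cache_line_range(address: int, line_size: int = 64) -> range:
    """Return the byte addresses of the cache line that contains `address`.

    Assumes `line_size` is a power of two and that cache lines are aligned
    to multiples of `line_size` (an illustrative assumption).
    """
    if line_size <= 0 or line_size & (line_size - 1):
        raise ValueError("line_size must be a positive power of two")
    # Clearing the low-order offset bits yields the first address of the line.
    base = address & ~(line_size - 1)
    return range(base, base + line_size)


# A single-byte access to 0x1234 with 32-byte lines pulls in 0x1220..0x123F.
line = cache_line_range(0x1234, 32)
print(hex(line.start), hex(line.stop - 1))
```

Any later access whose address falls inside the returned range would be served from the cache rather than from the main memory.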
For various reasons, the data retrieved from a location in a memory module may contain an error. For example, one of the bits may have a different value when read than it had when the data was written to the address space. To permit the memory subsystem to check for errors, data may be written with additional bits which, along with the data bits, may be decoded to determine whether one or more of the bits is wrong. For example, 64 bits of data may be stored with 8 additional bits, for a total of 72 bits, so that error checking and correcting (ECC) logic in the memory subsystem can decode all 72 bits to determine the location of an erroneous bit and to correct it before returning the data in response to the memory read access.
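The 64-plus-8-bit arrangement described above can be illustrated with a short sketch. The text does not name a particular code, so the sketch assumes one common choice: an extended Hamming code that corrects any single-bit error and detects any double-bit error. The bit layout and the names `encode` and `decode` are illustrative assumptions, not the layout of any specific memory controller.

```python
# Assumed layout: bit 0 holds overall parity, bits 1..71 form a Hamming code
# whose check bits sit at the power-of-two positions.
PARITY_POSITIONS = (1, 2, 4, 8, 16, 32, 64)
DATA_POSITIONS = tuple(p for p in range(1, 72) if p not in PARITY_POSITIONS)
CODEWORD_BITS = 72


def _syndrome(word: int) -> int:
    """XOR together the positions (1..71) of all set bits."""
    s = 0
    for pos in range(1, CODEWORD_BITS):
        if (word >> pos) & 1:
            s ^= pos
    return s


def _odd_parity(word: int) -> bool:
    return bin(word).count("1") % 2 == 1


def encode(data: int) -> int:
    """Spread 64 data bits over a 72-bit codeword and add 8 check bits."""
    if not 0 <= data < (1 << 64):
        raise ValueError("data must fit in 64 bits")
    word = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            word |= 1 << pos
    s = _syndrome(word)
    for p in PARITY_POSITIONS:      # set check bits so the syndrome becomes 0
        if s & p:
            word |= 1 << p
    if _odd_parity(word):           # bit 0 makes the total number of ones even
        word |= 1
    return word


def decode(word: int):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    s = _syndrome(word)
    if _odd_parity(word):
        if s >= CODEWORD_BITS:      # points outside the codeword: give up
            return None, "uncorrectable"
        word ^= 1 << s              # s is the erroneous position (0 = parity bit)
        status = "corrected"
    elif s:
        return None, "uncorrectable"  # even parity, nonzero syndrome: 2-bit error
    else:
        status = "ok"
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (word >> pos) & 1:
            data |= 1 << i
    return data, status


stored = encode(0x0123456789ABCDEF)
print(decode(stored ^ (1 << 17)))               # one flipped bit
print(decode(stored ^ (1 << 17) ^ (1 << 40)))   # two flipped bits
```

In this sketch a single flipped bit is located by the syndrome and corrected before the data is returned, while two flipped bits are reported as uncorrectable because their location cannot be determined.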
An uncorrectable error is one for which the ECC logic cannot determine the location of the error (e.g., because there are too many erroneous bits), and such an error can be fatal to the computer. Since the memory subsystem cannot determine what the information is supposed to be, the processor may interpret it as an invalid command, or as a command that sends the processor to perform a completely incorrect function. Either way, the computer system may crash and have to be shut down and rebooted.
An uncorrectable error may be preceded by a number of correctable errors at the same location. Thus, if the memory subsystem or the system software can keep track of the correctable errors that occur in the entire memory array, then the risk of a fatal error may be detected before the error occurs, and the memory module containing the failing location may be replaced before a catastrophic event causes a user or an enterprise to lose valuable data or working time. It is, therefore, desirable to have a way to fail over the data, that is, to move it to a different location, before the problem with the memory module causes an uncorrectable error and a resulting system crash. When a memory module starts to develop errors, the problem typically does not involve the entire memory module. Rather, the initial problem is usually confined to just one of the cells storing just one bit that has developed a soft, or correctable, error, while the remainder of the memory module, which may contain anywhere from kilobytes to gigabytes of memory, is still good and usable. Thus, failing over an entire memory module due to an error in a single bit in one memory block is excessive. It would be more desirable to fail over a much smaller chunk of memory, so that the standby memory module need not be as large as the largest primary memory module, thereby saving the cost of a large standby memory module. Another advantage of failing over a smaller chunk of memory would be the time saved in transferring information from the failing memory module to the standby memory module, so that delays in arbitrating for the memory bus for other memory accesses are minimized and the overall performance of the computer system is not noticeably affected.
Errors also tend to occur in a random fashion, wherein one memory block in one memory module may have one bad bit, while the next bad bit may be in another memory block in a different memory module. Thus, when an entire memory module is failed over, unless the computer user replaces the failed-over memory module as soon as possible after the fail-over is complete, there is a risk of catastrophic failure of the computer, since the memory system will not be able to fail over another memory module after another memory error occurs.
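The idea of tracking correctable errors per memory block, rather than per memory module, can be sketched as follows. This is only an illustration of the bookkeeping discussed above: the class name `CorrectableErrorTracker`, the 64-byte block size, and the threshold of three errors are all hypothetical choices, and the actual copying of a block to standby memory is not modeled.

```python
from collections import Counter


class CorrectableErrorTracker:
    """Count correctable errors per memory block and flag blocks to fail over.

    `block_size` and `threshold` are illustrative assumptions.
    """

    def __init__(self, block_size: int = 64, threshold: int = 3):
        self.block_size = block_size
        self.threshold = threshold
        self.counts = Counter()     # block index -> correctable errors seen
        self.failed_over = set()    # block indices already moved to standby

    def record(self, address: int) -> bool:
        """Record a correctable error at `address`.

        Returns True when this error pushes the containing block to the
        threshold, meaning the block should be failed over now.
        """
        block = address // self.block_size
        if block in self.failed_over:
            return False            # block already lives in standby memory
        self.counts[block] += 1
        if self.counts[block] >= self.threshold:
            self.failed_over.add(block)
            return True
        return False


tracker = CorrectableErrorTracker(block_size=64, threshold=3)
for addr in (0x1000, 0x2000, 0x1010, 0x103F):
    if tracker.record(addr):
        print(f"fail over block {addr // tracker.block_size:#x}")
```

Because each block is tracked independently, random errors scattered across different modules consume only block-sized pieces of standby memory instead of exhausting a single module-sized spare.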