1. Field of the Invention
Generally, the present invention relates to a fail-over feature for handling of memory access errors in a computer system. More specifically, the fail-over feature relates to dynamically replacing or re-mapping memory locations experiencing unacceptable errors.
2. Description of the Related Art
Developments in personal computers have included faster clock speeds for the processor, for the buses, and for the devices connected to the buses or attached to the computer system through interfaces. In addition to the developments in clock speed, various other developments have enhanced the processing ability of personal computers, including, but not limited to, larger main memory sizes, internal and external cache subsystems, larger and faster hard drives, faster CD ROM drives and faster modems and networking connections.
Memory modules have long been used in arrays of several modules to provide the main memory for personal computer systems. The use of memory modules has permitted computer makers and users to scale the size of any particular computer's main memory to the desired size. Combinations of memory modules having different sizes installed in the same memory array permit many ranges of scalability. Recently, the size of the memory modules has increased into the gigabyte range.
Two of the most commonly known memory module types are the single in-line memory module (SIMM) and the dual in-line memory module (DIMM). Generally, a SIMM has a line of memory chips on a single printed circuit board (PCB) with a single edge connection. A DIMM, on the other hand, uses a very similar construction, but utilizes both sides of the printed circuit board to provide almost double the memory capacity in almost the same amount of physical space.
Memory accesses, such as from a bus, may be to a single byte of data, or digital information, stored at a single address space or to a large chunk of data stored in contiguous address spaces. Accesses to a large number of contiguous address spaces permit the memory subsystem to perform the data transfer as a direct memory access (DMA), whereby each byte, word, double-word, etc. of data in the contiguous address space is quickly read, written or otherwise accessed, without help from the processor.
Commonly, a memory access even to a single address space will cause the memory controller to access a larger number of contiguous address spaces which include the desired address. By doing so, the memory controller accommodates the cache functions of the computer system. A cache is a small, intermediate, fast memory subsystem between a fast processor and a slower memory subsystem. A memory cache subsystem relies on the assumption that a memory access to a particular address space will usually be followed by a memory access to the next contiguous address space, and so on for several memory accesses. The cache subsystem quickly accesses a larger number of address spaces, referred to as a cache line, surrounding the requested memory address space. The cache line is stored in the cache memory, a memory device with a faster response time than the main memory. Subsequent memory accesses to addresses in the same cache line may be responded to by the cache subsystem much more quickly than by the main memory, so the processor, or other device requesting the memory access, does not have a long waiting period for the access to complete. To provide a cache line, the memory modules may be accessed in memory blocks containing about 16, 32, 64 or 128 bytes, or another size, depending on the type of processor in the computer system.
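As an illustration of how a single requested address maps onto the surrounding memory block, the following minimal Python sketch computes the first and last address of the block containing a given address. The function name cache_line_span and the default block size of 32 bytes are assumptions chosen for illustration only; the block size is assumed to be a power of two, as in the 16, 32, 64 and 128 byte examples above.

```python
def cache_line_span(address, line_size=32):
    """Return (first, last) byte address of the block containing `address`.

    `line_size` is an assumed, illustrative block size and must be a
    power of two so that the low-order address bits can simply be masked.
    """
    if line_size <= 0 or line_size & (line_size - 1):
        raise ValueError("line_size must be a positive power of two")
    # Clearing the low-order offset bits yields the block's base address.
    base = address & ~(line_size - 1)
    return base, base + line_size - 1


if __name__ == "__main__":
    first, last = cache_line_span(0x1234, 32)
    print(hex(first), hex(last))
```

Any address between the two returned values falls in the same block, so a subsequent access to such an address could be satisfied from the cache rather than from main memory.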
Due to various reasons, the data retrieved from a location in a memory module may contain an error. For example, one of the bits may have the opposite value when read than it had when the data was written to the address space. To permit the memory subsystem to check for errors, data may be written with additional bits which, along with the data bits, may be decoded to determine whether one or more of the bits is wrong. For example, 64 bits of data may be stored with 8 additional bits, for a total of 72 bits, so that error checking and correcting (ECC) logic in the memory subsystem can decode all 72 bits to determine the location of an erroneous bit and to correct it before returning the data in response to the memory read access.
An uncorrectable error is one for which the ECC logic cannot determine the location of the error (e.g. there may be too many erroneous bits) and can be fatal to the computer. Since the memory subsystem cannot determine what the information is supposed to be, the processor may interpret it as an invalid command, or a command that sends the processor to perform a completely incorrect function. Either way, the computer system may crash and have to be shut down and rebooted.
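The distinction between correctable and uncorrectable errors can be sketched with one well-known construction that fits the 64-plus-8 example: an extended Hamming code that corrects any single-bit error and detects any double-bit error. This is only an illustrative assumption; the text above does not state which code the ECC logic uses, and the names encode and decode are hypothetical. Position 0 of the 72-bit word holds an overall parity bit, positions 1, 2, 4, 8, 16, 32 and 64 hold the check bits, and the remaining 64 positions hold data.

```python
TOTAL_POSITIONS = 72                       # 64 data bits + 8 check bits
PARITY_POSITIONS = [1, 2, 4, 8, 16, 32, 64]
DATA_POSITIONS = [p for p in range(1, TOTAL_POSITIONS)
                  if p not in PARITY_POSITIONS]   # exactly 64 positions


def encode(data):
    """Encode a 64-bit integer as a list of 72 bits."""
    word = [0] * TOTAL_POSITIONS
    for i, pos in enumerate(DATA_POSITIONS):
        word[pos] = (data >> i) & 1
    # Each check bit makes parity even over the positions whose index
    # has that check bit's power of two set.
    for p in PARITY_POSITIONS:
        parity = 0
        for pos in range(1, TOTAL_POSITIONS):
            if pos & p and pos != p:
                parity ^= word[pos]
        word[p] = parity
    # Overall parity bit makes the whole 72-bit word even.
    word[0] = sum(word[1:]) % 2
    return word


def decode(word):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    word = list(word)
    syndrome = 0
    for pos in range(1, TOTAL_POSITIONS):
        if word[pos]:
            syndrome ^= pos            # XOR of the indices of all set bits
    overall_odd = sum(word) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd and syndrome < TOTAL_POSITIONS:
        word[syndrome] ^= 1            # the syndrome locates the single bad bit
        status = "corrected"
    else:
        # Even overall parity with a non-zero syndrome: two bits are wrong
        # and their locations cannot be determined.
        return None, "uncorrectable"
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        data |= word[pos] << i
    return data, status
```

In this sketch a single flipped bit is located and repaired before the data is returned, whereas two flipped bits are reported as uncorrectable, which corresponds to the case in which the memory subsystem cannot determine what the information is supposed to be.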
An uncorrectable error may be preceded by a number of correctable errors at the same location. Thus, if the memory subsystem or the system software can keep track of the correctable errors that occur in the entire memory array, then a potential risk of a fatal error may be detected before it occurs, and the memory module containing the failing location may be replaced before a catastrophic event occurs to cause a user or an enterprise to lose valuable data or time in performing work. It is, therefore, desirable to have a way to fail-over, or move to a different location, the data before the problem with the memory module causes an uncorrectable error, resulting in a system crash.
When a memory module starts to develop errors, the problem typically does not involve the entire memory module. Rather, the initial problem is usually due to just one of the cells storing just one bit that has developed a soft, or correctable, error, while the remainder of the memory module, which may contain anywhere from kilobytes to gigabytes of memory, is still good and useable. Thus, failing-over an entire memory module due to an error in a single bit in one memory block is excessive. It would be more desirable to fail-over a much smaller chunk of memory, so the standby memory module need not be as large as the largest primary memory module, thereby saving the cost of a large standby memory module. Another advantage in failing-over a smaller chunk of memory would be in the time saved to perform the transfer of information from the failing memory module to the standby memory module, so that delays in arbitrating for the memory bus for other memory accesses are minimized and the overall performance of the computer system is not noticeably affected.
Errors also tend to occur in a random fashion, wherein one memory block in one memory module may have one bad bit, while the next bad bit may be in another memory block in a different memory module. Thus, in the above example, unless the computer user can replace a failed-over memory module as soon as possible after the fail-over is complete, there is a risk of catastrophic failure to the computer, since the memory system will not be able to fail-over another memory module after another memory error occurs.
In one embodiment, a computer system according to the present invention includes a bus subsystem, a processing unit, a mass storage device, a memory module array, a memory controller and a memory fail-over subsystem which cooperates with the memory modules of the memory module array to fail-over individual memory blocks in multiple memory modules. Another embodiment includes just the memory controller with the fail-over circuitry for failing-over individual memory blocks in one or more memory modules. Another embodiment is a method of controlling accesses to multiple memory modules, each having multiple memory blocks, by monitoring errors in accesses to the memory blocks in relation to a permissible error threshold and failing-over only an individual memory block upon detection of memory errors exceeding the permissible error threshold for that memory block. In this manner, digital information intended to be stored in the individual memory blocks is actually stored in an auxiliary memory location. Thus, memory accesses by the processing unit, the mass storage device or other element in the computer system to the failed-over memory blocks of multiple memory modules are redirected to and satisfied by the auxiliary memory. Tags associated with and identifying the failed-over memory blocks are stored in a tag storage location. Tag look-up circuitry compares the upper bits of each memory address with the stored tags to determine whether the memory access is to a failed-over memory block and provides a hit signal in response thereto.
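The tag comparison described above can be modeled in software as follows. This is a minimal sketch under stated assumptions, not the circuitry of any embodiment: the 64-byte block size, the class name TagStore, and the use of a dictionary in place of hardware comparators are all hypothetical choices made for illustration.

```python
BLOCK_SIZE = 64                              # assumed block size in bytes
OFFSET_BITS = BLOCK_SIZE.bit_length() - 1    # 6 low-order offset bits


def tag_of(address):
    """Upper address bits that identify the memory block."""
    return address >> OFFSET_BITS


class TagStore:
    """Software stand-in for the tag storage location and look-up logic."""

    def __init__(self, capacity):
        self.capacity = capacity      # number of auxiliary block slots
        self.tags = {}                # tag -> auxiliary slot index

    def add(self, address):
        """Record the block containing `address` as failed-over."""
        tag = tag_of(address)
        if tag in self.tags:
            return self.tags[tag]
        if len(self.tags) >= self.capacity:
            raise RuntimeError("no free auxiliary slot")
        slot = len(self.tags)
        self.tags[tag] = slot
        return slot

    def lookup(self, address):
        """Return (hit, slot); `hit` plays the role of the hit signal."""
        slot = self.tags.get(tag_of(address))
        return slot is not None, slot


if __name__ == "__main__":
    store = TagStore(capacity=4)
    store.add(0x1000)
    print(store.lookup(0x103F))   # same block as 0x1000
    print(store.lookup(0x1040))   # next block, not failed-over
```

Because only the upper bits are compared, every address within a failed-over block produces a hit, while addresses in neighboring blocks of the same memory module do not.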
In another embodiment, the computer system includes error monitoring circuitry which provides error data to the memory controller so that the fail-over subsystem will fail-over an individual memory block when the error data for the individual memory block exceeds a permissible threshold. Thereafter, memory accesses to non-failed-over memory blocks are still satisfied by the remaining memory blocks in the memory modules. An error log stores the transmitted error data to be processed to determine whether the error data exceeds the permissible threshold. Upon fail-over, a tag corresponding to the failed-over memory block is stored in a tag storage area, and the data, or digital information, that was stored in the failed-over memory block is transferred to a location in the auxiliary memory. The tag further corresponds to the auxiliary memory location for the transferred digital information. The auxiliary memory may be either embedded in the memory controller or external thereto. Tag look-up circuitry determines whether a memory access is to a failed-over memory block and provides a hit signal to the memory controller if it identifies a matching tag in the tag storage area, so that the memory access may be satisfied by the auxiliary memory location corresponding to the identified tag.
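The sequence described in this embodiment (log an error, compare against the threshold, transfer the block, store a tag, redirect later accesses) can be traced end to end in a small software model. The sketch below is only an assumed illustration: the class name FailoverMemory, the 64-byte block size, the threshold of three errors, and the use of byte arrays to stand in for the memory modules and the auxiliary memory are hypothetical and are not taken from the embodiments.

```python
from collections import Counter

BLOCK_SIZE = 64        # assumed memory block size in bytes
ERROR_THRESHOLD = 3    # assumed permissible number of correctable errors


class FailoverMemory:
    def __init__(self, main_size, aux_blocks):
        self.main = bytearray(main_size)               # memory modules
        self.aux = bytearray(aux_blocks * BLOCK_SIZE)  # auxiliary memory
        self.aux_blocks = aux_blocks
        self.error_log = Counter()                     # block -> error count
        self.tags = {}                                 # block -> aux slot

    def _locate(self, address):
        block, offset = divmod(address, BLOCK_SIZE)
        slot = self.tags.get(block)
        if slot is None:                               # miss: use main memory
            return self.main, address
        return self.aux, slot * BLOCK_SIZE + offset    # hit: use auxiliary

    def read(self, address):
        store, index = self._locate(address)
        return store[index]

    def write(self, address, value):
        store, index = self._locate(address)
        store[index] = value

    def report_correctable_error(self, address):
        """Log one error; return True if this call failed the block over."""
        block = address // BLOCK_SIZE
        if block in self.tags:
            return False                               # already failed-over
        self.error_log[block] += 1
        if self.error_log[block] > ERROR_THRESHOLD:
            return self._fail_over(block)
        return False

    def _fail_over(self, block):
        if len(self.tags) >= self.aux_blocks:
            return False                               # auxiliary memory full
        slot = len(self.tags)
        base = block * BLOCK_SIZE
        # Transfer the block's contents, then record the tag.
        self.aux[slot * BLOCK_SIZE:(slot + 1) * BLOCK_SIZE] = \
            self.main[base:base + BLOCK_SIZE]
        self.tags[block] = slot
        return True
```

In this model only the block whose error count exceeds the threshold is moved; accesses to every other block continue to be satisfied by the original memory, and the auxiliary memory needs to hold only a few blocks rather than a whole module.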
Therefore, it is desired to provide a memory fail-over system for a computer system that can fail-over individual memory blocks from multiple memory modules.