Microcomputer systems are being used for increasingly complex tasks. Software applications that run on these systems, such as databases, are also becoming more memory intensive, and larger, denser, and wider memory devices are now in use. At the same time, rapidly advancing technology enables central processing units and memory devices to run at very high clock speeds and at lower voltages. As a result, noise in the system environment is growing and data is becoming more vulnerable to errors caused by transient electrical and electromagnetic phenomena.
Personal computer users are now using their systems in critical applications. As a result, microcomputer systems, such as servers, may include fault tolerant features such as hot pluggability and failover capability. These features in turn can reduce system downtime or improve availability. Because of the demands placed on these personal computer systems, users are continually in search of systems that provide maximum data availability and integrity at competitive cost and performance.
Memories in conventional personal computers are usually dynamic random access memories (DRAMs) provided in single inline memory modules (SIMMs) or dual inline memory modules (DIMMs). Commonly, a number of such modules are used in a memory subsystem under the control of a memory controller. If one of these modules performs poorly, it can adversely affect the operation of the entire system. In fact, a single uncorrectable fault in a single memory module can cause the system to crash.
Memory device failures generally fall into two categories. The first is a soft error, in which the data stored at a given memory location changes, but subsequent accesses can store the correct data to the same location with no more likelihood of returning incorrect data than from any other location. The second is a hard error, in which data can no longer be reliably stored at a given memory location. Either type of error can lead to catastrophic failure of the memory subsystem.
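The distinction between the two categories can be illustrated with a minimal sketch. It assumes a simulated one-byte memory cell and a simple stuck-at-1 fault model; the names `SimulatedCell`, `stuck_mask`, and `classify_error` are hypothetical and serve only as illustration, not as a description of any particular memory controller.

```python
class SimulatedCell:
    """One byte of simulated memory (hypothetical model).

    Bits set in stuck_mask are permanently stuck at 1, which models
    a hard fault. A healthy cell has stuck_mask == 0.
    """

    def __init__(self, stuck_mask=0):
        self.stuck_mask = stuck_mask
        self.value = 0

    def write(self, byte):
        # Stuck bits override whatever is written.
        self.value = (byte | self.stuck_mask) & 0xFF

    def read(self):
        return self.value

    def flip_bit(self, bit):
        # Models a transient upset that corrupts the stored data once.
        self.value ^= 1 << bit


def classify_error(cell, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Rewrite the location with test patterns and read each one back.

    If every pattern reads back correctly, the location can still store
    data reliably, so the earlier error is treated as soft. Otherwise
    the location is treated as having a hard error.
    """
    for pattern in patterns:
        cell.write(pattern)
        if cell.read() != pattern:
            return "hard"
    return "soft"
```

Note that this check is destructive: it overwrites the location, so in practice the correct data would have to be restored afterwards.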
To address these problems, computer systems may include error correcting code (ECC) to detect and, when possible, correct memory errors. In addition, a fault prediction scheme used with ECC may enable a system operator to predict the onset of a memory problem and to replace the memory before the problem occurs. For example, a monitor may record the number of ECC errors that are detected and, when some threshold level is reached, advise the user of a potential future problem. The user might elect to replace a module at that time.
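The threshold-based monitor described above might be sketched as follows. This is only an illustration under stated assumptions: the class name `EccErrorMonitor`, the per-module counting, and the default threshold of 10 are hypothetical choices, since the text does not specify how errors are counted or what threshold is used.

```python
from collections import Counter


class EccErrorMonitor:
    """Counts detected ECC errors per memory module (hypothetical sketch)."""

    def __init__(self, threshold=10):
        # The threshold value is an assumption for illustration.
        self.threshold = threshold
        self.counts = Counter()
        self.warned = set()

    def record_error(self, module_id):
        """Record one detected ECC error for a module.

        Returns True exactly once per module, at the moment its error
        count reaches the threshold, so the user can be advised of a
        potential future problem.
        """
        self.counts[module_id] += 1
        reached = self.counts[module_id] >= self.threshold
        if reached and module_id not in self.warned:
            self.warned.add(module_id)
            return True
        return False
```

A caller would invoke `record_error` each time the ECC logic reports an error and notify the user when it returns `True`. The user could then elect to replace the module.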
In an attempt to overcome these problems, a variety of techniques have been proposed. One such proposal is to mark a section of the memory as faulty when it experiences an excessive number of errors or when uncorrectable errors arise. In this way, the system can restart without the faulty memory being mapped into the address space. The problem with this approach is that the system needs to reboot in order to implement it. Moreover, after the reboot, the system has less memory to work with. Each fault requires a segment of memory, for example 128K, to be mapped out, progressively decreasing the available memory.
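The cost of this map-out approach can be illustrated with a short sketch. It assumes that the 128K segment size is in bytes and that total memory is a whole number of segments; the names `MemoryMap`, `mark_faulty`, and `usable_bytes_after_reboot` are hypothetical.

```python
SEGMENT_SIZE = 128 * 1024  # assumed: the 128K segment is measured in bytes


class MemoryMap:
    """Tracks segments marked faulty and excluded at the next reboot."""

    def __init__(self, total_bytes):
        # Assumed to be a whole multiple of SEGMENT_SIZE.
        self.total_bytes = total_bytes
        self.bad_segments = set()

    def mark_faulty(self, address):
        """Mark the whole segment containing the faulty address."""
        if not 0 <= address < self.total_bytes:
            raise ValueError("address outside installed memory")
        self.bad_segments.add(address // SEGMENT_SIZE)

    def usable_bytes_after_reboot(self):
        """Memory left once every faulty segment is mapped out."""
        return self.total_bytes - len(self.bad_segments) * SEGMENT_SIZE
```

Two faults in the same segment cost one segment, but every fault in a new segment removes another 128K, which is the progressive loss of available memory noted above.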
Another approach involves downshifting the memory port configuration when a hard memory failure occurs on one of two memory ports. However, losing one half of the total memory has a severe impact on performance. Moreover, the downshifting approach also requires the system to reboot.
It would be highly desirable to provide a system for overcoming defects at particular memory locations without decreasing the overall available system memory. Likewise, it would be desirable to enable such corrective action to occur without requiring the system to reboot. Similarly, it would be preferable to enable corrective action prior to a catastrophic failure.