1. Field of the Invention
The present invention relates to computer systems, and more particularly, to high availability computer systems having error detecting and correcting random access memory, at least one cache memory and a fail-over replacement system for defective portions of the random access memory.
2. Description of the Related Technology
Use of computers, especially personal computers, is becoming more and more pervasive because the computer has become an integral tool of most information workers who work in the fields of accounting, law, engineering, insurance, services, sales and the like. Rapid technological improvements in the field of computers have opened up many new applications heretofore unavailable or too expensive for the use of older technology mainframe computers. These personal computers may be used as stand-alone workstations (high end individual personal computers) or linked together in a network by a “network server” which is also a personal computer which may have a few additional features specific to its purpose in the network.
The network server may be used to store massive amounts of data, and may facilitate interaction of the individual workstations connected to the network for electronic mail (“e-mail”), document databases, video teleconferencing, whiteboarding, integrated enterprise calendar, virtual engineering design and the like. Multiple network servers may also be interconnected by local area networks (“LAN”) and wide area networks (“WAN”).
As users become more and more dependent on computers, the requirement that the computer system remain operational when most needed is of paramount importance. An unplanned service outage caused by a computer server crash may leave customers waiting in line at checkout counters, doctors unable to obtain patient data, and on-line users unable to log onto a network, or may cause an office slowdown or even shutdown because documents, e-mail, accounting information, Internet web page hosting and the like become inaccessible.
The network servers are being widely used in mission critical business, scientific and government applications by, for example, tying together the personal computer workstations into a network (LAN and WAN), and for storing and/or forwarding critical information. Software applications such as databases that run on these servers are becoming more memory intensive than ever before. The memory systems of these servers are continually becoming larger in order to handle the more demanding software application programs and files associated therewith. At the same time, rapidly advancing electronics technologies enable microprocessors and associated memory devices to run at ever faster clock speeds using lower voltages. The lower voltage creates a lower data signal noise margin, and the higher clock speeds exacerbate noise conditions. As a direct result, the computer system environmental noise becomes a more significant factor and data is more vulnerable to errors caused by transient electrical and electromagnetic phenomena that can corrupt the data stored in the memory subsystem.
When a memory error does occur, a server should not lose critical data or crash. A server may employ an error checking and correcting (ECC) logic circuit to improve data integrity and thus data availability by detecting and correcting “soft” data errors within the memory subsystem. Error detection and correction allows the server memory subsystem to operate continuously, and to be available as long as the detected errors are correctable by the ECC logic circuit. However, a memory address of the memory subsystem that has experienced excessive ECC soft errors is more likely to continue generating errors and the severity of these errors may increase to the point where the ECC logic circuit can no longer correct all of the errors. At the point of being unable to correct all of the errors, the server may crash.
Defective portions of a memory module(s) (i.e., those having an excessive number of errors) have been replaced or bypassed in the memory subsystem by marking the section (for example, 128 KB) of faulty memory (due to excessive or non-correctable errors). The server must then be shut down and restarted without the section of faulty memory mapped into the computer system address space. Thus, a network outage is required and a reduction in system memory capacity results.
Fully redundant memory subsystems may be used, and when excessive errors occur in one, it is marked as defective. When the server is restarted the defective memory is not mapped as useable memory; however, half of the system memory is no longer functional and system performance suffers.
A standby hot fail-over memory system allows the memory controller to fail-over to a standby memory module the data stored in the memory module having errors before an uncorrectable error happens. This fail-over system, however, allows only one memory module to be replaced. It cannot solve the problem of errors coming from multiple memory modules. An additional memory module must be designated as the standby memory module. It takes a longer time for the fail-over process to complete since all of the data stored in the failing memory module must be transferred to the standby memory module. The fail-over time is dependent upon the memory size and the actual memory traffic that is generated during the fail-over process. The standby hot fail-over memory system is more fully described in commonly owned U.S. patent application Ser. No. 08/763,411, filed Dec. 11, 1996, entitled “Failover Memory for a Computer System” by Sompong P. Olarig, and is incorporated by reference herein.
A fast fail-over memory allows the memory controller to support multiple memory address failures while the computer system is running before an uncorrectable error occurs. The fast fail-over memory system requires a portion of additional standby memory space to function. If there are no memory errors, the fail-over standby memory is not being used. The fast fail-over memory system is more fully described in commonly owned U.S. patent application Ser. No. 09/116,714, filed Jul. 16, 1998, entitled “Fail-Over of Multiple Memory Blocks in Multiple Memory Modules in a Computer System” by Sompong P. Olarig, and is incorporated by reference herein.
The processor or plurality of processors in a computer system run in conjunction with a high capacity, low-speed (relative to the processor speed) main memory, and a low capacity, high-speed (comparable to the processor speed) cache memory or memories (one or more cache memories associated with each of the plurality of processors).
Cache memory is used to reduce memory access time in mainframe computers, minicomputers, and microprocessors. The cache memory provides a relatively high speed memory interposed between the slower main memory and the processor to improve effective memory access rates, thus improving the overall performance and processing speed of the computer system by decreasing the apparent amount of time required to fetch information from main memory.
In today's single and multi-processor computer systems, there is typically at least one level of cache memory for each of the processors. The latest microprocessor integrated circuits may have a first level cache memory located in the integrated circuit package and closely coupled with the central processing unit (“CPU”) of the microprocessor. Additional levels of cache may also be implemented by adding fast static random access memory (SRAM) integrated circuits and a cache controller. Typical secondary cache size may be anywhere from 64 kilobytes to 8 megabytes and the cache SRAM has an access time comparable with the processor clock speed.
In common usage, the term “cache” refers to a hiding place. The name “cache memory” is an appropriate term for this high speed memory that is interposed between the processor and main memory because cache memory is hidden from the user or programmer, and thus appears to be transparent. Cache memory, serving as a fast storage buffer between the processor and main memory, is not user addressable. The user is only aware of the apparently higher-speed memory accesses because the cache memory is satisfying many of the requests instead of the slower main memory.
Cache memory is smaller than main memory because cache memory employs relatively expensive high speed memory devices, such as static random access memory (“SRAM”). Therefore, cache memory typically will not be large enough to hold all of the information needed during program execution. As a process executes, information in the cache memory must be replaced, or “overwritten” with new information from main memory that is necessary for executing the process thread. The information in main memory is typically updated each time a “dirty” cache line is evicted from the cache memory (a process called “write back”). As a result, changes made to information in cache memory will not be lost when new information enters cache memory and overwrites information which may have been changed by the processor.
Information is only temporarily stored in cache memory during execution of the process thread. When process thread data is referenced by a processor, the cache controller will determine if the required data is currently stored in the cache memory. If the required information is found in cache memory, this is referred to as a “cache hit.” A cache hit allows the required information to be quickly retrieved from or modified in the high speed cache memory without having to access the much slower main memory, thus resulting in a significant savings in program execution time. When the required information is not found in the cache memory, this is referred to as a “cache miss.” A cache miss indicates that the desired information must be retrieved from the relatively slow main memory and then placed into the cache memory. Cache memory updating and replacement schemes attempt to maximize the number of cache hits, and to minimize the number of cache misses.
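The hit, miss and write back behavior described above can be illustrated with a minimal software sketch. It is only a toy model under stated assumptions: the class name `WriteBackCache`, the fully associative organization, the least recently used replacement policy and the dictionary standing in for main memory are all hypothetical choices made for illustration and are not taken from the disclosure.

```python
from collections import OrderedDict


class WriteBackCache:
    """Toy fully associative write-back cache with LRU replacement (illustrative only)."""

    def __init__(self, main_memory, capacity):
        self.main_memory = main_memory  # dict: line address -> value (stands in for DRAM)
        self.capacity = capacity        # number of cache lines available
        self.lines = OrderedDict()      # line address -> [value, dirty flag]
        self.hits = 0
        self.misses = 0

    def _fill(self, addr):
        """Bring a line in from main memory, evicting the LRU line if the cache is full."""
        if len(self.lines) >= self.capacity:
            victim, (value, dirty) = self.lines.popitem(last=False)
            if dirty:
                # Write back: main memory is updated only when a dirty line is evicted.
                self.main_memory[victim] = value
        self.lines[addr] = [self.main_memory[addr], False]

    def _lookup(self, addr):
        if addr in self.lines:
            self.hits += 1              # cache hit: no main memory access needed
        else:
            self.misses += 1            # cache miss: fetch the line from main memory
            self._fill(addr)
        self.lines.move_to_end(addr)    # mark the line as most recently used
        return self.lines[addr]

    def read(self, addr):
        return self._lookup(addr)[0]

    def write(self, addr, value):
        line = self._lookup(addr)
        line[0] = value
        line[1] = True                  # the line is now dirty
```

In this sketch a modified line reaches main memory only when it is evicted, which mirrors the write back process described in the preceding paragraphs.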
Information from main memory is typically stored in “lines” of cache memory which contain a plurality of bytes or words from the main memory such as, for example, 16, 32 or 64 bytes of information. The plurality of bytes from main memory are stored sequentially in a line of cache memory. Each line of cache memory has an associated “tag” that stores the physical addresses of main memory containing the information in the cache line as well as other things such as “MESI” state information for the cache line. From the example above, if 16 bytes of information are stored in a cache line, the least significant 4 bits of the physical address of main memory are dropped from the main memory address stored in the tag register. In addition, the tag register may contain a cache consistency protocol such as “MESI” (Modified, Exclusive, Shared and Invalid) to ensure data consistency in a multi-processor or bus master environment.
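The address arithmetic for the 16 byte example can be sketched as follows. The function name `split_address` and the sample address are hypothetical and chosen only for illustration; the sketch assumes a line size that is a power of two.

```python
LINE_SIZE = 16                              # bytes per cache line (one of the example sizes)
OFFSET_BITS = LINE_SIZE.bit_length() - 1    # log2(16) = 4 for a power-of-two line size


def split_address(addr):
    """Return (line_address, byte_offset) for a physical byte address."""
    line_address = addr >> OFFSET_BITS      # least significant 4 bits dropped
    byte_offset = addr & (LINE_SIZE - 1)    # position of the byte within the line
    return line_address, byte_offset


line_address, byte_offset = split_address(0x12345)
print(hex(line_address), byte_offset)       # prints: 0x1234 5
```

All 16 byte addresses from 0x12340 through 0x1234F share the line address 0x1234, which is why the four low-order bits need not be kept with the tag.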
What is needed is a system, method and apparatus for replacing failing (excessive error generating) memory locations (addresses) without requiring additional standby memory modules, or having to shut down and restart the computer system.
The present invention overcomes the above-identified problems as well as other shortcomings and deficiencies of existing technologies by providing in a computer system an apparatus, method and system to automatically fail-over multiple blocks of failing (error prone) memory locations from one or more memory modules to cache memory while the computer system is running and without having to reboot (restart) the computer system. The memory modules may be, for example but not limitation, single-in-line memory modules (SIMM), dual-in-line memory modules (DIMM), “DIRECT RAMBUS” in-line memory modules (RIMM), portable RAMBUS in-line memory modules (SO-RIMM), and the like of dynamic random access memory (DRAM). DIRECT RAMBUS, RIMM, SO-RIMM, and RAMBUS are registered trademarks of Rambus Inc., 2465 Latham Street, Mountain View, Calif. 94040, USA.
According to the embodiments of the present invention, a cache memory and a logic circuit, such as a core logic chipset, adapted for operation with the cache memory operates normally for caching data and instructions from system memory. However, when a hard memory failure or an excessive number of ECC errors has been detected, the failing memory location(s) or (address(es)) are remapped to the cache memory on a cache-line by cache-line basis. While the cache-line of memory locations or addresses having the failing memory address(es) is being remapped to the cache memory, any further requests to the failing memory (at least the cache line of such memory locations or addresses) will be stalled. Once the failing memory address(es) have been remapped to a cache-line of the cache memory, all reads from, or writes to those memory addresses remapped to the cache-line (a cache “hit”) are handled from the cache memory without ever needing to access the failing memory location(s) again. The computer system continues to run normally without any significant performance loss, except that all requests to the failing memory location(s) are serviced from the cache memory. When a main memory error occurs, the computer system can notify a systems administrator of the failed memory via LEDs, LCD, GUI, a user message, and the like.
Each cache-line has status bits associated therewith that are used for cache coherency functions. The status bits follow a cache coherency protocol such as MESI. Cache memory is not large enough to hold all the data that the processor(s) need during operation of the computer system; therefore, cache-line entries are replaced when new data is brought into the cache. To ensure that a cache-line being used as fail-over memory is never replaced, an additional status bit for that cache-line is required. This status bit may be referred to hereinafter as the “fail-over memory status bit” and will be used to “lock” this cache-line of cache memory and indicate that this cache-line is being used as fail-over memory and cannot be replaced with other instructions or data, i.e., it cannot be part of the pool of cache-lines available for replacement.
During normal cache memory operation, a replacement algorithm (e.g., least recently used) may designate any cache-line for replacement (updating of instructions or data used by the processor(s) of the computer system). However, when the fail-over memory status bit is set for a corresponding cache-line, that cache-line is being used as fail-over memory and it then becomes a part of the permanent computer system memory since the so marked cache-line has effectively replaced a cache-line of main memory having a failing memory location(s). The fail-over memory status bit is set whenever a memory error has been detected and the contents of the failing memory location(s) have been remapped to the cache memory.
The cache-line replacement algorithm will read the fail-over memory status bit, and if set, will not use the associated cache-line for a subsequent replacement operation. Since cache-lines that are identified as fail-over memory cannot be replaced, all reads to those cache-line addresses will result in hits in the cache memory. All writes to those so marked cache-line addresses will also be hits; however, the set fail-over memory status bit will serve as an indication that the modified contents of the cache-line need not be written to main memory (failed memory). Snarfing and snooping from other controllers of the computer system will still function by reading the modified contents of the cache-line but there will be no writeback or eviction of the cache-line so long as the fail-over memory status bit remains set.
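The effect of the fail-over memory status bit on replacement can be sketched in software. This is a simplified model rather than the disclosed circuit: the class `FailOverCache`, its method names, the fully associative organization and the dictionary standing in for main memory are assumptions made for illustration, and coherency traffic such as snooping is not modeled.

```python
from collections import OrderedDict


class FailOverCache:
    """Toy LRU cache whose lines can be locked as fail-over memory (illustrative only)."""

    def __init__(self, main_memory, capacity):
        self.main_memory = main_memory  # dict: line address -> data (stands in for DRAM)
        self.capacity = capacity        # number of cache lines available
        self.lines = OrderedDict()      # addr -> {"data", "dirty", "failover"}

    def _evict_one(self):
        # Scan from least to most recently used, skipping lines whose
        # fail-over memory status bit is set.
        victim = next(
            (addr for addr, line in self.lines.items() if not line["failover"]),
            None,
        )
        if victim is None:
            raise RuntimeError("every cache line is locked as fail-over memory")
        line = self.lines.pop(victim)
        if line["dirty"]:
            self.main_memory[victim] = line["data"]  # ordinary write back

    def _line(self, addr):
        if addr not in self.lines:
            if len(self.lines) >= self.capacity:
                self._evict_one()
            self.lines[addr] = {
                "data": self.main_memory[addr],
                "dirty": False,
                "failover": False,
            }
        self.lines.move_to_end(addr)    # mark as most recently used
        return self.lines[addr]

    def read(self, addr):
        return self._line(addr)["data"]

    def write(self, addr, data):
        line = self._line(addr)
        line["data"] = data
        line["dirty"] = True

    def fail_over(self, addr):
        """Cache the failing line and set its fail-over memory status bit."""
        self._line(addr)["failover"] = True

    def release(self, addr):
        """After the memory is replaced: write the line back and clear the bit."""
        line = self.lines[addr]
        self.main_memory[addr] = line["data"]
        line["dirty"] = False
        line["failover"] = False
```

Because a locked line is never selected as a victim, its modified contents are never written to the failed memory, and every later read or write to that address is a hit. The `release` method corresponds to the write back that follows replacement of the failed memory.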
Once the main memory with the hard failure or excessive ECC errors has been replaced (for example during a hot plug replacement), the remapped cache-line contents are written back to the new replacement memory. If the computer system is shut down during a planned maintenance, then the contents of the memory locations (both good main and fail-over cache) that need to be saved are written to, for example, a hard disk storage system. Once the main memory having the hard failure or excessive ECC errors has been replaced, the appropriate fail-over memory status bit is reset and the associated cache-line becomes available to the cache controller for accepting new information (cache-line replacement).
An embodiment of the present invention utilizes a core logic chipset having an integral cache memory and memory controller. The core logic chipset has fail-over circuit logic that receives ECC error signals from the memory controller and that sets and clears a fail-over memory status bit associated with each cache-line of the cache memory. When an excessive ECC error signal is received by the fail-over circuit logic, a caching operation is performed on the affected main memory location(s) and the fail-over memory status bit is set. No further activity to the failing main memory occurs and any subsequent memory accesses, either read or write, are made directly and only to the fail-over memory cache-line in the cache memory of the core logic chipset. When the failed or ECC error prone memory is replaced, either through hot plug replacement or during a planned maintenance shutdown, the fail-over circuit logic will reset the fail-over status bit so that the associated cache-line may become available for accepting new information according to the replacement algorithm, and writeback to main memory is re-enabled. The fail-over circuit logic may be implemented in hardware, firmware, software, or any combination thereof.
Another embodiment of the invention utilizes a core logic chipset having an external cache memory, and internal cache and main memory controllers. The core logic chipset has a fail-over circuit as described above and performs similar functions as disclosed above. The external cache memory may be used in conjunction with a cache memory internal to the core logic chipset, which may operate as strictly conventional cache memory or may serve the dual purpose of being a fail-over backup to the external cache memory, which may be used in this embodiment as the fail-over memory. The internal and external cache memories may have error detection and correction circuits just like the main memory. Typically, cache memory comprises static random access memory (SRAM) which is usually more robust and reliable than main memory which typically comprises dynamic random access memory (DRAM).
Yet another embodiment of the invention utilizes a core logic chipset adapted for controlling a cache memory in a microprocessor. The core logic chipset has a fail-over circuit as described above and performs similar functions as disclosed above. The microprocessor cache memory, integrated with the microprocessor, may be used as the fail-over memory in computer systems not having or requiring cache memory in the core logic chipset or having cache memory external to the core logic chipset. The processor cache memory may be used in conjunction with a cache memory internal to the core logic chipset and/or a cache memory external to the core logic chipset. The processor cache memory may serve the dual purpose of being a fail-over backup to the internal and/or external cache memory which may also be used in this embodiment as fail-over memory. The processor cache memory may have error detection and correction circuits just like the main memory. It is contemplated and within the scope of the present invention that a plurality of processors may be utilized with the embodiments of the invention disclosed herein. The microprocessor cache memory may also have a memory controller integrated in the microprocessor package.
It is also contemplated and within the scope of the present invention that any or all of a plurality of cache memories located in the microprocessor(s), in the core logic chipset, and/or cache memory external to the core logic chipset may store failing memory locations and may migrate these cache-lines containing the failing memory locations between the various cache memories of the computer system. For example, failing memory locations may be initially stored in a plurality of cache-lines of microprocessor cache memory but it would be more advantageous to store these failing memory locations in cache memory external to the core logic chipset because the external cache memory has a larger memory capacity than does the microprocessor cache memory. The embodiments of the present invention may write the contents of the cache-lines of a first cache memory being used as fail-over memory to corresponding cache-lines of a second cache memory and then set the corresponding fail-over memory status bit(s) of the second cache memory. Once the contents of the first cache memory have been successfully transferred to the second cache memory, the corresponding fail-over memory status bit(s) of the first cache memory may be reset. This operation is also possible between the second and third cache memories, etc.
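The ordering of the migration steps described above (copy the contents, set the status bit in the second cache memory, and only then reset the status bit in the first) can be sketched as follows. Representing each cache memory as a dictionary of lines and the function name `migrate_failover_lines` are assumptions made purely for illustration.

```python
def migrate_failover_lines(first, second):
    """Move every fail-over line from cache `first` to cache `second`.

    Each cache is modeled as a dict: line address -> {"data", "failover"}.
    Returns the list of line addresses that were migrated.
    """
    moved = []
    for addr, line in list(first.items()):
        if not line["failover"]:
            continue  # ordinary cache lines are left alone
        # Step 1: write the contents to the second cache and set its status bit.
        second[addr] = {"data": line["data"], "failover": True}
        # Step 2: only after the transfer succeeds, reset the bit in the first cache.
        line["failover"] = False
        moved.append(addr)
    return moved


processor_cache = {
    0x10: {"data": b"failing line", "failover": True},
    0x20: {"data": b"ordinary line", "failover": False},
}
external_cache = {}
migrate_failover_lines(processor_cache, external_cache)
```

Keeping the first copy locked until the second copy is in place means that at every moment at least one cache memory holds the line with its fail-over memory status bit set.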
When the processor and memory controller are independent, additional handshaking signals/messages may be needed for read and/or write operations. The processor would need to keep the cache-line to be written to memory in its cache until an acknowledgement has been received from the memory controller. In multi-processor systems where the processors and memory controller are independent, the aforementioned handshaking signals/messages are necessary as well as support for migrating cache-lines (described above) used for fail-over memory. If a processor has a cache-line that has been marked as fail-over memory (by setting the corresponding fail-over memory status bit), then when another processor requests ownership of the cache-line, ownership as well as the fail-over memory status must be passed. This is due to the MESI protocol where ownership of a cache-line is allowed to migrate. To ensure that the migrating fail-over memory is never lost, the fail-over memory status bit is also passed.
In conjunction with the embodiments disclosed above, the present invention may also comprise circuit logic and software for: determining in a memory fail-over mode whether a memory location is failing by determining if a fault (error) is detected; logging the address of the memory location having the error; determining whether the error exceeds a predefined threshold and, if so, indicating that a memory fail-over has been initiated; analyzing which memory module requires replacement; mapping the memory module contents to be failed-over to a cache-line of the cache memory; setting the cache-line fail-over memory status bit to prevent the cache-line from being written over; replacing the failed memory module with a new memory module; notifying the computer system that a new memory module has replaced the faulty memory module; writing the fail-over cache-line back to the newly replaced memory module; and resetting the cache-line fail-over memory status bit so that the associated cache-line may be used again by the cache-line replacement algorithm.
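The logging and threshold portion of this sequence can be sketched in software. The class `FailOverMonitor`, the threshold value and the interpretation of "exceeds" as a simple per-line error count are hypothetical choices for illustration; the disclosure does not specify how the threshold is defined, and the cache operations themselves are not modeled here.

```python
from collections import Counter

ERROR_THRESHOLD = 3  # hypothetical value; the disclosure only says "predefined"


class FailOverMonitor:
    """Tracks ECC errors per cache-line address and decides when to fail over."""

    def __init__(self, threshold=ERROR_THRESHOLD):
        self.threshold = threshold
        self.error_log = Counter()  # line address -> number of logged errors
        self.failed_over = set()    # line addresses currently held in cache

    def report_error(self, line_address):
        """Log one detected error; return True if this error initiates a fail-over."""
        self.error_log[line_address] += 1
        if line_address in self.failed_over:
            return False            # already remapped to cache memory
        if self.error_log[line_address] > self.threshold:
            self.failed_over.add(line_address)
            return True             # caller would remap the line and set the status bit
        return False

    def module_replaced(self, line_addresses):
        """Called once a new module is installed; returns the lines to write back."""
        released = self.failed_over & set(line_addresses)
        self.failed_over -= released
        for addr in released:
            del self.error_log[addr]  # the new module starts with a clean history
        return sorted(released)
```

A caller would remap the line and set its fail-over memory status bit when `report_error` returns True, and would write back and unlock each address returned by `module_replaced`.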
An advantage of the present invention is that the contents of a faulty memory location(s) may fail-over to a cache-line of cache memory without disturbing the normal operation of the computer system, requiring specially designed memory modules, or modifying the operating system software or drivers.
Another advantage is that using cache memory allows the computer system to always have the same memory size to operate from without having to reboot the computer system and then having less main memory to work from.
Still another advantage is using SRAM of the cache memory as fail-over memory which is generally more robust and reliable than DRAM of the main memory.
Still another advantage is a performance boost from the cache memory during normal operation of the computer system.
Yet another advantage is that no portion of main memory need be reserved for standby memory.
A feature of the present invention is that standard memory modules may be utilized.
Another feature is that the fail-over cache memory may be located within the core logic chipset.
Another feature is that the fail-over cache memory may be located external to the core logic chipset.
Still another feature is that the fail-over cache memory may be located within a processor(s) of the computer system.
Still another feature is that LEDs may be used to indicate the location and status of faulty and/or new memory modules.
Other and further features and advantages will be apparent from the following description of presently preferred embodiments of the invention, given for the purpose of disclosure and taken in conjunction with the accompanying drawings.