1. Field of the Invention
The present invention relates to a computer system having hot-pluggable components such as memory modules, and more particularly, to replacing, upgrading and/or adding the hot-pluggable memory modules without disturbing operation of the computer system.
2. Description of the Related Technology
Use of computers, especially personal computers, in business and at home is becoming more and more pervasive because the computer has become an integral tool of most information workers who work in the fields of accounting, law, engineering, insurance, services, sales and the like. Rapid technological improvements in the field of computers have opened up many new applications heretofore unavailable or too expensive for the use of older technology mainframe computers. These personal computers may be used as stand-alone workstations (high end individual personal computers) or linked together in a network by a "network server" which is also a personal computer which may have a few additional features specific to its purpose in the network. The network server may be used to store massive amounts of data, and may facilitate interaction of the individual workstations connected to the network for electronic mail ("e-mail"), document databases, video teleconferencing, whiteboarding, integrated enterprise calendar, virtual engineering design and the like. Multiple network servers may also be interconnected by local area networks ("LAN") and wide area networks ("WAN").
A significant part of the ever increasing popularity of the personal computer, besides its low cost relative to just a few years ago, is its ability to run sophisticated programs and perform many useful and new tasks. The personal computer thus has become an indispensable part of business and the economy. The network server plays an even more critical part in business computing by tying together the personal computer workstations into a network (LAN and WAN), and by storing and/or forwarding critical business information. This business information may comprise, for example: databases, word processing documents, spreadsheets, drawings, graphics, e-mail, pictures, and the like. Therefore, it is imperative that the computer system have high availability.
Replacement of defective components in a computer system, or expansion of the capabilities thereof, typically requires shutting down the computer needing service. If a network server has a defective component or requires an upgrade, the potential for rendering inoperative all of the personal computers depending on that server is very high. The necessity for high availability of server operation has created hot-pluggable component designs as more fully described in commonly owned U.S. patent application Ser. No. 08/940,419, filed Sep. 30, 1997, entitled "Installation and Removal of Components of a Computer" by Sompong P. Olarig, Kenneth A. Jansen and Paul A. Santeler, which is hereby incorporated by reference. Hot-pluggable component designs, however, have typically required specially designed hardware and software modifications to the operating system software for implementation thereof.
There are some components in a computer system that may become defective but still function for their intended purpose for a period of time. For example, system memory comprising a plurality of random access memory (RAM) devices may have added circuitry for error checking and correction (ECC) which allows memory storage bit errors to be corrected without the data stored therein being corrupted. ECC covers up memory storage bit errors, but it does not fix the RAM device(s) causing the problem. Ultimately, the defective RAM device(s) must be replaced before further degradation, which ECC cannot correct, causes the computer system to become inoperative. Upgrading and/or adding RAM devices without shutting down the computer system is also highly desirable.
What is needed is a system, method and apparatus for replacing failing but functional computer components, upgrading and/or adding components without disturbing the normal operation of the computer system, without requiring specially designed hot-pluggable components, and without installing special operating system software patches.
The present invention overcomes the above-identified problems as well as other shortcomings and deficiencies of existing technologies by providing in a computer system an apparatus, method and system for replacement of existing faulty components or expansion/addition of components without requiring the computer system to be brought off-line during operation thereof. Component replacement may be performed by logic and firmware within the computer system and requires no software operating system, application programming interface (API) or driver modifications. Industry standard components may be used with the present invention for the hot-pluggable capabilities.
An example of a replacement component is a random access memory (RAM) module that is plugged into a computer system memory connector. Other computer system components are contemplated herein and are within the scope of the invention. Some of these components may be, for example but not limitation: modem, network, video, SCSI, disk drive, and fibre channel interfaces; a processor in a multi-processor system, etc.
An embodiment of the present invention utilizes one or more RAM module connectors or memory connectors which may be electrically isolated from the computer system. Isolation, preferably, is obtained with field effect transistor (FET) signal isolation buffers interposed between the connector signal pins and a memory controller. Power FET switches are interposed between the connector power pins and the computer system power buses. Each isolatable memory connector may also contain one or more software and/or hardware controllable light emitting diodes (LEDs) of various colors (electronically controllable) to indicate the status of the corresponding memory connector. Control and sense logic is implemented to control the operation of the FET signal isolation buffers, power FET switches, LEDs, and memory bus operation between the system memory controller and the isolatable memory connectors. This control and sense logic may monitor ("snoop") memory bus activity so that a memory connector may be brought on or off line only when the memory bus is inactive, i.e., no read or write activities.
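By way of illustration only, the gating of connector state changes on memory bus inactivity may be sketched in software as follows. This is a minimal model and not the disclosed hardware logic: the names `MemoryConnector`, `Led` and `set_connector_online`, the LED colors, and the ordering of the power and signal steps are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Led(Enum):
    # LED colors are assumptions; the text says only "various colors".
    GREEN = "green"   # connector on-line
    AMBER = "amber"   # connector isolated


@dataclass
class MemoryConnector:
    """Hypothetical model of one isolatable memory connector."""
    signal_buffers_enabled: bool = False  # FET signal isolation buffers
    power_switch_closed: bool = False     # power FET switch
    led: Led = Led.AMBER

    @property
    def online(self) -> bool:
        return self.signal_buffers_enabled and self.power_switch_closed


def set_connector_online(conn: MemoryConnector, online: bool, bus_active: bool) -> bool:
    """Change connector state only while the snooped memory bus is idle.

    Returns True if the change was applied, False if it must be retried.
    """
    if bus_active:
        return False  # a read or write is in progress; defer the change
    if online:
        # Assumed ordering: apply power before connecting the signal pins.
        conn.power_switch_closed = True
        conn.signal_buffers_enabled = True
        conn.led = Led.GREEN
    else:
        # Assumed ordering: disconnect the signal pins before removing power.
        conn.signal_buffers_enabled = False
        conn.power_switch_closed = False
        conn.led = Led.AMBER
    return True
```

In this sketch a request made while the bus is active is simply refused, and the caller is expected to retry once the snooped bus reports no read or write activity.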
When an undesirable number of correctable memory errors occur, the present invention enables on-line replacement of the faulty memory without disturbing the normal operation of the computer system. Correctable memory errors are the most prevalent errors encountered in a computer system having ECC memory. When a particular memory connector having faulty memory is detected, appropriate management traps and alerts are initiated through system software and/or hardware. An indicator LED associated with the faulty memory connector may be activated in such a way as to indicate the condition and location of that connector. In addition, another LED associated with a spare memory connector may indicate that the spare connector either contains or is available to accept a replacement memory module. Because ECC corrects these errors, the faulty memory module will continue to return correct data to the computer system. Upon inspection of the memory connectors and LEDs associated therewith, an operator or technician can see the status of each memory connector and may determine which connector has the faulty memory module and which memory connector has or can accept a new memory module to replace the faulty one.
The unused memory connector, as indicated by the associated LED, is completely isolated from the computer system memory by a set of FET isolation buffers and from the computer system power buses by power FET switches. This unused memory connector either has or can accept insertion of a new memory module without affecting normal operation of the computer system. Upon need and detection of a new memory module in the spare memory connector, an interrupt may be generated to the computer system either by the user pressing an "ok to swap memory" switch, or by some automatic mechanism. This interrupt may be directed to a System Management Mode (SMM) interrupt handler. Upon receipt of this interrupt, SMM code, running above and protected from the operating system software, enables control and detection logic which causes all memory writes that are directed to the faulty memory module to also be directed to the new memory module recently detected and connected to the computer system. Memory system timing may be briefly relaxed to accommodate the increased bus loading of simultaneously addressing and accessing a second memory module.
In order to replace the faulty memory module, the contents of the faulty module must be copied to the replacement memory module. This may be accomplished by system management software (running in SMM) periodically interrupting the operating system software to copy a specified amount of data from the faulty memory module to the replacement memory module, and to enable and disable appropriate hardware to steer the data. Only a small portion of the faulty memory module is copied at a time so that critical operating system tasks are not suspended for too long. The SMM handler periodically invokes transfers of small portions of data from the faulty memory module to the new memory module until all of the data contained in the faulty memory module has been copied to the new memory module, i.e., the contents of the new memory module are fully synchronized with the contents of the old memory module. Then the faulty memory module is taken "off-line" and the new memory module is fully enabled to transact both read and write operations in the computer system. It is contemplated and within the scope of the embodiments of the invention that the contents of the faulty memory module may be transferred to the new memory module by the memory controller, independently of the normal operation of the computer system. The memory controller may synchronize the contents of the faulty and new memory modules when not being addressed by the processor(s) of the computer system (background mode of operation).
This "background process" ensures that the entire contents of the faulty memory module are copied to the new "good" memory module. The control logic ensures that all writes which may occur during the copying process are written to both the faulty and new memory modules. Consequently, the new replacement memory module will be completely data coherent (synchronized) once the SMM routine is finished copying all of the data from the faulty memory module.
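The interaction between the mirrored writes and the incremental copy may be illustrated by the following sketch, which models each memory module as a byte array. It is offered only as an aid to understanding and is not the disclosed SMM code: the class `MirroredCopy`, its method names, and the `chunk_size` parameter are assumptions introduced for the example.

```python
class MirroredCopy:
    """Hypothetical model of write mirroring plus an incremental copy."""

    def __init__(self, faulty: bytearray, replacement: bytearray, chunk_size: int = 4):
        if len(replacement) < len(faulty):
            raise ValueError("replacement module is smaller than the faulty module")
        self.faulty = faulty
        self.replacement = replacement
        self.chunk_size = chunk_size  # amount copied per periodic invocation
        self.next_offset = 0          # first address not yet copied

    def write(self, addr: int, value: int) -> None:
        # Control logic: every write to the faulty module is also
        # directed to the replacement module.
        self.faulty[addr] = value
        self.replacement[addr] = value

    def read(self, addr: int) -> int:
        # Until the copy completes, reads are served by the faulty module,
        # whose data is still correct because ECC corrects the errors.
        return self.faulty[addr]

    def copy_next_chunk(self) -> bool:
        """Copy one small portion; return True once the whole module is copied."""
        end = min(self.next_offset + self.chunk_size, len(self.faulty))
        self.replacement[self.next_offset:end] = self.faulty[self.next_offset:end]
        self.next_offset = end
        return self.next_offset >= len(self.faulty)
```

A write that lands in a region already copied is preserved by the mirroring, and a write that lands in a region not yet copied is picked up when that region is later copied, so the two modules hold identical contents when `copy_next_chunk` finally returns True.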
As soon as all of the data in the faulty memory module is copied to the new memory module, the SMM process disables the memory connector containing the faulty memory module through the FET isolation buffers and FET power switch(es) and may use the associated LED to indicate that the copying and isolation processes have finished. The LED may also be used to indicate that the faulty memory module may be safely removed from the computer system without disturbing normal operation thereof. In addition, the control logic now steers all memory reads/writes to the new replacement memory connector and the memory bus timing may be restored to full speed since only one memory module is now being addressed and accessed at a time. In this way, an entire memory module may be replaced without interruption of normal operation of the computer system.
The aforementioned embodiment of the invention also may be automatically performed if a "spare" memory module has been previously installed into a "spare" memory connector. In this case, the present invention may function as a transparent automatic memory fail-over. Upon detection of an undesirable number of ECC errors from a particular memory module, the aforementioned data replication and isolation procedures may be invoked to copy the data from the failing or faulty memory module to the new memory module. This allows automatic unattended replacement of a faulty memory module(s) for remote and/or critical server applications.
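One possible policy for triggering the automatic fail-over may be sketched as follows. The description does not specify how an "undesirable" number of errors is determined, so the fixed per-connector `threshold`, the class `FailoverMonitor`, and its method names are assumptions introduced solely for illustration.

```python
from collections import Counter
from typing import List, Optional, Tuple


class FailoverMonitor:
    """Hypothetical policy: begin fail-over after `threshold` correctable errors."""

    def __init__(self, threshold: int, spare_connectors: List[int]):
        self.threshold = threshold
        self.spares = list(spare_connectors)  # connectors holding spare modules
        self.errors = Counter()               # correctable-error count per connector
        self.failing = set()                  # connectors already being replaced

    def record_correctable_error(self, connector: int) -> Optional[Tuple[int, int]]:
        """Return (faulty, spare) when replication should begin, else None."""
        self.errors[connector] += 1
        if connector in self.failing or self.errors[connector] < self.threshold:
            return None
        if not self.spares:
            # No spare module installed: only the alert and LED path applies.
            return None
        self.failing.add(connector)
        return connector, self.spares.pop(0)
```

When a pair is returned, the data replication and isolation procedures described above would be invoked for that faulty connector and spare connector.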
Another embodiment of the invention allows on-line memory expansion. The operating system, however, needs to be notified of the added memory so that it can see and use it. The new memory module may be inserted into a spare memory connector, where its presence is detected; the module is then tested, initialized, and made available to the operating system.
A combination of replacing a faulty memory module(s) and upgrading memory may also be performed by installing a new memory module having a larger capacity than the faulty module. The data from the faulty module is transferred to the new module, the remaining extra capacity of the new memory module is initialized, and the operating system is informed of the added memory capacity.
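The combined replacement and upgrade may be illustrated by the following sketch, in which modules are again modeled as byte arrays. The function name `replace_with_larger` and the zero `fill` value used to initialize the extra capacity are assumptions; the description does not state how the extra capacity is initialized.

```python
from typing import Tuple


def replace_with_larger(faulty: bytes, new_capacity: int, fill: int = 0x00) -> Tuple[bytearray, int]:
    """Copy `faulty` into a larger module and report the added capacity.

    Returns the new module contents and the number of added bytes that
    would be reported to the operating system.
    """
    if new_capacity < len(faulty):
        raise ValueError("new module must be at least as large as the faulty module")
    new_module = bytearray([fill]) * new_capacity  # initialize the whole module
    new_module[:len(faulty)] = faulty              # transfer the existing data
    added = new_capacity - len(faulty)             # extra capacity now available
    return new_module, added
```

For example, replacing a 3-byte module with a 5-byte module preserves the original 3 bytes and reports 2 bytes of added capacity.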
Upgrading the capacity of memory in a computer system may be easily performed, according to the present invention. Larger capacity memory module(s) may be installed into "spare" memory connector(s) and the contents of the smaller capacity memory modules functioning in the computer system may be transferred to the new larger capacity memory modules as described above. When all data has been transferred, the lower capacity memory modules may be disabled and removed from the computer system memory connectors and the operating system informed of the increased memory capacity now available. The newly available memory connectors may then be subsequently used for adding or replacing memory modules as described above.
Upgrading or addition of a memory module(s) requires notification to the computer system operating software. This notification may preferably be done by system management software from a remote management console which may be, for example, windows based. It is also contemplated and within the scope of the embodiments of the invention to use hardware switches to designate the additional memory and size thereof.
An advantage of the present invention is that a faulty memory module may be replaced without disturbing the normal operation of the computer system, without requiring specially designed memory modules, and without modifying the operating system software or drivers.
Another advantage is that additional memory modules may be added to the computer system without having to shut down or disturb the normal operation thereof.
A feature of the present invention is that standard memory modules may be utilized.
Another feature is that standard memory module connectors may be utilized.
Still another feature is that LEDs may be used to indicate the location and status of faulty and/or new memory modules, and the location of an appropriate empty memory connector.
Other and further features and advantages will be apparent from the following description of presently preferred embodiments of the invention, given for the purpose of disclosure and taken in conjunction with the accompanying drawings.