This invention relates to computer systems in general, and more specifically to automatically replacing faulty memory devices while the system is operating.
Increasingly, computer systems are used in applications that require extremely high levels of availability. Well-known examples include network servers, online trading systems, and air traffic control systems. Other applications that require a high degree of availability include embedded systems in industrial controls, medical systems, and power distribution equipment.
To improve the reliability and availability of such systems and to decrease their downtime, improvements have been made to I/O systems and secondary storage devices. One example is the use of disk arrays, which improve the reliability of secondary storage and decrease downtime due to disk failures.
Improvements to primary storage devices have been limited. Previously, when a faulty memory device was discovered, the system could avoid accessing that device until the system was turned off and the device manually replaced. This approach reduced the incidence of drastic system failures but degraded system performance. Therefore, a method of replacing faulty memory devices without degrading system performance would be useful.
According to one aspect of the invention, a method of automatically replacing a faulty semiconductor-based memory device with a spare semiconductor-based memory device is provided. This method consists of stalling access requests to both the faulty memory device and the spare memory device, copying the contents of the faulty memory device to the spare memory device, swapping the device ID of the spare memory device for the device ID of the faulty memory device, and re-enabling access to both memory devices.
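The four steps of this method can be illustrated with a small software model. The following Python sketch is illustrative only and is not the claimed hardware implementation. The names MemoryDevice, MemoryController, replace, and the stalled flag are hypothetical, and the model assumes the spare device has the same capacity as the faulty one.

```python
from dataclasses import dataclass


@dataclass
class MemoryDevice:
    """Hypothetical model of one memory device addressed by a device ID."""
    device_id: int
    cells: bytearray
    stalled: bool = False


class MemoryController:
    """Hypothetical controller that routes accesses by device ID."""

    def __init__(self, devices):
        self.devices = list(devices)

    def _lookup(self, device_id):
        for dev in self.devices:
            if dev.device_id == device_id:
                return dev
        raise KeyError(device_id)

    def read(self, device_id, offset):
        dev = self._lookup(device_id)
        if dev.stalled:
            # A real controller would hold the request; the model just refuses it.
            raise RuntimeError("access stalled")
        return dev.cells[offset]

    def replace(self, faulty_id, spare_id):
        faulty = self._lookup(faulty_id)
        spare = self._lookup(spare_id)
        if len(faulty.cells) != len(spare.cells):
            raise ValueError("spare must match the faulty device's capacity")

        # Step 1: stall access requests to both devices.
        faulty.stalled = spare.stalled = True
        try:
            # Step 2: copy the contents of the faulty device to the spare.
            spare.cells[:] = faulty.cells
            # Step 3: swap the device IDs so the spare answers to the old ID.
            faulty.device_id, spare.device_id = spare.device_id, faulty.device_id
        finally:
            # Step 4: re-enable access to both devices.
            faulty.stalled = spare.stalled = False


faulty = MemoryDevice(device_id=0, cells=bytearray(b"\x01\x02\x03"))
spare = MemoryDevice(device_id=7, cells=bytearray(3))
controller = MemoryController([faulty, spare])
controller.replace(faulty_id=0, spare_id=7)
value = controller.read(0, 1)  # now served by the former spare
```

Because the device IDs are swapped rather than the addresses remapped, requesters in this model keep using device ID 0 and are unaware that a different physical device now services the request.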