The present invention generally relates to redundant controller systems and data storage systems employing redundant controllers, and more particularly to a redundant controller data storage system having an on-line controller removal system and method.
Multiple controller systems are used for providing highly reliable, redundant data storage systems. For example, in the hard disk drive industry, multiple controller systems are used as part of a RAID (short for redundant array of independent disks) system which employs two or more disk drives in combination for improved disk drive fault tolerance and disk drive performance. In operation, RAID systems employ multiple controllers for redundancy. The multiple controllers stripe a user's data across multiple hard disks. The array can operate from any one controller. When multiple controllers are present, the controllers are used for improved performance and/or increasing the number of host computer system connection ports. When accessing data, the multiple controller RAID system allows all of the hard disks to work at the same time, providing a large increase in speed and reliability.
A RAID system configuration is defined by different RAID levels. The RAID levels start at LEVEL 0, which provides data striping (spreading out the data blocks of each file across multiple hard disks), resulting in improved disk drive speed and performance but no redundancy. RAID LEVEL 1 provides disk mirroring, resulting in 100 percent redundancy of data through mirrored pairs of hard disks (i.e., identical blocks of data written to two hard disks). Other RAID LEVELS provide variations of data striping and disk mirroring, and also provide improved error correction for increased performance, fault tolerance, efficiency, and/or cost.
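As a minimal illustration of the two levels described above, the following Python sketch maps a logical block number to disk locations under striping and under mirroring. The function names, the round-robin layout, and the single mirrored pair are illustrative assumptions and are not taken from any particular RAID implementation.

```python
def stripe_location(block: int, num_disks: int) -> tuple[int, int]:
    """RAID LEVEL 0, assuming a round-robin layout.

    Returns (disk index, block offset on that disk).
    """
    return block % num_disks, block // num_disks


def mirror_locations(block: int) -> list[tuple[int, int]]:
    """RAID LEVEL 1, assuming a single mirrored pair (disks 0 and 1).

    The identical block is written to both disks at the same offset.
    """
    return [(0, block), (1, block)]
```

Under these assumptions, consecutive blocks land on different disks when striped, so they can be accessed in parallel, while a mirrored block survives the loss of either disk in the pair.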
A RAID 5 LEVEL breaks the data into blocks and stripes these across disk drives. A parity block is calculated from the data blocks and also stored to disk. All data and parity blocks are stored on different disks (striped). A failure of any one disk drive results in the loss of only one data block or the parity block. The array can then mathematically recreate the lost block. RAID 5 also rotates the disks where the data and parity blocks are stored, i.e., all disks will have some parity blocks stored on them. A RAID 6 LEVEL takes this one step further and calculates two “parity” blocks using different mathematical formulas. This allows the array to have two failed disk drives and still be able to recreate all data.
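The single-parity scheme described above can be sketched in Python using byte-wise XOR as the parity calculation, which is the usual choice for RAID 5. The function names and the particular rotation rule in parity_disk are assumptions made for illustration; the second parity formula used by RAID 6 is not shown.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_disk(stripe: int, num_disks: int) -> int:
    """Assumed rotation rule: parity moves one disk to the left on each stripe."""
    return (num_disks - 1 - stripe) % num_disks


def write_stripe(data_blocks, stripe, num_disks):
    """Lay out num_disks - 1 data blocks plus one parity block across the disks."""
    if len(data_blocks) != num_disks - 1:
        raise ValueError("a stripe holds num_disks - 1 data blocks")
    layout = [None] * num_disks
    p = parity_disk(stripe, num_disks)
    layout[p] = xor_blocks(data_blocks)
    data_iter = iter(data_blocks)
    for disk in range(num_disks):
        if disk != p:
            layout[disk] = next(data_iter)
    return layout


def rebuild(layout, failed_disk):
    """Recreate the block on the failed disk by XOR-ing the surviving blocks."""
    survivors = [blk for disk, blk in enumerate(layout) if disk != failed_disk]
    return xor_blocks(survivors)
```

Because XOR is its own inverse, the same rebuild routine recovers either a lost data block or a lost parity block, which is why the loss of any one disk is tolerated.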
Known multiple controller systems include a mirrored dual controller data storage system. Each controller includes its own memory, most of which is the “mirror image” or the same “memory image” as the other. The use of mirrored memory in dual controllers allows for fast recovery and prevents data loss in case of failure or loss of one controller or its memory. Without the mirror copy of memory, important data on one controller would be lost if that controller suddenly failed. For example, in a mirrored memory dual controller system having Controller A and Controller B, mirrored reads and writes result in the Controller A memory being the “mirror image” of Controller B memory. Upon the loss or failure of Controller B, all system operations are automatically switched over to Controller A, such that Controller A runs or operates the entire system.
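The mirrored write and failover behavior described above can be modeled with a short Python sketch. The Controller and MirroredPair classes, and the use of a dictionary to stand in for controller memory, are hypothetical simplifications chosen for illustration only.

```python
class Controller:
    def __init__(self, name):
        self.name = name
        self.memory = {}      # stands in for the controller's mirrored memory
        self.failed = False


class MirroredPair:
    def __init__(self):
        self.a = Controller("A")
        self.b = Controller("B")

    def write(self, address, value):
        """Mirrored write: every operational controller stores the same value."""
        for controller in (self.a, self.b):
            if not controller.failed:
                controller.memory[address] = value

    def active(self):
        """Return an operational controller; operations switch over to it."""
        for controller in (self.a, self.b):
            if not controller.failed:
                return controller
        raise RuntimeError("no operational controller")

    def read(self, address):
        return self.active().memory[address]
```

In this model, marking Controller B as failed leaves Controller A holding the same memory image, so reads continue without data loss.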
An increasing number of computer system applications require very high degrees of reliability, including very limited processor downtime. For example, one known system requires the aggregate controller downtime to be less than five minutes per year. Loss or failure of one controller typically requires immediate replacement to maintain redundancy and reliability for the associated data storage system. Due to the above requirements, systems requiring a high degree of reliability and “uptime” typically require on-line or “hot” insertion of a replacement controller during which the other controller (e.g., Controller A) remains operational. The operating system automatically recognizes the insertion of the replacement controller.
Typically the multiple controller system is connected to a host. As such, the host systems often require that the replacement of a controller board does not bring down the data storage system for a significant amount of time, resulting in a host system timeout. Insertion of the replacement controller into an operational system often causes system availability loss while the replacement controller is tested and added to the operational system. When the replacement controller is being added as part of a mirrored memory system, problems associated with adding the replacement controller into an operational system are increased.
In one known mirrored memory dual controller system, with Controller A operating in the system and replacement Controller B being hot inserted, the hot insertion process includes both Controller A and replacement Controller B being reset and each controller performing a processor's subsystem self-test. Each controller tests its own shared memory system to verify the hardware is functioning correctly. Each controller checks its shared memory contents to see if the memory image is “valid” for its system. In this example, only Controller A will have a valid memory image of the system.
Next, the controllers exchange information about their revision, last view of the system, and the system status the last time the system was active. After sharing this information, the firmware determines which controller has the valid memory image. In this example, Controller A has the valid memory image. Controller A's shared memory image is copied to Controller B and verified. This requires the processor on Controller A to read all shared memory on Controller A and write to all shared memory locations on Controller B. The memories on both controllers are then read and compared to verify that the copy operation was successful. For large memory systems, this process takes several minutes. Final configuration steps are performed, and the controllers are brought on-line and are fully operational. Many steps within the above process can take tens of seconds to perform. The process of copying the Controller A shared memory image to Controller B and verifying it can take several minutes. During this extended period of time required for hot insertion, most host computer operating systems will time out.
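The copy-and-verify step described above can be sketched in Python as two full passes over the memory image. The function name, the chunk size, and the use of bytearray objects to represent shared memory are assumptions for illustration; they do not describe the firmware of any known system.

```python
def copy_and_verify(source: bytearray, target: bytearray, chunk: int = 4096) -> bool:
    """Copy the valid image into the replacement memory, then compare both."""
    if len(source) != len(target):
        raise ValueError("memory sizes must match")

    # Pass 1: read every location of the source and write it to the target.
    for start in range(0, len(source), chunk):
        target[start:start + chunk] = source[start:start + chunk]

    # Pass 2: read both memories back and compare them.
    for start in range(0, len(source), chunk):
        if target[start:start + chunk] != source[start:start + chunk]:
            return False
    return True
```

Both passes touch every byte of shared memory, so the time needed grows in proportion to the memory size, which is consistent with the delay noted above for large memory systems.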
It is desirable to have a hot insertion system and method for use in a redundant, mirrored memory multiple controller system which reduces system downtime and does not result in a time-out of the host computer operating system. Further, it is desirable to have an efficient method of handling controller resets which minimizes system downtime or host time-outs.
The present invention relates to multiple controller systems and data storage systems employing redundant controllers, and more particularly to a redundant controller data storage system having an on-line controller removal system and method.
In one embodiment, the present invention provides a method of on-line removal of a controller from a redundant controller system. The redundant controller system includes a first controller and a second controller. The method includes detecting partial removal of the first controller from the redundant controller system. A shut-down sequence is performed on the first controller and the second controller, including completing outstanding memory accesses. The first controller is defined to have a first memory, and the first memory is placed into a self-refresh mode. Removal of the first controller from the redundant controller system is finished after completion of the self-refresh mode by the first memory.
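The ordering of the steps in the method above can be sketched as a small state machine in Python. The class, state names, and the lists standing in for outstanding memory accesses are hypothetical and serve only to illustrate the sequence; they are not the claimed implementation.

```python
from enum import Enum, auto


class State(Enum):
    OPERATIONAL = auto()
    PARTIAL_REMOVAL_DETECTED = auto()
    SHUT_DOWN = auto()
    SELF_REFRESH = auto()
    REMOVED = auto()


class RemovalSequence:
    def __init__(self, first_outstanding, second_outstanding):
        self.state = State.OPERATIONAL
        # Hypothetical queues of memory accesses still pending on each controller.
        self.outstanding = {
            "first": list(first_outstanding),
            "second": list(second_outstanding),
        }
        self.completed = []

    def _advance(self, expected, new_state):
        if self.state is not expected:
            raise RuntimeError(f"cannot enter {new_state.name} from {self.state.name}")
        self.state = new_state

    def detect_partial_removal(self):
        self._advance(State.OPERATIONAL, State.PARTIAL_REMOVAL_DETECTED)

    def shut_down(self):
        """Complete outstanding memory accesses on both controllers."""
        self._advance(State.PARTIAL_REMOVAL_DETECTED, State.SHUT_DOWN)
        for queue in self.outstanding.values():
            while queue:
                self.completed.append(queue.pop(0))

    def enter_self_refresh(self):
        """Place the first memory into self-refresh mode."""
        self._advance(State.SHUT_DOWN, State.SELF_REFRESH)

    def finish_removal(self):
        self._advance(State.SELF_REFRESH, State.REMOVED)
```

Calling the methods in the order detect_partial_removal, shut_down, enter_self_refresh, finish_removal follows the sequence of the embodiment; any other order raises an error in this sketch.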
In another embodiment, the present invention provides a method of on-line removal of a controller from a redundant controller system. The redundant controller system includes a first controller and a second controller. The first controller is defined to include a first processor and the second controller is defined to include a second processor. Partial removal of the first controller from the redundant controller system is detected. A shut-down sequence is performed on the first controller and the second controller, including interrupting the first processor and interrupting the second processor, and completing outstanding memory accesses for the first controller and the second controller. The first controller is defined to have a first memory, and the first memory is placed into a self-refresh mode. Removal of the first controller from the redundant controller system is finished after completion of the self-refresh mode by the first memory.
In another embodiment, the present invention provides a redundant controller system configured for on-line removal of a redundant controller. The system includes a first controller including a first memory, a first processor, and a system for early detection of partial removal of the first controller. Upon detection of partial removal of the first controller, the first controller performs a shut-down sequence, including completing memory accesses outstanding on the first memory. A second controller includes a second memory and a second processor. Upon detection of partial removal of the first controller, the second controller performs a shut-down mode, including completing memory accesses outstanding on the first memory. After completion of the shut-down sequence by the first controller and the second controller, the second memory is placed into a self-refresh mode, and removal of the first controller from the redundant controller system is completed.