1. Field of Invention
This invention relates to modular enclosures for components of redundant array of inexpensive disk (RAID) electronic data storage systems.
2. Prior Art
The acronym RAID refers to systems which combine disk drives for the storage of large amounts of data. In RAID systems the data are recorded by dividing each disk into stripes and interleaving the data, so that the combined storage space consists of stripes from each disk. RAID systems fall under five architectures, RAID 1 through RAID 5, plus one additional type, RAID 0, which is simply an array of disks with data striping and does not offer any fault tolerance. RAID 1-5 systems use various combinations of redundancy, spare disks, and parity analysis to preserve the reading and writing of data in the face of one, and, in some cases, multiple intermittent or permanent disk failures. Ridge, P. M. The Book of SCSI: A Guide for Adventurers. Daly City, Calif., No Starch Press. 1995. P. 323-329.
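As a rough illustration of how parity permits recovery from a single disk failure, the following Python sketch assumes byte-wise XOR parity of the kind used in parity RAID levels such as RAID 5. The function names are hypothetical, and the parity block is kept in a fixed last position for simplicity; actual RAID 5 layouts distribute parity across the disks.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def make_stripe(data_blocks):
    """Return the data blocks followed by one parity block (assumed layout)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def rebuild(stripe, lost_index):
    """Recompute the block at lost_index from all surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)

# One block per disk: three data disks plus one parity disk.
stripe = make_stripe([b"AAAA", b"BBBB", b"CCCC"])
# Losing any single block, data or parity, is recoverable.
recovered = rebuild(stripe, 1)
```

Because XOR is its own inverse, any one missing block equals the XOR of the remaining blocks; two simultaneous losses within the same stripe cannot be recovered by this scheme.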
In order to increase reliability of RAID systems, conventional systems often have two or more controllers which control two or more arrays of direct access storage devices (DASD), each array often containing 6 or more DASDs, generally hard disks. Such RAID systems are arranged so that if one controller fails, another controller will take control of the failed controller's DASD. In particular, in typical conventional RAID systems two controllers are arranged in a single chassis with a common backplane or cables, a common cooling system, and a common power supply. The DASD are arranged in multiple chassis, each of which contains several individual DASD units (termed a “rack” of DASD). Problems arise when there is a failure affecting the shared backplane or cables. When that occurs, both of the controllers may become inactivated or DASD may not be accessible, causing failure of the RAID system.
A backplane (termed a midplane if located near the middle of the chassis containing the controller or channel of DASD) is a circuit board with electronic components such as capacitors, resistors, chips, and connectors. A controller backplane serves to connect the two controllers, so that if one controller fails the other controller can detect the failure and communicate with the failed controller's DASD. A DASD backplane provides connectors into which several DASD can be inserted. The DASD may be connected to each other through one or more busses on the backplane.
Failure of a backplane or cable may be due to physical displacement of connectors, to physical failure of chips, to physical failure of traces on the boards, or to faults in cables or on computer boards. Failure of a common backplane which serves two controllers disrupts communications between the controllers and the DASD. Such an occurrence, while unexpected, has a catastrophic effect on the function of the RAID system, especially when two controllers share a single backplane or midplane, as in conventional RAID systems. In that case the entire RAID system becomes inactive. If data are striped within a single channel of direct access storage devices, the failure of a backplane serving the channel results in loss of data.
An active-active RAID system uses two RAID controllers that simultaneously process input and output (I/O) requests from host computers. The two RAID controllers communicate with one another, so that when one RAID controller fails, the surviving RAID controller takes over the identity of the failed RAID controller, takes over communication to the disks to which the failed RAID controller communicated, and takes over processing all the I/O operations for the RAID system.
After this automatic failover process, the failed RAID controller can be hot swapped, i.e., replaced with a functional RAID controller. The RAID controllers then perform a failback operation and restore the system to its original configuration. Thus, just as redundant disks enable a RAID system to continue operation after a disk fails, redundant RAID controllers in an active-active RAID system enable the system to continue operation after a RAID controller fails.
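The failover and failback sequence described above can be sketched as a small state model. This is a minimal illustration under assumed names (ControllerPair, fail, hot_swap); it tracks only which controller is serving which identity and does not model I/O or disk channels.

```python
class ControllerPair:
    """Toy model of two active-active controllers named "A" and "B"."""

    def __init__(self):
        # Initially each controller serves only its own identity.
        self.serving = {"A": {"A"}, "B": {"B"}}
        self.healthy = {"A": True, "B": True}

    @staticmethod
    def _peer(name):
        return "B" if name == "A" else "A"

    def fail(self, name):
        """Failover: the survivor takes over the failed controller's identity."""
        peer = self._peer(name)
        if not self.healthy[peer]:
            raise RuntimeError("both controllers have failed")
        self.healthy[name] = False
        self.serving[peer] |= self.serving[name]
        self.serving[name] = set()

    def hot_swap(self, name):
        """Failback: a replacement controller resumes its original identity."""
        peer = self._peer(name)
        self.healthy[name] = True
        self.serving[peer].discard(name)
        self.serving[name] = {name}

pair = ControllerPair()
pair.fail("A")        # B now serves both identities
pair.hot_swap("A")    # original configuration restored
```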
While an active-active RAID system can survive the failure of a disk or failure of a RAID controller, there are several other system components whose failure causes loss of data. This is a fundamental problem with prior art active-active systems.
For example, when a disk channel fails, the disks attached to that channel become unavailable. For RAID systems that have two disk channels and use parity RAID (such as RAID 5), the loss of the disks on a channel means the loss of data. This is a catastrophic failure of a RAID system to protect the integrity of data. There are a variety of problems that cause disk channel failure. The disk channel controller chip in a disk can fail and lock the disk channel. The disk channel controller chip in a RAID controller can fail and lock the disk channel. The physical disk channel itself can fail, e.g. as a result of the failure of a cable, a trace, a connector, or a terminator. In addition to these hardware failures, firmware in the disk channel controller chips in the disks or in the RAID controllers can lock a disk channel and cause catastrophic system failure.
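The reasoning behind the two-channel case can be made concrete with a short check. The sketch below assumes a single parity group that tolerates the loss of one member disk, as in RAID 5; the function name and the list-of-channel-numbers representation are hypothetical.

```python
from collections import Counter

def survives_channel_loss(disk_channels, tolerated_failures=1):
    """disk_channels lists, for each disk in one parity group, its channel.

    The group survives the loss of any single channel only if no channel
    carries more disks than the group can afford to lose.
    """
    worst_case = max(Counter(disk_channels).values())
    return worst_case <= tolerated_failures

# Six disks split over two channels: losing a channel removes three disks.
two_channels = survives_channel_loss([0, 0, 0, 1, 1, 1])
# Five disks, each on its own channel: losing a channel removes one disk.
one_per_channel = survives_channel_loss([0, 1, 2, 3, 4])
```

Under these assumptions the two-channel layout fails the check, which matches the data loss described above.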
In addition to disk channel failures, there are other single points of failure in RAID systems. A common example is the backplane into which the RAID controllers are inserted. In the design of most active-active systems each RAID controller plugs into a common backplane. There are many ways in which a backplane can fail that cause the system to fail. Although some backplanes have only passive components to reduce the probability of failure, it is still the case that in most designs an active-active RAID system that uses a single backplane has multiple single points of failure that cause catastrophic data loss.
The communication link between controllers is another site for problems in an active-active RAID system. A link between controllers, sometimes called a heartbeat connection, is used to inform each controller of the status of the other controller. Should one RAID controller fail to send or respond to a signal, the other controller initiates failover activities. If the heartbeat connection fails while both controllers are operating properly, the system can become dysfunctional as both controllers attempt to take over the identity of the other controller and its disks.
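The heartbeat ambiguity can be illustrated with a small truth-table style sketch. It assumes, hypothetically, that a controller initiates failover whenever it is running and does not hear its peer, and that it hears the peer only when the peer is running and the link is intact.

```python
def takeover_attempts(a_alive, b_alive, link_up):
    """Return the set of controllers that would initiate failover."""
    attempts = set()
    # A controller cannot tell a dead peer from a dead link:
    # in both cases the heartbeat simply stops arriving.
    if a_alive and not (b_alive and link_up):
        attempts.add("A")
    if b_alive and not (a_alive and link_up):
        attempts.add("B")
    return attempts

normal = takeover_attempts(True, True, True)        # no failover
peer_failed = takeover_attempts(True, False, True)  # A takes over for B
link_failed = takeover_attempts(True, True, False)  # both try to take over
```

The last case is the dysfunctional one: both controllers are healthy, yet each concludes that the other has failed.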
The RAID system of the present invention avoids the failure of the RAID system or the loss of the data when there is a failure of any board, cable, power supply or cooling system in the controller chassis. In this invention, the two or more controllers which control the RAID system each have independent boards, cables, cooling, and power supply. Loss of one board, cable, cooling system or power supply to one controller does not inactivate the entire RAID system or cause data loss. Similarly, loss of a board, cable, cooling system or power supply to one DASD chassis results in inactivation of the affected DASD, but, since there is adequate redundancy in the racks of DASD units, the RAID system continues to function. In addition, this invention allows hot replacement of a failed controller along with associated backplanes, cables, power supply, or cooling system without interrupting the function of the RAID system.
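The contrast between shared and independent controller components can be sketched as a dependency check. In this illustration, which covers the controller chassis only, a component counts as a single point of failure if every controller depends on it; the component labels and the function name are hypothetical.

```python
def single_points_of_failure(dependencies):
    """dependencies maps each controller to the set of components it needs.

    A component is a single point of failure if its loss disables
    every controller at once.
    """
    all_components = set().union(*dependencies.values())
    return {
        component
        for component in all_components
        if all(component in needed for needed in dependencies.values())
    }

# Conventional arrangement: both controllers share one chassis.
shared = {
    "controller_A": {"backplane", "power_supply", "cooling"},
    "controller_B": {"backplane", "power_supply", "cooling"},
}
# Arrangement with independent components for each controller.
independent = {
    "controller_A": {"backplane_A", "power_supply_A", "cooling_A"},
    "controller_B": {"backplane_B", "power_supply_B", "cooling_B"},
}

shared_spofs = single_points_of_failure(shared)
independent_spofs = single_points_of_failure(independent)
```

With shared components every item is a single point of failure, while the independent arrangement yields an empty set.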
The present invention ensures the function of a RAID 1-5 system despite any single point of failure.
U.S. Pat. No. 5,761,032 discloses a removable media library unit with a frame structure with modular housing. A robot inserts media into the library and removes media no longer needed. There is continual access to one or more good storage devices while one or more failed drives of the library are repaired.
U.S. Pat. No. 5,871,264 discloses a drawer type computer housing with two sliding rails attached to the housing.
U.S. Pat. No. 6,018,456 discloses an enclosure system having front and rear cages separated by a backplane. Connectors on either side of the backplane are used to connect trays containing drives in the front cage and sub-modules in the rear cage.
U.S. Pat. No. 6,025,989 discloses a modular node assembly for a rack mounted microprocessor computer. The assembly contains a power supply, fans, and removable chassis.
U.S. Pat. No. 6,061,250 discloses a full enclosure chassis system containing hot-pluggable circuit boards. A double height unit, such as a RAID controller, is combined with single height devices such as hard disk drives. The system allows the replacement of a controller circuit board without shutting down the system.
U.S. Pat. No. 6,097,604 discloses a carrier for installing electronic devices into an enclosure. An electronic device is attached to the carrier. Pushing the carrier into an enclosure causes metal surfaces on the carrier to be pushed outward contacting the enclosure side walls for electrical grounding.
U.S. Pat. No. 6,148,352 discloses a RAID system with provisions for adding a module or replacing a module without affecting host system access to existing online storage. Each storage module contains two sets of disk drives along with electronics for operating the disk drives. FIG. 10 shows storage systems with a power supply and a controller, in addition to the disks. In this system, one power supply serves one controller and 8 storage hard disk drives.
None of the prior art references provides the advantage of the present invention, namely the reliability of operation associated with an independent backplane board, cables, power supply and cooling system for each controller and for each rack of DASD. Conventional methods ensure function of RAID systems despite failure of a single controller or DASD. With the innovations of the present invention, RAID systems are disclosed which function despite any failure of controller, DASD, backplane, or cable. The RAID systems of this invention eliminate the sharing of backplanes by more than one controller or more than one channel of DASD.