The present invention relates to the field of fault tolerant computing systems, and more particularly to fault tolerant network servers having multiple processing elements that share common RAID systems for storing database data.
Like everything else made by Man, computer systems are known to cease functioning properly at times. Failed computing systems, including such transaction processing systems as bank database servers and airline reservation systems, are known to cause serious problems for the businesses that rely on them. There is therefore a strong market for failure tolerant computing systems and other devices, like UPS (Uninterruptible Power Supply) devices and backup generators, intended to minimize downtime for these businesses.
RAID (Redundant Array of Independent Disks) systems are known in the art of failure tolerant computing. In applications requiring fault tolerant computing, these systems frequently operate with several disk drives in RAID-1 (data mirroring) or RAID-5 (striping with distributed parity) mode. In either of these modes, it is possible for a database to be reconstructed after, or even to continue servicing transactions when, any one of the several disk drives in a RAID set has ceased to operate correctly.
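The single-drive recovery property of a parity-protected set can be illustrated with a minimal sketch. It assumes simple byte-wise XOR parity over one stripe; the function name `xor_blocks` and the sample block contents are hypothetical and are not taken from any particular RAID controller.

```python
from functools import reduce

def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe: three data blocks (hypothetical contents) plus one parity block.
data_blocks = [b"ACCT", b"0042", b"PAID"]
parity = xor_blocks(data_blocks)
stripe = data_blocks + [parity]

# Simulate the loss of the drive holding block 1, then rebuild that block
# from the surviving data blocks and the parity block.
lost_index = 1
survivors = [blk for i, blk in enumerate(stripe) if i != lost_index]
rebuilt = xor_blocks(survivors)
```

Because XOR is its own inverse, the same routine that generates parity also regenerates any one missing block, which is why a set of this kind tolerates exactly one failed drive.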
It is known that, through use of hot-pluggable disk drives in a shelf configured to receive them, it is possible to replace a failed drive of a SCSI-based RAID system with a spare drive without needing to shut down the system. Hot-pluggable drives are usually contained in drive cases having a connector configured such that the power and ground connections to a drive being inserted are made before the drive is connected to the data lines of a SCSI bus. Once the failed drive has been replaced, reconstruction of the failed drive's data onto the replacement drive can also proceed while the RAID system continues at least some level of data transfers to processor units. Once data reconstruction is complete, the RAID system once again becomes fault tolerant.
A shelf of disk drives, or a RAID controller, of a RAID system may be powered by multiple power supplies receiving power from multiple sources. This is known to allow continued operation of the shelf of drives or RAID controller when any one power supply or power source fails or suffers a transient. Such systems are available from many sources.
RAID controllers are special-purpose computing hardware that map disk-access requests into operations on the array of disks. RAID controllers typically also generate the redundant data for RAID-1 and RAID-5 disks, and regenerate disk data as necessary when a drive is replaced. While these functions can be performed in software on a host computer, offloading them into a RAID controller is often advantageous for system performance because of the resultant parallelism. COMPAQ sells, under the StorageWorks™ name (a trademark or registered trademark of COMPAQ in the United States and other countries), RAID controller systems wherein one or two RAID controllers receive power from a communal DC power bus, the power bus being driven from multiple power supplies receiving power from multiple sources. These RAID controllers are available with SCSI interfaces to the disk drive shelves and host computer system.
RAID controllers, as with the COMPAQ StorageWorks™ systems, contain memory for caching disk operations. This memory may be configured in either a write-through or a write-back configuration.
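The difference between the two cache configurations can be sketched as follows. This is an illustrative model only: the class `CachedDisk`, its methods, and the dictionary standing in for the disk array are hypothetical and do not describe the internals of any actual controller.

```python
class CachedDisk:
    """Toy model of a controller cache in front of a disk (hypothetical)."""

    def __init__(self, write_back):
        self.write_back = write_back
        self.disk = {}      # block number -> data committed to the drives
        self.cache = {}     # block number -> data held in controller memory
        self.dirty = set()  # blocks cached but not yet written to the drives

    def write(self, block, data):
        self.cache[block] = data
        if self.write_back:
            # Write-back: acknowledge now, commit to the drives later.
            self.dirty.add(block)
        else:
            # Write-through: commit to the drives before acknowledging.
            self.disk[block] = data

    def flush(self):
        """Commit every dirty block to the drives."""
        for block in sorted(self.dirty):
            self.disk[block] = self.cache[block]
        self.dirty.clear()


write_through = CachedDisk(write_back=False)
write_through.write(7, b"txn-1")   # reaches the drives immediately

write_back = CachedDisk(write_back=True)
write_back.write(7, b"txn-1")      # held only in cache until flushed
pending_before_flush = 7 not in write_back.disk
write_back.flush()
```

In this model a write-back cache holds data that exists nowhere else until `flush` runs, so that data is exposed to loss if the cache memory fails first; a write-through cache avoids that exposure at the cost of waiting for the drives on every write.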
The SCSI bus has several three-state data lines and several open-collector (or open-drain) control and data lines. The SCSI specification calls for the open-collector control lines to be terminated with pullups at each end of the bus. It is known that presence on a SCSI bus of typical, but unpowered, interfaces often draws at least some of these lines out of specification, especially if the unpowered interface is located at the end of the bus. Presence on a SCSI bus of such unpowered interfaces can therefore corrupt communications between operating interfaces.
It is known that system reliability may be enhanced by operating multiple processors in lockstep, with error detection circuitry used to detect any failed processor such that one or more remaining processors of the multiple processors continue execution. Multiple processors executing in lockstep are utilized in COMPAQ TANDEM fault-tolerant machines.
A hot-spare with failover technique may also provide a degree of fault tolerance. In this method, two or more processors are provided. Upon detection of an error or failure of one processor, a second processor, a hot or running spare, takes over the functions of the failed processor. The processor that serves as a hot-spare may also execute additional tasks, in which case a performance degradation may be observed when a processor fails.
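A minimal sketch of the hot-spare takeover described above follows. It assumes, purely for illustration, that failure is detected by a missed heartbeat; the `Processor` class, the `HEARTBEAT_TIMEOUT` value, and the task names are all hypothetical, and real systems may detect errors by other means.

```python
from dataclasses import dataclass, field

HEARTBEAT_TIMEOUT = 3.0  # seconds of silence treated as failure (assumed value)

@dataclass
class Processor:
    name: str
    tasks: list = field(default_factory=list)
    last_heartbeat: float = 0.0

def has_failed(proc, now):
    """Treat a processor as failed once its heartbeat is overdue."""
    return now - proc.last_heartbeat > HEARTBEAT_TIMEOUT

def fail_over(primary, spare, now):
    """Move the primary's tasks to the spare if the primary has failed."""
    if not has_failed(primary, now):
        return False
    spare.tasks.extend(primary.tasks)
    primary.tasks.clear()
    return True

primary = Processor("primary", ["database service"], last_heartbeat=10.0)
spare = Processor("spare", ["report generation"], last_heartbeat=14.0)

took_over_early = fail_over(primary, spare, now=12.0)  # 2.0 s silent: healthy
took_over_late = fail_over(primary, spare, now=15.0)   # 5.0 s silent: failed
```

After the takeover the spare carries both its own task and the failed processor's task, which is the source of the performance degradation noted above.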
Hot-spare with failover may also occur with processors sharing a database, as with processors operated in a “cluster” configuration. Clustered machines may have operating system software that redistributes tasks among remaining machines when a machine fails.
Most currently available RAID systems are sold separately from the processors they are used with. They therefore must be connected together in the field, where installation mistakes can be made. Such mistakes can include connection of both power connections of the computing units to a first power source, with connection of both power connections of a RAID system to a second power source, such that if either power source fails, the system ceases operation. Further, field installation is often conducted by better educated, and thus more expensive, employees than is factory assembly. Field labor also incurs much higher travel and hotel expenses than does factory labor. Installation accuracy can be improved and expense reduced by reducing the number of connections that must be made during field installation.
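The power-wiring mistake described above can be checked mechanically. The following sketch assumes a wiring plan expressed as a mapping from each unit to the power sources feeding its supplies; the function names, the unit names, and the source labels "A" and "B" are hypothetical.

```python
def survives(wiring, failed_source):
    """True if every unit still has at least one supply on a live source."""
    return all(
        any(source != failed_source for source in feeds)
        for feeds in wiring.values()
    )

def single_points_of_failure(wiring):
    """List the power sources whose loss alone would stop the system."""
    sources = {source for feeds in wiring.values() for source in feeds}
    return sorted(s for s in sources if not survives(wiring, s))

# The mistake: each unit has both of its supplies on the same source.
miswired = {"computing units": ["A", "A"], "RAID system": ["B", "B"]}

# The intended wiring: each unit draws from both sources.
correct = {"computing units": ["A", "B"], "RAID system": ["A", "B"]}

miswired_risks = single_points_of_failure(miswired)
correct_risks = single_points_of_failure(correct)
```

In the miswired plan the loss of either source takes down one of the two units, and therefore the system as a whole, even though every unit nominally has redundant supplies.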
A pair of computing elements are factory assembled into a network server, being slidably mounted on ball-bearing rails in a rack-mountable server cabinet. Also in the network server cabinet is a RAID disk-array subsystem containing a pair of RAID controllers, a pair of redundant power supplies, and a shelf holding six drives normally configured in RAID-5 mode. These drives may also be configured as a combination of just a bunch of disks (JBOD), RAID-0, RAID-1, RAID-4 and RAID-5 sets. The computing elements each contain most of the constituent components of a dual-processor computer, and are electrically connected to the RAID controllers through SCSI isolators, whereby a failed computing element may be disconnected from the RAID controllers while the computing element is repaired or replaced.
The computing elements communicate with the RAID controllers through SCSI isolators. These isolators prevent a failed computing element, especially a computing element with a failed power supply, from corrupting communications between an operating computing element and the RAID controllers.
The computing elements also communicate with each other over a cluster interconnect and with various other servers and workstations of a network via a network interface.