Basic RAID storage systems include an array of redundant disks and a controller that enables a server to perform I/O to the array transparently. A RAID system greatly improves storage system reliability because data can be stored on multiple disks within the array. A RAID system also reduces the cost of storage because the small disks composing the array are relatively inexpensive.
An improved RAID storage system includes dual controllers, each configured to access the same array of disks. The dual controllers improve access to the array because the controllers can simultaneously serve I/O requests from two servers. Moreover, a dual controller (duplex) system can offer even greater reliability than a single controller (simplex) system if each controller is configured to handle all I/O requests in the event that the other fails. This capability is called "transparent failover." Operation of one dual controller system is now described with reference to FIG. 1.
FIG. 1 shows a block diagram of a dual controller system 100 that includes two controllers 104-1, 104-2 and a disk array 106. The two controllers 104-1, 104-2 are coupled via a host bus 103 to one or more servers 102A, 102B. This configuration is common to prior art dual controller systems (e.g., the Mylex DAC960SX) and dual controller systems in which the present invention is implemented (e.g., the Mylex DAC960SF). The two controllers 104-1, 104-2 are coupled to the disk array 106 by a high speed bus 105. In the case of the DAC960SX both busses 103, 105 are SCSI busses and each controller 104 has its own SCSI ID. The controllers 104-1, 104-2 can operate in duplex mode (as a redundant pair of controllers) or in simplex mode (as independent controllers). When operating in duplex mode the controllers 104 communicate with each other using a communication signal 110 and a common reset signal (RSTCOM*) 112.
When configured as a redundant pair, both controllers 104 have access to the same disk drives 108 and both process host I/O. The communication signal 110 between the controllers keeps each informed that the other controller is operating normally. If the communication signal 110 is interrupted, the controller 104 that detects the interruption asserts the reset signal 112 to the other controller 104 (holding the failed controller in a hard reset) and starts processing I/O for both controllers 104. This "Fail Over" is transparent to the host computers 102 because the surviving controller 104 can respond to multiple target IDs on the host SCSI bus 103. Interruption of the communication signal 110 can result from a controller 104 being removed from the system or from the controller 104 experiencing a fault that causes it to lock up when some abnormal operation occurs.
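The Fail Over behavior described above can be illustrated with a minimal Python sketch. It is a model for exposition only, not the DAC960SX firmware: the `Controller` class, its fields, and the `fail_over` function are hypothetical names, and the communication signal 110 is reduced to a single boolean.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """Hypothetical model of one controller 104 in a duplex pair."""
    own_id: int                                      # SCSI target ID on host bus 103
    serving_ids: set = field(default_factory=set)    # target IDs currently answered
    heartbeat_ok: bool = True                        # stands in for communication signal 110
    held_in_reset: bool = False                      # True while reset signal 112 is asserted

    def __post_init__(self):
        # In normal operation a controller answers only its own target ID.
        self.serving_ids.add(self.own_id)


def fail_over(survivor: Controller, partner: Controller) -> bool:
    """If the partner's heartbeat is lost, hold it in reset and serve its ID too."""
    if partner.heartbeat_ok:
        return False                       # partner is healthy, nothing to do
    partner.held_in_reset = True           # assert reset signal 112
    partner.serving_ids.clear()            # the failed controller serves no I/O
    survivor.serving_ids.add(partner.own_id)  # respond to both target IDs
    return True


ctrl_a = Controller(own_id=0)
ctrl_b = Controller(own_id=1)
ctrl_b.heartbeat_ok = False                # simulate removal or lock-up
failed_over = fail_over(ctrl_a, ctrl_b)
print(failed_over, sorted(ctrl_a.serving_ids))
```

Because the survivor simply adds the failed controller's target ID to the set it answers, the hosts in this model see no change in addressing, which is the sense in which the Fail Over is transparent.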
When the failed controller 104 is replaced, the surviving controller 104 releases the reset signal 112 and allows the new controller to start. Once running, the new controller 104 establishes the communication signal 110 and determines the system (i.e., array 106) configuration using COD (Configuration On Disk) stored on the array 106. The new controller permanently stores the system configuration in on-board, non-volatile, random access memory (NVRAM). The surviving controller then initiates a "Fail Back" sequence to hand over the I/O processing to the new controller. Following the Fail Back sequence the system is back to Active/Active operation in which both controllers 104 actively handle I/O requests.
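The order of the replacement and Fail Back steps can be sketched as follows. This is an illustrative Python model under assumed names: the dictionary keys, the `fail_back` function, and the sample configuration contents are hypothetical and are not taken from the controller firmware.

```python
def fail_back(survivor: dict, replacement: dict, cod: dict) -> None:
    """Bring a replacement controller into service, following the steps in the text."""
    # 1. The survivor releases reset signal 112 so the new controller can start.
    replacement["held_in_reset"] = False
    # 2. The new controller establishes communication signal 110.
    replacement["heartbeat_ok"] = True
    # 3. It reads the configuration on disk (COD) and stores a copy in NVRAM.
    replacement["nvram_config"] = dict(cod)
    # 4. Fail Back: the survivor hands the replacement's target ID back.
    survivor["serving_ids"].discard(replacement["own_id"])
    replacement["serving_ids"] = {replacement["own_id"]}


# State after a Fail Over: the survivor answers both target IDs.
survivor = {"own_id": 0, "serving_ids": {0, 1}}
replacement = {"own_id": 1, "serving_ids": set(),
               "held_in_reset": True, "heartbeat_ok": False, "nvram_config": None}
cod = {"drives": 8, "raid_level": 5}       # hypothetical configuration contents

fail_back(survivor, replacement, cod)
print(survivor["serving_ids"], replacement["serving_ids"])
```

After the call each controller answers only its own target ID, which corresponds to the Active/Active state described above.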
At power up, each of the two controllers 104 verifies its own NVRAM-stored configuration against the disk configuration stored on the disk array 106. If a controller 104 detects a discrepancy, that controller saves the disk configuration onto its NVRAM, hard resets both itself and its partner (via assertion of the common reset signal 112) and then comes back up with the correct NVRAM configuration.
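The power-up check reduces to a comparison followed by a conditional update and reset. The Python sketch below illustrates that decision only; the function name `power_up_check`, its return convention, and the sample configuration contents are assumptions made for the example.

```python
def power_up_check(nvram_config: dict, disk_config: dict) -> tuple:
    """Compare the NVRAM copy with the configuration on disk.

    Returns (config_to_keep_in_nvram, hard_reset_needed). A True second
    element stands for asserting the common reset signal 112, which resets
    both this controller and its partner.
    """
    if nvram_config == disk_config:
        return nvram_config, False         # copies agree, continue booting
    # Discrepancy: the disk copy wins and is saved to NVRAM before the reset.
    return dict(disk_config), True


disk = {"drives": 8, "raid_level": 5}      # hypothetical configuration contents
stale = {"drives": 6, "raid_level": 5}

same_cfg, same_reset = power_up_check(dict(disk), disk)
new_cfg, needs_reset = power_up_check(stale, disk)
print(same_reset, needs_reset, new_cfg == disk)
```

In this model the configuration on disk is always treated as authoritative, which matches the text: a controller never writes its NVRAM copy back to the array.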
A hard reset operation places a controller in a clean state by resetting the controller's main CPU and its I/O processors, which implement the various communication protocols used by the controller to communicate with hosts 102 and the disk array 106. In the DAC960SX, the hard reset operation is implemented by reset circuitry within the controller 104 that activates a reset pulse coupled to the controller's main CPU and I/O processors. (The reset pulse is not shown in FIG. 1 because it is an internal signal.) It is important that the reset pulse remain active long enough to allow the CPU and the I/O processors to be completely reset.
The DAC960SX reset circuitry accomplishes this goal by delaying the active-to-inactive transition of the reset pulse using a fixed number of PLA (programmable logic array) gate delays. However, the resulting pulse width is likely to be highly variable depending on the PLA design rules. For example, smaller PLA gate geometries will reduce the gate delays. This variation could result in reset pulse widths that are too short to reset the processors. The DAC960SX's reset circuitry is also inflexible, being designed for a particular main CPU and set of I/O processors. As a result, a completely new PLA design would be required if a new I/O processor requiring a longer reset pulse were added to the controller.
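The two weaknesses of a gate-delay pulse can be made concrete with a small calculation. In the Python sketch below, every number (gate count, per-gate delays, and minimum reset widths) is a hypothetical value chosen for illustration; none comes from the DAC960SX or from any device data sheet.

```python
def pulse_width_ns(gate_count: int, gate_delay_ns: int) -> int:
    """Width of a reset pulse stretched by a fixed chain of gate delays."""
    return gate_count * gate_delay_ns


def resets_reliably(width_ns: int, required_widths_ns: list) -> bool:
    """The pulse must be at least as long as the most demanding device requires."""
    return width_ns >= max(required_widths_ns)


GATE_COUNT = 200                           # fixed by the PLA design (hypothetical)
required = [600, 800]                      # CPU and I/O processor minimums, ns (hypothetical)

original = pulse_width_ns(GATE_COUNT, 5)   # 5 ns per gate under the original design rules
shrunk = pulse_width_ns(GATE_COUNT, 3)     # 3 ns per gate with a smaller geometry

ok_original = resets_reliably(original, required)
ok_shrunk = resets_reliably(shrunk, required)
ok_new_part = resets_reliably(original, required + [1200])  # add a slower-resetting part

print(original, shrunk, ok_original, ok_shrunk, ok_new_part)
```

With these assumed values, the same gate count yields a 1000 ns pulse under one set of design rules and a 600 ns pulse under another, so a design that works in one process can fall below the 800 ns requirement in the next. Adding a device that needs 1200 ns fails even with the original delays, because the gate count cannot be changed without a new PLA design.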
Therefore, it would be desirable to provide reset circuitry for use in a dual active RAID controller to generate an internal reset pulse that reliably triggers a hard reset of the controller regardless of implementation details, such as different design rules used to implement the circuitry.
It would also be desirable to provide reset circuitry for use in a dual active RAID controller that can be easily modified to accommodate different required delay pulse widths.
It would also be desirable to provide reset circuitry for use in a dual active RAID controller that is compatible with other required reset operations, such as the generation of the common reset signal 112 and the handling of power status indicators requiring resetting of the controller.