Redundant array of independent disks (RAID) subsystems have been in use for a number of years. In fault tolerant RAID subsystems, the primary objective is not to prevent every type of fault from occurring but rather to continue operating correctly in the presence of a component fault. There are many different methods for achieving fault tolerance. However, even when designers have this objective clearly in view, it is often the case that the objective is not actually achieved.
For example, some faults are so large that the system must be completely halted (e.g., a fire). Others are fairly isolated but can potentially corrupt the user's data stored on the RAID subsystem. Once data is corrupted, it is generally undesirable to pass the corrupted data back to the host and advertise the data as being good. A system that is tolerant of all faults will not pass corrupted data back to the host.
In the past, fault tolerance was largely viewed as a vehicle to provide robustness and correctness of operation. Fault tolerance has become considerably more important as the demand for complete data availability increases to extreme levels. For example, some systems guarantee a maximum down time of only 5 minutes per year.
The storage subsystem is just one of many components in some large systems. For example, a RAID subsystem may be allocated only 1 minute of the total 5 minutes of yearly down time. Additionally, the subsystems within the RAID subsystems connected to this large system have to share that 1 minute. It is typically unacceptable to ever allow data to become unavailable from the RAID storage subsystem. Further, the restrictions related to loss of data availability are becoming dramatically more stringent over time.
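To put these down time budgets in perspective, they can be converted to availability percentages. The following is a minimal sketch only; the function name `availability_percent` and the assumption of a 365-day year are illustrative and are not part of any described system.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def availability_percent(downtime_minutes_per_year: float) -> float:
    """Return the availability (in percent) implied by a yearly down time budget."""
    return 100.0 * (1.0 - downtime_minutes_per_year / MINUTES_PER_YEAR)


# The 5-minute system budget and the 1-minute RAID allocation from the text.
for budget in (5.0, 1.0):
    print(f"{budget:.0f} min/year -> {availability_percent(budget):.5f}% available")
```

Under these assumptions, a 5 minute yearly budget corresponds to approximately 99.9990% availability, and a 1 minute allocation to approximately 99.9998%.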
In conventional arrangements, fault tolerance and continued operation could be provided by halting all operations in the system, initiating a subsystem-wide reset, reconfiguring the system to disable the failed component, and resuming operations after this “warm boot” operation. However, the time required to reboot the system (on the order of a few seconds) is long enough that the reboot strategy significantly impacts the data availability goals. Such delays may approach unacceptable periods of time.
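The impact of the reboot strategy can be illustrated with simple arithmetic: if each warm boot takes a few seconds, only a small number of them fit within a 1 minute yearly allocation. The sketch below is illustrative only; the function name `reboots_within_budget` and the specific reboot durations are hypothetical values chosen to represent "a few seconds".

```python
def reboots_within_budget(budget_seconds: float, reboot_seconds: float) -> int:
    """Return how many complete warm boots fit within a yearly down time budget."""
    if reboot_seconds <= 0:
        raise ValueError("reboot_seconds must be positive")
    return int(budget_seconds // reboot_seconds)


BUDGET_SECONDS = 60.0  # the 1 minute yearly allocation discussed above

# Hypothetical reboot durations standing in for "a few seconds".
for reboot_seconds in (2.0, 5.0, 10.0):
    count = reboots_within_budget(BUDGET_SECONDS, reboot_seconds)
    print(f"{reboot_seconds:.0f} s per warm boot -> at most {count} per year")
```

With a hypothetical 5 second warm boot, for instance, the entire yearly allocation would be consumed by twelve fault recoveries, before accounting for any other source of down time.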
Accordingly, there exists a need to provide improved fault tolerant data storage systems and methods of operating fault tolerant data storage systems.