The invention relates to fault tolerant data storage systems and methods of operating a fault tolerant data storage system.
Redundant array of independent disks (RAID) subsystems have been utilized for a number of years. In fault tolerant RAID subsystems, the primary objective of fault tolerance is not to prevent faults from occurring but rather to continue operating correctly in the presence of a component fault. There are many different methods for achieving this objective. However, even when designers have the objective clearly in view, it is often not actually achieved.
For example, some faults are so severe that the system must be completely halted (e.g., a fire). Others are fairly isolated but can potentially corrupt the user's data stored on the RAID subsystem. Once data is corrupted, it is undesirable to pass the corrupted data back to the host and advertise the data as being good. A system that is tolerant of all faults will not pass corrupted data back to the host.
In the past, fault tolerance was largely viewed as a vehicle to provide robustness and correctness of operation. Fault tolerance becomes even more important as the demand for complete data availability increases to extreme levels. For example, some systems guarantee a maximum down time of only 5 minutes per year.
The storage subsystem is just one of many components in some large systems. For example, a RAID subsystem may be allocated only 1 minute of the total 5 minutes of yearly down time. Additionally, the subsystems within the RAID subsystems connected to this large system have to share that 1 minute. It is therefore typically unacceptable to ever allow data to become unavailable from the RAID storage subsystem. Further, the restrictions related to loss of data availability are becoming dramatically tighter over time.
In conventional arrangements, one could provide fault tolerance and continued operation by halting all operations in the system, initiating a subsystem wide reset, reconfiguring the system to disable the failed component, and resuming operations after the "warm boot" operation. The time required to reboot the system is so long (on the order of a few seconds) that the data availability goals are significantly impacted by the reboot strategy. Such delays may approach unacceptable periods of time.
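By way of illustration only, the down time figures discussed above may be worked through as in the following sketch. The sketch assumes a 365-day year and a hypothetical warm boot time of 5 seconds (the text above states only "a few seconds"); the function names are illustrative and are not part of any described system.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def availability(downtime_minutes_per_year: float) -> float:
    """Return the fraction of the year during which data is available."""
    return 1.0 - downtime_minutes_per_year / MINUTES_PER_YEAR


def reboots_within_budget(budget_seconds: float, reboot_seconds: float) -> int:
    """Return how many whole reboots fit inside a yearly down time budget."""
    return int(budget_seconds // reboot_seconds)


# Whole system: 5 minutes of down time per year (figure from the text).
system_availability = availability(5.0)

# RAID subsystem: 1 minute of that budget (figure from the text).
raid_availability = availability(1.0)

# ASSUMPTION: 5 seconds per warm boot, chosen only for illustration.
ASSUMED_REBOOT_SECONDS = 5.0
max_reboots = reboots_within_budget(60.0, ASSUMED_REBOOT_SECONDS)

print(f"system availability: {system_availability:.6%}")
print(f"RAID availability:   {raid_availability:.6%}")
print(f"warm boots per year within a 1-minute budget: {max_reboots}")
```

Under these assumptions, a 1-minute yearly allocation accommodates at most 12 warm boots, before accounting for any other source of down time, which suggests why a reboot strategy can significantly impact the data availability goals.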
Accordingly, there exists a need to provide improved fault tolerant data storage systems and methods of operating fault tolerant data storage systems.
The invention provides fault tolerant data storage systems and methods of operating a fault tolerant data storage system.
In one aspect of the invention, a fault tolerant data storage system comprises: a plurality of coupled components individually including: an interface adapted to couple with a data connection and to selectively receive a plurality of transactions from the data connection; transaction processing circuitry coupled with the interface and configured to process transactions received from the interface; and analysis circuitry configured to detect error conditions within the transactions and to prevent entry of transactions individually including an error condition into the respective component responsive to the detection.
In another aspect of the invention, a method of operating a fault tolerant data storage system comprises: providing a fault tolerant data storage system including a plurality of components configured to process transactions; providing the transactions for communication to respective components; detecting error conditions within the transactions; and preventing entry of transactions which individually include an error condition into respective components responsive to the detecting.
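The detecting and preventing steps of the foregoing aspects may be pictured, by way of example only, with the following software sketch. It is not the circuitry of the invention: the `Transaction` and `Component` classes are hypothetical, and a CRC-32 mismatch is assumed as the error condition purely for illustration, since the text above does not limit the type of error condition detected.

```python
import zlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Transaction:
    """Hypothetical transaction: a payload plus the checksum sent with it."""
    source: str
    payload: bytes
    checksum: int


def make_transaction(source: str, payload: bytes) -> Transaction:
    """Build a well-formed transaction with a matching CRC-32 checksum."""
    return Transaction(source, payload, zlib.crc32(payload))


def has_error(txn: Transaction) -> bool:
    """Analysis step. ASSUMPTION: an error condition is a CRC-32 mismatch."""
    return zlib.crc32(txn.payload) != txn.checksum


@dataclass
class Component:
    """Hypothetical component with an interface and transaction processing."""
    name: str
    processed: list = field(default_factory=list)
    rejected: list = field(default_factory=list)

    def receive(self, txn: Transaction) -> bool:
        """Interface: admit a transaction only if no error is detected."""
        if has_error(txn):
            # The faulty transaction never reaches the processing stage.
            self.rejected.append(txn)
            return False
        self.processed.append(txn.payload)
        return True


controller = Component("controller_b")
good = make_transaction("controller_a", b"write block 7")
# Simulate corruption in transit: payload altered, checksum left unchanged.
corrupted = Transaction(good.source, b"write block 8", good.checksum)

print(controller.receive(good))       # admitted
print(controller.receive(corrupted))  # refused at the interface
```

In this sketch the check is performed before the transaction is handed to processing, so a transaction including an error condition is refused at the boundary of the component rather than being discovered after it has been acted upon.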
Another aspect of the invention provides a method of operating a fault tolerant data storage system comprising: providing a fault tolerant data storage system including a plurality of coupled components configured to process transactions; communicating transactions intermediate coupled components; detecting an error condition within one of the transactions; and isolating the component which outputted the transaction including the error condition responsive to the detecting.
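The isolating step of this aspect may likewise be illustrated, as a sketch under stated assumptions only. The `StorageSystem` class, its method names, and the use of a CRC-32 mismatch as the error condition are all hypothetical; the sketch merely shows the component which outputted a faulty transaction being isolated while the remaining components continue to exchange transactions without a system wide reset.

```python
import zlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Transaction:
    """Hypothetical transaction passed between two coupled components."""
    source: str
    dest: str
    payload: bytes
    checksum: int


@dataclass
class StorageSystem:
    """Hypothetical model of coupled components exchanging transactions."""
    components: set
    isolated: set = field(default_factory=set)
    delivered: list = field(default_factory=list)

    def communicate(self, txn: Transaction) -> bool:
        """Deliver a transaction, isolating its source if an error is found."""
        # Traffic to or from an already isolated component is refused.
        if txn.source in self.isolated or txn.dest in self.isolated:
            return False
        # ASSUMPTION: an error condition is modeled as a CRC-32 mismatch.
        if zlib.crc32(txn.payload) != txn.checksum:
            self.isolated.add(txn.source)  # isolate the outputting component
            return False
        self.delivered.append((txn.dest, txn.payload))
        return True

    def active_components(self) -> set:
        """Components still participating after any isolation."""
        return self.components - self.isolated


system = StorageSystem({"ctrl_a", "ctrl_b", "ctrl_c"})
ok = Transaction("ctrl_a", "ctrl_b", b"read 42", zlib.crc32(b"read 42"))
# Deliberately wrong checksum (one bit flipped) to model a faulty sender.
bad = Transaction("ctrl_c", "ctrl_b", b"read 43", zlib.crc32(b"read 43") ^ 1)

system.communicate(bad)         # error detected, ctrl_c is isolated
print(system.communicate(ok))   # other components keep operating
print(sorted(system.active_components()))
```

Because only the offending component is removed from service in this sketch, traffic between the remaining components proceeds uninterrupted, in contrast to the halt, reset, and warm boot sequence of conventional arrangements.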