RAID (redundant array of independent disks) arrays are a type of fault tolerant storage technology developed in response to an ever-increasing need for uninterrupted availability of data. RAID arrays are designed to provide continuous access to large amounts of data stored on computer systems. Such disk arrays typically include redundant components such as controllers and power supplies, and they provide hot-swap capabilities for modules (i.e., the ability to replace modules without powering down the system). Because the loss of access to data translates directly into lost productivity, such fault tolerant storage systems often play a critical role in the success of many business operations.
A significant challenge in the development of fault tolerant systems is validating the fault tolerance mechanisms designed into the systems. Firmware written for embedded controllers in such systems has significant portions of code dedicated to handling various fault scenarios. However, the firmware code written to handle faults is typically much harder to evaluate than the firmware code written to support normal system operations. While normal system operations can be evaluated through the use of benchmark programs, a system's tolerance to faults cannot. In addition, designers do not have the luxury of allowing a system to run for many years in order to observe its behavior under various fault events that may or may not occur. The generally accepted solution to this problem, therefore, is to inject faults into the system and to observe the behavior of the system under the injected faults.
There are, however, numerous difficulties and disadvantages with conventional methods of implementing fault/error injection. For example, in a fault tolerant storage system such as a disk array, mimicking errors in system memories requires accessing the memory to corrupt the memory content. This is typically accomplished by embedding some temporary test code in the firmware to purposely corrupt the memory. However, the embedded test code modifies the firmware code paths and has a fixed timing relationship between the error occurrence and code execution. By contrast, real errors occur with a more random distribution. In addition, this method does not allow for a full test of the “final code”, because the test code must be removed before the system is shipped to a customer.
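The timing limitation described above can be illustrated with a minimal simulation. This is only a sketch under stated assumptions, not firmware from any actual disk array: the memory size, loop length, hook position, and all function names (`flip_bit`, `run_with_embedded_hook`, `run_with_random_fault`, `injection_steps`) are hypothetical and chosen purely for illustration.

```python
import random

MEMORY_SIZE = 16   # hypothetical size of the simulated memory, in bytes
LOOP_STEPS = 100   # hypothetical number of firmware steps per run
HOOK_STEP = 42     # hypothetical step at which the embedded test code runs


def flip_bit(memory, address, bit):
    """Corrupt one byte in place by inverting a single bit."""
    memory[address] ^= 1 << bit


def run_with_embedded_hook():
    """Firmware loop with temporary test code compiled in.

    Returns the step at which the corruption occurred.
    """
    memory = bytearray(MEMORY_SIZE)
    injected_at = None
    for step in range(LOOP_STEPS):
        # ... normal firmware work would happen here ...
        if step == HOOK_STEP:  # extra branch present only in test builds
            flip_bit(memory, address=3, bit=0)
            injected_at = step
    return injected_at


def run_with_random_fault(rng):
    """Firmware loop in which the fault arrives at an arbitrary step.

    Returns the step at which the corruption occurred.
    """
    memory = bytearray(MEMORY_SIZE)
    fault_step = rng.randrange(LOOP_STEPS)
    address = rng.randrange(MEMORY_SIZE)
    bit = rng.randrange(8)
    injected_at = None
    for step in range(LOOP_STEPS):
        if step == fault_step:  # models an external event, not firmware code
            flip_bit(memory, address, bit)
            injected_at = step
        # ... normal firmware work would happen here ...
    return injected_at


def injection_steps(runs=1000, seed=0):
    """Collect the distinct injection steps seen under each approach."""
    rng = random.Random(seed)
    hook_steps = {run_with_embedded_hook() for _ in range(runs)}
    random_steps = {run_with_random_fault(rng) for _ in range(runs)}
    return hook_steps, random_steps
```

In this sketch the embedded hook always corrupts the same location at the same step, so repeated runs exercise only one relationship between the error and the executing code, whereas the randomly timed fault lands on many different steps and addresses.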
Another typical method of implementing fault/error injection in a fault tolerant system is to connect special fault generation circuitry to the system. For example, system controller boards can be modified into test boards that include circuitry that permits the grounding of particular inputs and outputs on the controller when a switch is thrown. However, these testing modifications are typically intrusive and often result in degraded performance of the controller boards. For example, point-to-point, high-speed data interfaces on controller boards can be disrupted by the addition of small test pads. In addition, because such modifications are extensive, there is typically a tradeoff between the number of test boards that can feasibly be produced and the number of error types that can be injected. As a result, it is virtually impossible to exercise the hundreds or thousands of fault paths that exist within the firmware on most fault tolerant systems.
Modifying system controller boards into test boards for fault/error injection testing is also problematic from a project development perspective. For example, any given project will have some number of controllers at various levels of revision. Therefore, modifications have to be re-applied for each new version of controller hardware. Because the modifications are unique to each revision, the significant effort required to automate testing for each modified revision is typically not undertaken, and the degree of error injection test coverage is significantly reduced.
Accordingly, a need exists for a way to verify the fault tolerance of fault tolerant systems that does not disturb the normal execution of firmware or otherwise degrade the performance of controllers within such systems.