After building a computer system, a system designer or software programmer will usually test the system to determine whether it responds appropriately. Generally, a designer may generate errors within the system to test the software and verify that all of the fault cases are covered and that the system handles exceptions correctly. As software and electronic systems continue to become more complex, however, it becomes more difficult to test the system and software for errors and the system's responses to those errors.
As an example, FIG. 1 illustrates a representative system 100 of a conventional enclosure 104. A host bus adapter (HBA) or IOC 102 may be connected to a disk enclosure 104 by a Fibre Channel (FC) link 106, a SAS link, or a link using another communication protocol. A series of drives 108 may be attached to the enclosure 104. Within the enclosure 104, a microprocessor 110 generally controls a switch 112 to connect the various attached devices, such as other HBAs 102 and drives 108. The microprocessor 110 may include flash memory 114, RAM 116, and a management interface 118. The management interface 118 may be, for example, Ethernet, RS-232, SCSI Enclosure Services (SES) over FC or SAS, or a virtual port.
In operation, the microprocessor 110 controls the switch 112 to make the correct connections. When an error occurs at the lowest level, a message may be sent up the chain of command from the switch 112, to the microprocessor 110, to the management interface 118, and to the controlling device. That controlling device then determines the next action and sends the appropriate signals back down the chain of command. For example, the microprocessor may report to a redundant array of independent disks (RAID) controller that a disk has failed. The RAID controller then sends the appropriate controls to rebuild the RAID set to recover from the failed disk. The higher controlling device may be an end user, such as the programmer, a RAID head, or a mainframe computer. A system designer attempts to simulate all possible errors the system may encounter during actual run-time in order to verify that the system responds correctly.
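The escalation path described above can be modeled in a few lines. The following is a minimal sketch for illustration only, written in Python for brevity: the function names, the message fields, and the `rebuild_raid_set` action label are hypothetical and do not correspond to any actual enclosure firmware or RAID controller interface.

```python
# Illustrative model only: all names and message fields are assumptions.

def switch_detect_failure(drive_id):
    """Lowest level: the switch notices that a drive has failed."""
    return {"event": "drive_failed", "drive": drive_id, "path": ["switch"]}

def microprocessor_report(message):
    """The enclosure microprocessor passes the event up the chain."""
    return {**message, "path": message["path"] + ["microprocessor"]}

def management_interface_forward(message):
    """The management interface delivers the event to the controlling device."""
    return {**message, "path": message["path"] + ["management_interface"]}

def raid_controller_decide(message):
    """The controlling device chooses the next action to send back down."""
    if message["event"] == "drive_failed":
        return {"action": "rebuild_raid_set", "exclude_drive": message["drive"]}
    return {"action": "none"}

# A failure on (hypothetical) drive 5 travels up the chain; a command comes back.
message = management_interface_forward(
    microprocessor_report(switch_detect_failure(drive_id=5))
)
command = raid_controller_decide(message)
print(message["path"])
print(command)
```

In this model, verifying the system means checking that every event the lowest level can raise produces the intended command at the top, which is why each possible error must be simulated.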
A large system may include dozens of enclosures, multiple RAID heads, and different computers all interacting together. These systems can become very complex. To validate the system, a system designer should ideally attempt to simulate any error that could possibly happen. If an error is not simulated, then that error may cause unanticipated or unexpected results when it occurs in actual performance.
Generally, a software programmer or system designer may wish to test the responses of the upper management system, or the entire system, to determine how the system responds to various errors. To physically cause the errors within a system would be costly, and therefore a programmer may wish to simulate the errors to determine if the system responds appropriately. There are currently two major mechanisms implemented to simulate errors in systems: (1) insertion of test circuitry into the hardware such as an application specific integrated circuit (ASIC); and (2) firmware interception and modification of hardware or ASIC responses.
One option available to a system designer is to connect test devices to the drive ports of the system enclosure. These test devices simulate typical traffic across the system and can also introduce certain errors into the traffic. However, this requires additional hardware. The test equipment would have to be able to trigger the desired errors and be fast enough to simulate the actual run-time errors encountered during the actual performance of the system. The test equipment is generally expensive and bulky and may not be available to some companies. Testing each possible error for each possible device drive by physically attaching test equipment can be prohibitive in both time and cost. Therefore, this option is generally not available to most designers.
The system designer may alternatively design the ASIC to generate faults and corrupt data under test. The ASIC can be designed with additional hardware and software to generate test errors. However, given the constantly increasing complexity of ASICs, inserting enough test circuitry to cover a significant number of error cases can become a major design effort, rivaling the complexity of the original design. Since the design is within the ASIC, the errors that can be generated are limited to those that were designed into the chip. Therefore, the test designer must be extremely knowledgeable about the details of the design and the impact that any low-level error has on the larger system in order to foresee and incorporate any desired test errors into the ASIC design. Using this methodology, a significant portion of the design implementation may be fault insertion circuitry. Not only does the additional circuitry increase die size and associated cost, but the additional logic also increases the probability of soft errors and the power consumption of the device.
The final alternative is to provide software within the upper management system, such as the microprocessor, to simulate error reads. Software is used to modify the data read from the chip, and the microprocessor sends the modified data to the upper levels of management. The system is run in a “test mode,” alerting the microprocessor that the modified data should be used instead of the actual data read from the ASIC. However, the additional software requires processing time, which slows the performance of the entire system, even when it is not being tested. Modifying the status information read from the ASIC in firmware modifies the behavior of the code being tested and/or introduces overhead into all operations performed by the firmware. The firmware must read the status from the ASIC, determine if it is in test mode, and then branch, either fetching data to modify the original status or performing the normal operation on the data. In real-time systems, the overhead of servicing test modes could result in significant differences from normal run-time operations, possibly causing abnormal behavior that does not exist in the real code or masking issues that do exist. Additionally, it is difficult to separate the effects of the function initiating the test from the run-time firmware's response.
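The read, check, and branch sequence described above can be sketched as follows. This is a simplified model under stated assumptions, written in Python for brevity (real enclosure firmware would more likely be written in C): the class names, the per-port override table, and the status codes `0x00` and `0x10` are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class FakeAsic:
    """Hypothetical stand-in for the ASIC's per-port status registers."""
    status: dict = field(default_factory=dict)

    def read_status(self, port: int) -> int:
        # 0x00 is assumed here to mean "no error".
        return self.status.get(port, 0x00)

@dataclass
class Firmware:
    asic: FakeAsic
    test_mode: bool = False
    # Hypothetical table of injected status values, keyed by port number.
    overrides: dict = field(default_factory=dict)

    def get_status(self, port: int) -> int:
        raw = self.asic.read_status(port)   # 1. always read the real status
        if self.test_mode:                  # 2. this check runs on every call
            # 3a. test path: substitute the modified status if one is set
            return self.overrides.get(port, raw)
        return raw                          # 3b. normal path

fw = Firmware(asic=FakeAsic())
normal = fw.get_status(3)       # normal operation, no error reported

fw.test_mode = True
fw.overrides[3] = 0x10          # assumed code for a simulated drive fault
injected = fw.get_status(3)     # upper layers now see the injected fault
untouched = fw.get_status(4)    # ports without an override are unaffected
print(hex(normal), hex(injected), hex(untouched))
```

Note that the `test_mode` check sits on the path of every status read, including reads made during normal operation. That extra branch in the run-time code is the source of the overhead and behavioral differences described above.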