This invention relates generally to the manufacture and testing of electromechanical or electronic storage systems and devices and more specifically to burn-in testing of such devices.
Manufacturers of storage systems such as Redundant Arrays of Inexpensive Disks (RAID) systems containing high-speed mass storage devices or manufacturers of other storage devices such as magnetic tape drive systems know that such devices are subject to infant mortality rates caused by latent defects in the devices. Devices that survive their early hours of operation are more likely to have a long and useful product life. Most of the failing devices can be detected by performing some method of burn-in testing to exercise the devices strenuously over a prolonged period of time in ways that stress the limitations of the systems. Ideally, these tests determine the validity of each of the physical drives in a system, and test the hardware, firmware, and software of the storage system's controller. Burn-in tests are usually more rigorous than ordinary diagnostic tests, and, if possible, attempt to create or simulate the conditions that are likely to induce errors. Defective devices detected in this manner can either be repaired, scrapped or have defective sectors or tracks marked accordingly.
In traditional manufacture of electromechanical magnetic media storage systems, one or more disk or tape drives are connected by an internal bus to a controller, which, in turn, is connected to a host computer by means of an input/output bus or channel. For disk or direct access storage device systems, commands are sent from the host computer to the controller using an intelligent interface or protocol, such as SCSI, IPI, or another industry standard, to effect a read or a write of data on the disk. A disk that is designed for the SCSI protocol will be connected to a controller capable of using the SCSI protocol, which, in turn, will use that protocol to communicate over the bus with the host computer. Most disk and tape controllers contain microprocessors and memory which are used to store and execute these commands.
A dedicated host computer, such as a mini-computer, personal computer or microprocessor, is used in burn-in testing to send a series of read, write and compare commands, in a specific protocol, to the controllers of the systems being tested. For random access devices, these are issued both as sequential reads and writes and as random reads and writes. These tests are often coupled with tests that attempt to saturate the device with data or commands, to put stress on the mechanical parts of the system, as well as on the electronics and software.
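The write, read and compare sequence described above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory `FakeDevice` in place of a real drive and controller, an assumed 512-byte block size, and a seeded pattern generator; none of these names come from the text, and a real host would issue the commands over SCSI, IPI or another protocol.

```python
import random

BLOCK_SIZE = 512  # assumed block size in bytes, for illustration only


class FakeDevice:
    """Hypothetical in-memory stand-in for a drive behind a controller."""

    def __init__(self, num_blocks):
        self.blocks = [bytes(BLOCK_SIZE)] * num_blocks

    def write(self, lba, data):
        self.blocks[lba] = data

    def read(self, lba):
        return self.blocks[lba]


def pattern_for(lba, seed):
    """Deterministic test pattern, so the expected data can be regenerated."""
    rng = random.Random(lba * 1_000_003 + seed)
    return bytes(rng.randrange(256) for _ in range(BLOCK_SIZE))


def write_read_compare(device, lbas, seed):
    """Write a pattern to each block, read it back, and list miscompares."""
    failures = []
    for lba in lbas:
        expected = pattern_for(lba, seed)
        device.write(lba, expected)
        if device.read(lba) != expected:
            failures.append(lba)
    return failures


def run_pass(device, num_blocks, seed=0):
    """One pass: the same blocks in sequential order, then in random order."""
    sequential = list(range(num_blocks))
    shuffled = sequential[:]
    random.Random(seed).shuffle(shuffled)
    return {
        "sequential": write_read_compare(device, sequential, seed),
        "random": write_read_compare(device, shuffled, seed + 1),
    }
```

On a healthy device both lists come back empty. A block address appearing in either list marks a miscompare, which a burn-in procedure might log for repair or for marking the sector as defective.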
As storage systems have become faster and more complex, burn-in testing has become more difficult to do in a cost-effective manner. A storage system that uses the SCSI protocol needs to be tested with commands that conform to that protocol, while systems that use IPI, for example, or some other protocol, will need a different test setup.
The storage capacity of systems has increased greatly, so that storage systems capable of reading and writing billions of bytes of data in files are commonplace. With this increase in capacity, reliability becomes more important. RAID systems were developed to provide reliable storage affordably, by using ordinary inexpensive disks, combined with sophisticated hardware and software techniques to record or stripe data or parity or both over several different physical drives in a subsystem.
RAID systems are designed so that if a physical track or drive fails, the data can be reconstructed from the data and parity information on the other drives. This greatly increases the complexity of the hardware, firmware and software used in controllers of storage systems that incorporate RAID. Several different types of RAID systems exist, each having performance, reliability and cost characteristics suited to differing file sizes and applications.
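As an illustration of the reconstruction described above, the following sketch assumes simple XOR parity over equally sized blocks, which is one common parity scheme and not necessarily the one used by any particular controller. The function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by one XOR parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def reconstruct(stripe, failed_index):
    """Rebuild the block at failed_index from the surviving blocks."""
    survivors = [blk for i, blk in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)


# Usage: lose any single block of the stripe and rebuild it from the rest.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
stripe = make_stripe(data)
rebuilt = reconstruct(stripe, failed_index=1)
```

Because XOR is its own inverse, the same routine rebuilds either a lost data block or a lost parity block, provided only one block in the stripe is missing.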
A storage system using RAID 3, therefore, needs different burn-in test configurations from one using RAID 5. A RAID 3 system, for example, usually stripes data across a number of physical drives one bit at a time and writes parity data, used to reconstruct lost data, to a single extra parity drive. A RAID 5 storage system, by contrast, usually stripes data and parity across a number of drives. Thus, the number of seeks or movements of the disk read/write heads may vary considerably between these two types of RAID for the same size file or total number of reads/writes. A test that might be a severe exercise for disk arms and heads in one kind of RAID system may have much less impact on another.
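To make the contrast concrete, the sketch below counts which drive receives the parity write for each stripe under two assumed layouts: a dedicated parity drive, as described for RAID 3, and a simple rotating parity position, as one possible RAID 5 arrangement. Actual layouts vary between controllers, and the function names are hypothetical.

```python
def parity_drive_dedicated(stripe_index, num_drives):
    """RAID 3 style: parity always goes to the last (extra) drive."""
    return num_drives - 1


def parity_drive_rotating(stripe_index, num_drives):
    """RAID 5 style (assumed rotation): parity moves one drive per stripe."""
    return (num_drives - 1 - stripe_index) % num_drives


def parity_writes_per_drive(parity_fn, num_stripes, num_drives):
    """Count how many parity writes land on each drive."""
    counts = [0] * num_drives
    for stripe in range(num_stripes):
        counts[parity_fn(stripe, num_drives)] += 1
    return counts


dedicated = parity_writes_per_drive(parity_drive_dedicated, 10, 5)
rotating = parity_writes_per_drive(parity_drive_rotating, 10, 5)
# dedicated -> [0, 0, 0, 0, 10]
# rotating  -> [2, 2, 2, 2, 2]
```

Under these assumptions the same ten stripe writes load a single drive in one layout and spread evenly in the other, which is one reason a test that is severe for one RAID type may exercise another far less.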
In conventional burn-in testing, it has not been feasible to test a number of aspects of RAID systems, simply because the RAID hardware and software built into the storage system's controller may correct or prevent the occurrence of the situations one may want to test. For example, some RAID systems are designed to recover from certain types of file corruption such as data corruption or parity corruption. Using a conventional system, it is difficult to test these features in action on a specific drive. A host computer cannot really create a corrupted data block to send to a RAID storage system, since the host's own error correction features usually will not permit it. If the host could somehow create and send a corrupt data block, it would probably violate the protocol being used by the system, and would thus be rejected by the bus or the controller before anything was written out to a disk. Similarly, it is difficult to gain access to the parity sectors of a RAID storage system through the host computer interface. Consequently, it is also difficult to intentionally corrupt parity on a disk during burn-in, to verify the controller's RAID recovery features. For storage systems that have a second interface, such as an engineering interface, which permits the user to bypass the normal control of the storage system, corrupt data or parity blocks might be entered manually by an engineer or test technician, which would be inordinately expensive, or by a programmed computer. For a computer to replace the technician for this purpose, a dedicated computer for each system under test would be needed. Even so, certain aspects of a RAID system, such as whole disk crashes or failures, may still be difficult or impossible to test in this way.
Different combinations of protocols and RAID or non-RAID types of systems also arise.
While one computer might be able to store and control and operate many different types of tests for different storage systems, the effectiveness of data and command saturation tests, and many other types of stress tests is directly proportional to how fast the system under test can be made to operate. If the host computer used for testing is handling too many storage systems, or too many different protocols, or even if the storage system being tested is a large one with several disk drives, speed suffers.
Most storage systems today include several types of internal memory within a controller, including fast cache memory to serve as a buffer for blocks of data or commands going to and from individual disks. These memories are selected to be fast enough to keep up with or stay ahead of the commands from the host. Thus, typical host computer interfaces cannot issue commands to a controller fast enough to cause such cache memories to fill. As a result, cache saturation conditions are difficult to induce.
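The reasoning above can be sketched with a simple tick-based model of a cache that receives blocks at one rate and drains them to disk at another. The rates, capacity and function name are illustrative assumptions, not measurements of any real controller.

```python
def peak_cache_occupancy(arrival_per_tick, drain_per_tick, capacity, ticks):
    """Return the highest number of blocks held in the cache at any tick."""
    occupancy = 0
    peak = 0
    for _ in range(ticks):
        # Blocks arrive from the command source, limited by cache capacity.
        occupancy = min(capacity, occupancy + arrival_per_tick)
        peak = max(peak, occupancy)
        # The controller then drains blocks out to the disks.
        occupancy = max(0, occupancy - drain_per_tick)
    return peak


# A source slower than the drain rate never accumulates a backlog.
slow_source = peak_cache_occupancy(4, 5, capacity=64, ticks=1000)  # -> 4
# A source faster than the drain rate eventually fills the cache.
fast_source = peak_cache_occupancy(8, 5, capacity=64, ticks=1000)  # -> 64
```

In this model the cache reaches capacity only when commands arrive faster than they can be drained, which is the condition a typical host interface is unable to create.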
In disk manufacturing, it is known that some types of errors occur as a result of degradation over time. That is, intermittent errors may occur first, before a solid failure occurs. In burn-in testing, therefore, the more thoroughly a device can be exercised and stressed to its limits, the more likely it is that problems can be found before a product is shipped to a customer. The more testing that can be accomplished in a given time period, the better the result.
For this reason, disk manufacturers often perform burn-in testing of storage systems by dedicating a host computer or host computer emulator to each system to improve the speed with which each system is tested. Many of the larger storage systems are designed to be used by more powerful and expensive mainframe or mid-range computers. Using such expensive systems in burn-in testing is too costly, so many manufacturers use inexpensive personal computers to emulate the large hosts.
Each host computer emulator must be properly configured for a given storage system protocol (SCSI, IPI, other) and type (RAID 5, RAID 3, non-RAID) for disks. If the host is emulating a particular mid-range or mainframe system, additional hardware and software upgrades may be needed for the host computer emulator.
Since burn-in tests are usually done over three to four days, this approach is still costly in its use of equipment and software and also creates a bottleneck in the manufacturing cycle. Each host computer emulator requires individual configuring and monitoring by someone, so labor costs tend to increase. Minicomputer or midrange systems are available that can test up to 8 storage systems at once, but these are expensive computers. Using the less expensive personal computers, a PC has to be dedicated to each system under test to get sufficient speed. So, to do burn-in testing of 50 new systems, from 7 to 50 host computer emulators are needed. Even with relatively inexpensive personal computers, this is costly both in equipment and in the space occupied by the equipment. Changing product lines or adding new protocols may require further upgrades to many or all of the host computer emulators.
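The range of 7 to 50 emulators follows from the figures given above: a midrange host handling up to 8 systems at once versus a personal computer dedicated to each system. A short sketch of the arithmetic, with a hypothetical function name:

```python
import math


def hosts_needed(systems_under_test, systems_per_host):
    """Smallest number of hosts that covers every system under test."""
    return math.ceil(systems_under_test / systems_per_host)


midrange_hosts = hosts_needed(50, 8)  # -> 7
pc_hosts = hosts_needed(50, 1)        # -> 50
```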
In many cases, even the use of a dedicated midrange or mainframe computer may not sufficiently tax the storage system's components. While more tests can be run in a given time in this way, controller cache memories, for example, are still not likely to be saturated, simply because they are designed to keep up with the host.
It is an object of the present invention to provide a method and apparatus for burn-in testing of storage systems that eliminates the need for a dedicated host computer or host computer emulator by making the storage system self-testing.
It is another object of the present invention to provide a method and apparatus for self-testing storage systems that enables conventional tests to be run faster or more effectively or both.
It is a further object of the present invention to simplify the testing of multiple protocols and types of systems, eliminating or greatly reducing additional hardware and software upgrades and costs.
It is another object of the present invention to permit testing of RAID disk storage systems, including testing of differing types of RAID recovery hardware, software, and firmware.
Yet another object of the invention is to be adaptable to new products and types of systems, as new tests for these are devised.