The invention relates generally to disk drive systems, and in particular, to the performance and maintenance of large scale disk drive systems.
Disk drive systems have grown enormously in both size and sophistication in recent years. These systems typically include many large disk drive units controlled by a complex multi-tasking disk drive controller such as the EMC Symmetrix disk drive controller. A large scale disk drive system can typically receive commands from a number of host computers and can control a large number of disk drive mass storage units, each mass storage unit capable of storing in excess of several gigabytes of data. There is every reason to expect that both the sophistication and the size of disk drive systems will continue to increase.
As the systems grow in complexity, so too does the user's reliance upon the system for fast and reliable recovery and storage of data. Thus, it is more than a mere inconvenience to the user should the disk drive system go "down" or off-line; and even should only one disk drive go off-line, substantial interruption to the operation of the entire system can occur. For example, a disk drive storage unit may be part of a RAID array or may be part of a mirrored system. The resulting lost time can adversely affect system throughput and perceived reliability. This is true even for normally scheduled maintenance wherein, with advance warning to the user, one or more disk drives can be placed off-line for a period of time.
Many disk drive systems, such as the EMC Symmetrix disk drive system, rely upon large standardized buses to connect the host computer and the controller, and to connect the controller and the disk drive elements. Periodically, however, the protocol of the system bus must be upgraded to implement performance improvements, to fix discovered programming errors, and for other normal maintenance reasons. The effect of reprogramming the disk drive communications, for example those carried over a SCSI bus, can be significant. Taking the drive off-line, loading the new SCSI code into it, and then bringing the drive back on-line can take substantial time. During this time, the drive is effectively isolated and unavailable for any other purpose. The result can be a significant disruption to the normal operation and performance of the overall computer system.
In other instances, it is desirable to maintain a record of the operation of the disk drive by performing periodic maintenance of the drive. Again, such a function ordinarily requires the disk drive to be taken off-line, and can cause severe and undesirable interruptions to the operation and the performance of the disk drive system, and hence of the overall computer system.