1. Field of the Invention
The invention relates to maintenance and storage of data within a storage system. Specifically, the invention relates to apparatus, systems, and methods for developing failure prediction software for a storage system.
2. Description of the Related Art
High density, removable media storage libraries are used to provide large quantities of storage in a computer system. Typically, such data storage systems are employed for backup or other secondary storage purposes, but may be used as primary storage in circumstances that are conducive to sequential data access and the like.
The data is stored on media cartridges, such as magnetic or optical disks, that are arranged in storage bins and accessed when data on a cartridge is requested. Generally, the data on a media cartridge is referred to as a volume. The data on a cartridge is accessed using a drive configured to read and write to the media of the cartridge. A data storage system may have many drives. Unfortunately, a drive or a media cartridge may fail, such that data is permanently lost. Such failure is typically caused by regular, repeated use of the drive and the volume. For example, a tape library may include three drives and ten times that number of media cartridges. The media cartridges are repeatedly mounted and unmounted in the drives in response to various data storage transactions.
Failures of a drive or volume to properly perform are generally categorized as one of three types of errors. A soft error is one in which data is not properly read from or written to a storage medium such as tape, but the error is correctable without affecting the data throughput in completing a data storage transaction. One example of a soft error is a write skip, in which the drive writes data, reads the data back to verify accuracy, identifies a discrepancy, reverses direction, and re-writes the same data, which is then read back as accurately stored.
A temporary error is one in which data is lost, or an operation fails, but the error may be overcome using well-known recovery techniques performed by the data storage system. One example of a temporary error is when a block of data read from or written to a tape fails a Cyclic Redundancy Check (CRC). Such an error is typically recoverable but delays the data operation. A temporary error affects data throughput for the data storage transaction. For example, in response to a temporary error, a tape drive may stop advancing and reverse to allow for a second attempt at reading or writing to the storage media.
A permanent error, or hard error, is one in which data is lost, or an operation fails, and the data storage system is unable to recover the data or complete the operation as requested. One example of a permanent error is an attempt to read data from a portion of tape having a longitudinal crease. Permanent errors within a drive or volume may have serious consequences because data may be lost. Generally, soft errors are resolved by a media drive and are not reported. A media drive reports temporary errors and permanent errors to a host. Of course, those of skill in the art are familiar with many other examples of soft errors, temporary errors, and permanent errors that may be tracked.
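By way of illustration only, the three error categories and the general reporting behavior described above might be modeled as follows. The names `ErrorType` and `reported_to_host` are hypothetical and are not taken from any actual drive microcode; the sketch assumes the general case in which soft errors are resolved within the drive and are not reported.

```python
from enum import Enum


class ErrorType(Enum):
    """Hypothetical labels for the three error categories described above."""
    SOFT = "soft"            # corrected by the drive; throughput unaffected
    TEMPORARY = "temporary"  # recovered by the system, but delays the operation
    PERMANENT = "permanent"  # unrecoverable; data lost or operation not completed


def reported_to_host(error_type: ErrorType) -> bool:
    """Assumes the general case: only soft errors stay inside the drive."""
    return error_type is not ErrorType.SOFT


# List the categories that a host would generally see.
reported = [e.name for e in ErrorType if reported_to_host(e)]
print(reported)  # ['TEMPORARY', 'PERMANENT']
```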
Generally, before a drive or volume experiences a permanent error, the drive or volume presents a trend of soft and temporary errors. This trend may, however, be sporadic. It is desirable to identify failing drives and volumes before one or more permanent errors occur by identifying these trends. Accordingly, performance data is collected for each drive and each volume. The performance data may be collected for each mount of a volume, for each transaction conducted on the drive, or for a combination of these over time. The performance data may include error-related measurements, such as the number of blocks successfully processed before a soft error. In addition, performance data may include a total number of soft errors for a given mount of a volume, or for the life of the drive.
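As a minimal sketch of how such performance data might be organized, the following example keeps one record per mount of a volume and totals soft errors per drive or per volume. The record layout, the field names, and the sample values are all assumptions made for illustration; they do not describe the data format of any particular storage system.

```python
from dataclasses import dataclass


@dataclass
class MountRecord:
    """Hypothetical performance data collected for one mount of a volume."""
    volume_id: str
    drive_id: str
    soft_errors: int
    temporary_errors: int
    blocks_processed: int


def soft_errors_for_drive(records, drive_id):
    """Total soft errors seen on one drive across all recorded mounts."""
    return sum(r.soft_errors for r in records if r.drive_id == drive_id)


def soft_errors_for_volume(records, volume_id):
    """Total soft errors seen on one volume across all recorded mounts."""
    return sum(r.soft_errors for r in records if r.volume_id == volume_id)


# Invented sample data: three mounts across two drives and two volumes.
records = [
    MountRecord("VOL001", "DRV_A", 2, 0, 10_000),
    MountRecord("VOL002", "DRV_A", 5, 1, 8_000),
    MountRecord("VOL001", "DRV_B", 3, 0, 12_000),
]

print(soft_errors_for_drive(records, "DRV_A"))    # 7
print(soft_errors_for_volume(records, "VOL001"))  # 5
```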
Conventional data storage systems collect a large quantity of complex performance data. Software engineers have written complicated software to identify a failing drive or volume based on the performance data. Typically, this software uses as much performance data as possible to determine from past and present performance of a drive or volume whether the drive or volume will likely fail soon and cause a permanent error. Generally, the software includes a series of conditions defined by discrete threshold values. If the performance data crosses the threshold value, the software causes the data storage system to advise a user to service or replace the specific drive or volume.
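The conventional threshold approach described above can be sketched, under stated assumptions, as a set of fixed limits checked against the collected performance data. The metric names, the limit values, and the functions `crossed_thresholds` and `advise` are hypothetical; actual conventional software is described as considerably more complicated.

```python
from typing import Dict, List, Optional

# Hypothetical discrete threshold values, one per performance metric.
THRESHOLDS: Dict[str, int] = {"soft_errors": 50, "temporary_errors": 5}


def crossed_thresholds(performance: Dict[str, int],
                       thresholds: Dict[str, int]) -> List[str]:
    """Return the names of the metrics whose values exceed their limits."""
    return [name for name, limit in thresholds.items()
            if performance.get(name, 0) > limit]


def advise(unit_id: str, performance: Dict[str, int],
           thresholds: Dict[str, int] = THRESHOLDS) -> Optional[str]:
    """Produce a service advisory if any threshold is crossed, else None."""
    crossed = crossed_thresholds(performance, thresholds)
    if crossed:
        return f"Service or replace {unit_id}: exceeded {', '.join(crossed)}"
    return None


print(advise("DRV_A", {"soft_errors": 51, "temporary_errors": 0}))
print(advise("DRV_B", {"soft_errors": 50, "temporary_errors": 5}))
```

In this sketch the first call yields an advisory and the second yields `None`, because a value equal to the limit has not crossed it. A one-count difference therefore flips the outcome, which is the rigidity of discrete thresholds.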
Unfortunately, this conventional software has several limitations. First, the high number of input variables available within the performance data results in complicated software and routines that require an experienced software engineer to modify and refine. Consequently, those with the most experience with the data storage systems are not directly involved in developing algorithms to identify failing drives and volumes.
Second, discrete threshold values in conventional software often do not adequately reflect the relationship between different values in the performance data. Because of the many different operating conditions a data storage system may experience, crossing a discrete threshold does not necessarily indicate a direct cause-and-effect relationship between the performance data and imminent failure of the drive or volume. Other factors, such as unusually high performance demands, may cause performance data to cross discrete thresholds. Consequently, the conventional software produces costly false positives, reporting that a drive or a volume should be repaired or replaced when the drive or volume is in fact in satisfactory condition.
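The false-positive problem can be illustrated with a small, purely hypothetical example. Two drives have the same underlying soft-error rate, but one handles ten times the workload. A discrete limit on the raw error count flags only the busy drive. The limit and the workload figures are invented for illustration.

```python
SOFT_ERROR_LIMIT = 50  # hypothetical discrete threshold on raw soft-error count


def flagged(soft_errors: int, limit: int = SOFT_ERROR_LIMIT) -> bool:
    """Conventional check: flag the drive when the raw count exceeds the limit."""
    return soft_errors > limit


def errors_per_million_blocks(soft_errors: int, blocks: int) -> int:
    """Workload-normalized error rate, used here only for comparison."""
    return soft_errors * 1_000_000 // blocks


# Invented workloads: identical error rate, tenfold difference in demand.
normal_load = {"blocks": 100_000, "soft_errors": 10}
heavy_load = {"blocks": 1_000_000, "soft_errors": 100}

for label, data in (("normal", normal_load), ("heavy", heavy_load)):
    rate = errors_per_million_blocks(data["soft_errors"], data["blocks"])
    print(label, "rate per million blocks:", rate,
          "flagged:", flagged(data["soft_errors"]))
```

Both drives show 100 soft errors per million blocks, yet only the heavily loaded drive is flagged. The normalized rate appears here solely to expose the discrepancy and is not presented as the solution.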
Third, the conventional software includes predefined thresholds that determine when repair or replacement is advised. Certain end-users may desire that the software be more sensitive to the risk of data loss. Currently, an end-user is unable to balance the risk of losing data due to a permanent failure of a drive or a volume against the costs of following the advised repair or replacement of the drive or the volume.
Fourth, the conventional software has a rigid, prolonged development cycle. The software is typically implemented in microcode of a drive or a sub-system of the data storage system by software engineers who lack the extensive experience of those involved in the day-to-day operations of data storage systems. Modification of the software typically requires changing the programming code, compiling the programming code into microcode, uploading the microcode, and testing the microcode on a drive or in a physical test environment using a battery of tests. If the software fails to perform as expected, this time-consuming modification process must be repeated.
Thus, it would be an advancement in the art to provide an apparatus, system, and method for developing failure prediction algorithms in which those most experienced with data storage system drives and volumes may contribute directly to the designing and drafting of failure prediction algorithms. In addition, it would be an advancement in the art to provide an apparatus, system, and method that accommodates the imprecision inherent in forecasting failure of drives or volumes of a data storage system. It would be a further advancement in the art to provide an apparatus, system, and method that allows an end-user to adjust the sensitivity of the failure prediction algorithm according to the amount of risk of data loss the end-user is willing to bear. Furthermore, it would be an advancement in the art to provide an apparatus, system, and method that shortens the development cycle for a failure prediction algorithm and facilitates the testing of the failure prediction algorithms developed. Such an apparatus, system, and method are provided herein.