This invention relates to storage systems and more particularly relates to a system and method for executing preventive maintenance of storage array systems.
Redundant Arrays of Independent Disks (RAID) store large amounts of user data across a collection of disks. RAID is defined at a plurality of levels, such as levels 0 to 5, which differ in reliability, data availability, and cost performance.
In terms of reliability, the RAID protects the user data against loss or inaccessibility due to disk failures. Part of the RAID's physical storage capacity is used to store redundant or back-up data about the user data stored on the remainder of the physical storage capacity. The redundant data enables regeneration of the user data in the event that one of the array's member disks or the access path to it fails.
For example, a RAID system of level 4 (hereinafter, referred to as "RAID 4") usually includes a plurality of data disks for storing user data received from a host computer, a parity disk for storing parity data, and a spare disk for replacing one of the other disks if it fails. In RAID 4, the user data is divided into a plurality of data blocks having a predetermined sequential address and a predetermined size. RAID 4 creates a parity block by carrying out exclusive OR (XOR) operations with a set of corresponding data blocks sequentially addressed on different data disks. The set of corresponding data blocks and the parity block make a "parity group". Furthermore, the plurality of data blocks and the parity block are respectively distributed into the plurality of data disks and the parity disk in predetermined order.
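The creation of a parity block described above can be illustrated with a minimal sketch. The function name xor_blocks, the two-byte block size, and the sample byte values are assumptions made only for this illustration and do not appear in the description.

```python
from functools import reduce

def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks (hypothetical helper)."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must have the same size")
    # zip(*blocks) yields, for each byte position, the bytes of every block at that position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

# Corresponding data blocks at the same address on three data disks (made-up values).
data_blocks = [bytes([0x0F, 0xA0]), bytes([0x33, 0x0A]), bytes([0x55, 0xFF])]

# The parity block is the XOR of the corresponding data blocks.
parity_block = xor_blocks(data_blocks)

# The data blocks together with the parity block form one parity group.
parity_group = data_blocks + [parity_block]
```

Because every bit position of a parity group contains an even number of ones, the XOR of all blocks of the group, parity included, is an all-zero block. This property is what makes regeneration of a single missing block possible.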
In the event that one of the plurality of data disks or the parity disk fails completely and data on it becomes entirely unusable, RAID 4 regenerates a data block or a parity block of the failed disk using the remaining data blocks in the corresponding parity group and stores the regenerated data on the spare disk. This operation is referred to as "Hot Spare Operation".
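The regeneration step can be sketched as follows, under the assumption that each disk is modeled as a list of equally sized blocks and that the last disk in the list is the parity disk. The names xor_blocks, regenerate_block, and hot_spare_rebuild, and all sample values, are hypothetical and serve only as an illustration.

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks (hypothetical helper)."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

def regenerate_block(parity_group, failed_index):
    """Regenerate the block of the failed disk from the surviving blocks of its parity group."""
    survivors = [block for i, block in enumerate(parity_group) if i != failed_index]
    return xor_blocks(survivors)

def hot_spare_rebuild(disks, failed_index):
    """Return the contents to store on the spare disk, one regenerated block per parity group."""
    # zip(*disks) walks the disks address by address, yielding one parity group at a time.
    return [regenerate_block(list(group), failed_index) for group in zip(*disks)]

# Three data disks holding two blocks each (made-up values), plus a parity disk.
data_disks = [
    [b"\x01\x02", b"\x10\x20"],
    [b"\x04\x08", b"\x30\x40"],
    [b"\xff\x00", b"\x0f\xf0"],
]
parity_disk = [xor_blocks(list(group)) for group in zip(*data_disks)]
disks = data_disks + [parity_disk]

# Suppose the second data disk (index 1) fails completely.
spare_disk = hot_spare_rebuild(disks, failed_index=1)
```

In this sketch the spare disk ends up holding the same blocks as the failed disk. The same call with the index of the parity disk regenerates the parity blocks instead.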
The Hot Spare Operation usually fulfills its function when an actual disk failure occurs. However, in addition to recovery from an actual failure, it is also applicable to an exchange of disks in a preventive maintenance routine of the RAID. When it is applied to the preventive maintenance routine, the RAID detects and counts the total number of errors of every disk. In the event that the total number of errors of a particular disk exceeds a predetermined value ("threshold value"), the RAID system either raises an alarm indicating that the particular disk should be treated as a failed one and exchanged for a new disk, or automatically executes the Hot Spare Operation.
However, the RAID system judges when to execute the preventive maintenance only from the total number of errors, compared against a specified maximum number of errors. Consequently, the RAID cannot clearly distinguish an occasion when the errors are occurring at a normal error rate from an occasion when the errors are occurring at an abnormal error rate that requires preventive maintenance. There is therefore a possibility that the RAID fails to recognize a symptom of a fatal failure.
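The limitation of a pure total-count criterion can be shown with a small numerical sketch. The threshold of 100 errors, the time unit of hours, and the two error logs below are invented for this illustration only.

```python
THRESHOLD = 100  # hypothetical threshold value for the total number of errors

def exceeds_total_threshold(error_times, threshold=THRESHOLD):
    """Conventional criterion: look only at the total number of errors."""
    return len(error_times) > threshold

def errors_per_hour(error_times):
    """Average error rate over the span between the first and the last recorded error."""
    return len(error_times) / (error_times[-1] - error_times[0])

# Disk A: 101 errors, one every 240 hours, over about 1000 days of normal operation.
disk_a = [i * 240.0 for i in range(101)]

# Disk B: 90 errors within roughly the last 9 hours, a possible symptom of a fatal failure.
disk_b = [i * 0.1 for i in range(90)]

flag_a = exceeds_total_threshold(disk_a)  # True: the slowly aging disk is flagged
flag_b = exceeds_total_threshold(disk_b)  # False: the rapidly degrading disk is not
rate_a = errors_per_hour(disk_a)
rate_b = errors_per_hour(disk_b)
```

With these assumed values, the total-count criterion flags disk A and passes disk B, although the error rate of disk B is more than a thousand times that of disk A.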
Furthermore, after executing the Hot Spare Operation, the RAID generally disconnects the failed disk from the system. Consequently, the RAID has no tolerance for recovering from another disk failure until a new spare disk is attached. If another failure occurs before the new spare disk is attached, that failure causes an irretrievable data loss.
Accordingly, it is an object of the present invention to provide a system and method for executing preventive maintenance of the conventional storage array system to achieve higher reliability.
A storage array system, consistent with the present invention, comprises a plurality of data storage devices for storing data and a control unit for controlling input and/or output operations of the plurality of data storage devices. The control unit includes means for storing a history of self recovered errors for each of the plurality of data storage devices, means for calculating an error rate of each of the plurality of data storage devices on the basis of the history of errors, and means for judging a reliability of operation of each of the plurality of data storage devices from the error rate.
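One possible way to organize the three means recited above is sketched below. The description does not fix how the error rate is defined, so this sketch assumes a rate expressed as self recovered errors per hour within a sliding time window. The class name DiskErrorHistory, the window length, and the permitted rate are all hypothetical.

```python
from collections import deque

class DiskErrorHistory:
    """History of self recovered errors for one data storage device (illustrative only)."""

    def __init__(self, window_hours, max_rate_per_hour):
        self.window_hours = window_hours            # assumed length of the sliding window
        self.max_rate_per_hour = max_rate_per_hour  # assumed permitted error rate
        self._times = deque()                       # times of recorded errors, in hours

    def record(self, time_hours):
        """Store one self recovered error in the history."""
        self._times.append(time_hours)

    def error_rate(self, now_hours):
        """Calculate the error rate from the errors that fall inside the window."""
        cutoff = now_hours - self.window_hours
        while self._times and self._times[0] <= cutoff:
            self._times.popleft()  # discard errors older than the window
        return len(self._times) / self.window_hours

    def is_reliable(self, now_hours):
        """Judge reliability of operation from the calculated error rate."""
        return self.error_rate(now_hours) <= self.max_rate_per_hour

# One history per data storage device, as the control unit would keep them.
histories = {disk_id: DiskErrorHistory(window_hours=24, max_rate_per_hour=0.5)
             for disk_id in range(2)}

# Disk 0: three errors spread over 200 hours.
for t in (0.0, 100.0, 200.0):
    histories[0].record(t)

# Disk 1: twenty errors within the last 10 hours.
for i in range(20):
    histories[1].record(190.0 + 0.5 * i)

now = 200.0
judgments = {disk_id: history.is_reliable(now) for disk_id, history in histories.items()}
```

With these assumed values, disk 0 is judged reliable (one error in the last 24 hours) and disk 1 is judged unreliable (twenty errors in the last 24 hours), even though neither disk has accumulated a large total number of errors.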
A storage array system, consistent with the present invention, comprises a plurality of data storage devices for storing data, a spare storage device for replacing one of the plurality of data storage devices, and a control unit for controlling input and/or output operations of the plurality of data storage devices and the spare storage device. The control unit includes means for storing a history of self recovered errors for each of the plurality of data storage devices, means for calculating an error rate of each of the plurality of data storage devices on the basis of the history of errors, means for judging a necessity to execute preventive maintenance of each of the plurality of data storage devices from the error rate, and means for executing the preventive maintenance.