a. Field of the Invention
The present invention pertains generally to data storage and more specifically to a system and method of active reliability management for data storage systems.
b. Description of the Background
Data storage systems can comprise an array of disc drives connected to one or more disc array controllers using one or more buses. Disc array controllers may be connected to one or more host systems using one or more buses. Data storage formats, such as RAID (Redundant Array of Independent Discs), may be employed to distribute user data and redundant information across multiple drives such that if a drive fails, user data may be copied, regenerated, or reconstructed (regenerated and copied to another drive) from remaining operating drives. Systems may also employ redundant controllers and/or buses such that if a connection path or controller fails, another path or controller may be available to transfer data and commands.
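As an illustration of how user data may be regenerated from the remaining operating drives, the following is a minimal sketch that assumes a single-parity scheme in which the redundant information is the byte-wise XOR of the data blocks. The text above does not specify a particular RAID level, so the parity scheme, the function names `xor_blocks` and `regenerate`, and the sample blocks are assumptions made for illustration only.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length byte blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def regenerate(surviving_blocks):
    """Regenerate the block of a single failed drive.

    Assumes single-parity XOR redundancy: the missing block equals the XOR
    of every surviving data block together with the parity block.
    """
    return xor_blocks(surviving_blocks)


# Hypothetical stripe: three data drives plus one parity drive.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Suppose the drive holding data[1] fails; rebuild its block from the rest.
recovered = regenerate([data[0], data[2], parity])
```

In this sketch `recovered` equals the lost block `data[1]`. Writing that block to a spare drive would correspond to the reconstruction case described above.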
The ability of a data storage system to operate for long periods of time without failure reflects the number and quality of its components and directly affects the value and marketability of the system. Selection of components with very long MTBF (mean time between failures) ratings can increase the probable operating life of a storage system, but usually at increased cost. Higher levels of redundancy, such as additional spare controllers, buses, and/or storage devices, may also result in higher system cost.
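The dependence of operating life on the number and quality of components can be sketched with a simple series model. The sketch below assumes constant, independent failure rates and treats any single component failure as a system failure, so it deliberately ignores redundancy. The function name `series_mtbf` and the MTBF figures are hypothetical and do not come from any particular component.

```python
def series_mtbf(component_mtbfs_hours):
    """Approximate system MTBF for components in series.

    Assumes constant, independent failure rates, so the system failure
    rate is the sum of the component failure rates (1 / MTBF each).
    """
    if not component_mtbfs_hours:
        raise ValueError("at least one component is required")
    if any(mtbf <= 0 for mtbf in component_mtbfs_hours):
        raise ValueError("MTBF values must be positive")
    total_failure_rate = sum(1.0 / mtbf for mtbf in component_mtbfs_hours)
    return 1.0 / total_failure_rate


# Hypothetical ratings, in hours.
one_drive = series_mtbf([1_000_000])
ten_drives = series_mtbf([1_000_000] * 10)        # more components
ten_better_drives = series_mtbf([2_000_000] * 10)  # higher-quality components
```

Under these assumptions, ten identical drives yield roughly one tenth of the single-drive MTBF, and doubling each drive's rating doubles the result. This is one way to see why both component count and component quality affect probable operating life.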
While systems with high levels of redundancy may be repaired without loss of data, throughput and data availability may be limited until failed components are replaced. Scheduled replacement of components at times of low system utilization may increase data availability during times of high demand, but at the additional cost of parts and labor. Scheduled replacement may also result in higher component operating costs, since some components that would otherwise continue to function may be replaced.
Present storage system management methods are reactive in nature; one example is suggesting replacement of drives with error rates above a predefined threshold. This method is problematic in that a threshold set too high fails to identify components prior to failure, whereas a threshold set too low results in unnecessary replacement of components, incurring additional cost and downtime.
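The threshold method described above can be sketched as follows, to show how the choice of threshold trades missed failures against unnecessary replacements. The `Drive` record, the error-rate units, the threshold values, and the `will_fail` field (ground truth that would be known only in hindsight) are all hypothetical and serve only to illustrate the trade-off.

```python
from dataclasses import dataclass


@dataclass
class Drive:
    name: str
    error_rate: float  # hypothetical unit: errors per million operations
    will_fail: bool    # hindsight ground truth, for illustration only


def flag_for_replacement(drives, threshold):
    """Return names of drives whose error rate is above the threshold."""
    return [d.name for d in drives if d.error_rate > threshold]


def evaluate(drives, threshold):
    """Count unnecessary replacements and missed failures for a threshold."""
    flagged = set(flag_for_replacement(drives, threshold))
    unnecessary = [d.name for d in drives if d.name in flagged and not d.will_fail]
    missed = [d.name for d in drives if d.name not in flagged and d.will_fail]
    return unnecessary, missed


# Hypothetical population of four drives.
drives = [
    Drive("d0", 0.5, False),
    Drive("d1", 2.0, False),
    Drive("d2", 3.0, True),
    Drive("d3", 8.0, True),
]

result_at_1 = evaluate(drives, threshold=1.0)  # flags d1, d2, d3
result_at_5 = evaluate(drives, threshold=5.0)  # flags only d3
```

With this sample data, a threshold of 1.0 flags the healthy drive `d1` for replacement, while a threshold of 5.0 leaves the failing drive `d2` unflagged. No single fixed value avoids both outcomes for this population, which is the weakness of a purely threshold-based method.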