Personal computer systems are constantly improving in terms of speed, reliability, and processing capability. As a result, computers are able to handle more complex and sophisticated applications. As computers improve, performance demands placed on data storage systems increase. In particular, there is an increasing need to design data storage systems which protect data in the event of system degradation or failure.
As an initial step toward the goal of data preservation, advances have been made at the component level. Today, some components in a data storage system have enough intelligence to self-detect failures. For example, disk drives have built-in intelligence which senses and isolates bad sectors, and remaps their data into other operable sectors in an effort to ensure data quality.
Redundancy is one technique that has evolved for preserving data at the component level. The term "RAID" (Redundant Array of Independent Disks) is often used to describe a disk array in which part of the physical storage capacity of the disk array is used to store redundant information about the user data stored on the remainder of the storage capacity. The redundant information enables regeneration of user data in the event that one of the array's member disks or the access path to it fails.
In general, there are two common methods of storing redundant data. According to the first or "mirror" method, data is duplicated and stored in two separate areas of the storage system. For example, in a disk array, the identical data is provided on two separate disks in the disk array. The mirror method has the advantages of high performance and high data availability due to the duplex storing technique.
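The mirror method can be illustrated with a minimal sketch. It assumes a toy in-memory model in which each "disk" is a Python dictionary; the class name `MirroredStore`, its methods, and the `failed` flags are hypothetical and only serve to show duplicated writes and a read that falls back to the surviving copy.

```python
class MirroredStore:
    """Toy mirror: every block is written to two independent 'disks' (dicts)."""

    def __init__(self):
        self.disks = [{}, {}]          # two hypothetical member disks
        self.failed = [False, False]   # simulated failure flags, one per disk

    def write(self, block_id, data):
        # Duplicate the write on every disk that is still operating.
        for disk, failed in zip(self.disks, self.failed):
            if not failed:
                disk[block_id] = data

    def read(self, block_id):
        # Serve the read from the first surviving copy that holds the block.
        for disk, failed in zip(self.disks, self.failed):
            if not failed and block_id in disk:
                return disk[block_id]
        raise IOError("block unavailable on both mirrors")


store = MirroredStore()
store.write(7, b"user data")
store.failed[0] = True                 # simulate loss of one member disk
recovered = store.read(7)              # still served from the second copy
```

In this sketch the data stays readable after one disk is lost, at the cost of storing every block twice.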
In the second or "parity" method, a portion of the storage area is used to store redundant data, but the size of the redundant storage area is less than the remaining storage space used to store the original data. For example, in a disk array having five disks, four disks might be used to store data with the fifth disk being dedicated to storing redundant data. The parity method is advantageous because it is less costly than the mirror method, but it also has lower performance and availability characteristics in comparison to the mirror method.
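The five-disk example can be sketched as follows. The text does not say how the redundant data is computed; this sketch assumes a bytewise XOR parity, which is a common choice, and uses short byte strings as hypothetical stand-ins for disk contents. The helper `xor_blocks` and all variable names are illustrative.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Four data disks plus one dedicated parity disk (assumed XOR parity).
data_disks = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity_disk = xor_blocks(data_disks)

# Simulate the loss of one data disk and regenerate it from the survivors.
lost_index = 2
survivors = [blk for i, blk in enumerate(data_disks) if i != lost_index]
regenerated = xor_blocks(survivors + [parity_disk])

# Fraction of raw capacity spent on redundancy in each method.
mirror_overhead = 1 / 2    # every block is stored twice
parity_overhead = 1 / 5    # one disk out of five holds redundant data
```

Under these assumptions, `regenerated` equals the contents of the lost disk, and the parity layout spends a smaller share of raw capacity on redundancy than mirroring does. The regeneration step has to read every surviving disk, which is one reason the parity method tends to perform worse than mirroring when a disk has failed.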
Storage systems are becoming more complex, and typically involve a sophisticated interconnection of many individual components. An example storage system might comprise disk arrays, controllers, software, archival-type storage units, power supplies, interfacing and bussing, fans, a cabinet, etc. As storage systems become more complex, the traditional component level techniques of detecting failure are not well suited to analyzing the storage system as a whole. For instance, examining a disk drive for possible failure addresses only a single piece of the puzzle concerning operation of the entire storage system, and the result can be misleading if the remaining components of the system are not taken into account.
For example, RAID algorithms have the ability to reconstruct data upon detection of a disk drive failure. Drive failure is often expressed in terms of performance parameters such as seek-time and G-list (the drive's grown defect list). Whenever the drive is not behaving as expected according to its performance parameters, the RAID system may be tempted to reconstruct the disk drive in an effort to cure the irregularity. The disk drive problem, however, may not be due to the drive itself at all, but might instead be caused by some external factor such as a controller error or a software bug. In this case, looking at a particular disk drive characteristic such as seek-time or G-list is meaningless.
As the number of components within a data storage system grows and the interdependency among the components increases, problems arising at the component level may detrimentally affect operation of other components and thus the system as a whole. Additionally, the various component problems affect the system to different degrees. There is therefore a need to design data storage systems that look globally at the entire data storage system rather than only at the component level.
Today, there are two software programs available which merely warn the user of a component failure within a data storage system. One software product is sold by Compaq under the trademark INSIGHT MANAGER™, and the other is sold by Hewlett-Packard under the trademark OPENVIEW™. However, neither of these software products evaluates the storage system as a whole or relates how the failed component impacts the operation of the entire system.
The system user is perhaps most interested in the practical impact that a component failure has on data availability of the data storage system. "Availability" is the ability to recover data stored in the storage system even though some of the data has become inaccessible due to failure or some other reason, together with the ability to ensure continued operation in the event of such failure. "Availability" also concerns how readily data can be accessed as a measure of storage system performance.
Accordingly, there is a need to evaluate the entire system from a global perspective, and translate any component failure into impact on data availability. This information is of value to the user so that the user can intelligently decide whether corrective action is desired and what steps should be taken to cure the failed component.