As the value and use of information continue to increase, individuals and businesses seek additional ways to process and store information. One option available to users is information handling systems. An information handling system generally processes, compiles, stores, and/or communicates information or data for business, personal, or other purposes, thereby allowing users to take advantage of the value of the information. Because technology and information handling needs and requirements vary between different users or applications, information handling systems may also vary regarding what information is handled, how the information is handled, how much information is processed, stored, or communicated, and how quickly and efficiently the information may be processed, stored, or communicated. The variations in information handling systems allow for information handling systems to be general or configured for a specific user or specific use such as financial transaction processing, airline reservations, enterprise data storage, or global communications. In addition, information handling systems may include a variety of hardware and software components that may be configured to process, store, and communicate information and may include one or more computer systems, data storage systems, and networking systems.
An ongoing concern in many information handling systems, and particularly data storage systems, is data reliability. Of course, many solutions have been developed to increase data reliability, including for example, the utilization of RAID (redundant array of independent disks) systems, which, in general, combine multiple disk drives into a logical unit with data being distributed across the drives in one of several ways called RAID levels, depending on the level of capacity, redundancy, and performance desired or required. See, David A. Patterson, Garth Gibson, and Randy H. Katz: A Case for Redundant Arrays of Inexpensive Disks (RAID); University of California Berkeley, 1988. RAID techniques have generally increased data reliability.
Nonetheless, there are several scenarios where failing disks can leave user data in an unrecoverable state. For example, in one single-redundant RAID scenario, a particular disk may accumulate too many error recovery attempts and thus trigger a rebuild in order to migrate each RAID extent from the failing disk to a spare disk. While an extent is rebuilding, data on another disk within the rebuilding stripe may become unreadable due to a latent error, i.e., an error that is not readily apparent because the data block has been written to, yet the data cannot be read back. A read of this data is required, however, in order to reconstruct the data of the disk being rebuilt; the rebuild therefore cannot continue, leaving the user data in an unrecoverable state.
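The single-redundant failure scenario described above can be illustrated with a minimal sketch. It assumes simple XOR parity of the kind used in RAID 5; the function names (xor_blocks, rebuild_block), the tiny two-byte blocks, and the use of None to model an unreadable block are all hypothetical choices made for illustration, not details taken from any particular RAID controller.

```python
from functools import reduce
from typing import List, Optional


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_block(stripe: List[Optional[bytes]], failing: int) -> bytes:
    """Reconstruct the block at index `failing` from the other blocks in the stripe.

    A value of None models a block that cannot be read (a latent error).
    With single redundancy, every surviving block must be readable.
    """
    survivors = [blk for i, blk in enumerate(stripe) if i != failing]
    if any(blk is None for blk in survivors):
        raise IOError("latent error on a surviving disk; stripe cannot be rebuilt")
    return xor_blocks(survivors)


# Hypothetical stripe: three data blocks plus one parity block.
data = [b"\x01\x02", b"\x10\x20", b"\x0f\x0f"]
parity = xor_blocks(data)
stripe: List[Optional[bytes]] = data + [parity]

# Healthy case: the block on the failing disk (index 1) is recovered.
recovered = rebuild_block(stripe, failing=1)

# Latent-error case: a second disk (index 2) is unreadable during the rebuild.
damaged = list(stripe)
damaged[2] = None
try:
    rebuild_block(damaged, failing=1)
    rebuild_failed = False
except IOError:
    rebuild_failed = True
```

In the healthy case the missing block is recovered from the remaining data and parity; once a second block in the same stripe is unreadable, a single-parity stripe no longer holds enough information to proceed, which is the unrecoverable state the paragraph describes.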
Data scrubbing has been introduced as a means for periodically reading and checking, by the RAID controller, all the blocks in a RAID array to detect bad blocks before they are used. However, conventional RAID scrubbing does not detect latent errors quickly enough to significantly improve data reliability. The conventional RAID scrub operation works on a single RAID device at a time and works on RAID logical block addresses rather than, conceptually speaking, “vertically” on disks or disk extents. As a scrub progresses through the stripes on a RAID device, it sends input/output (I/O) to all of the disks associated with the RAID device. In the case where a particular disk is suspect, the scrub still reads all of the other disks of the RAID device, which wastes valuable time when the suspect disk is in immediate jeopardy of failing. Additionally, in a system with multiple disk tiers, the conventional RAID scrub operation is not prioritized to favor particular disk types, such as those that have a higher tendency for failure. For example, if the disks in a lower, less expensive storage tier are prone to failing relatively more often than disks in other, relatively more expensive storage tiers, time spent scrubbing disks in the higher, relatively more expensive storage tiers may essentially be wasted.
In view of the foregoing, if it is suspected that a disk is in jeopardy of failing, it can be very useful to know, before degrading that disk for replacement, that the associated RAID stripes of all extents on that disk can be read in order to reconstruct all the data, or as much data as possible, residing on the failing disk. With the conventional RAID scrub operation, there is generally no way to quickly and efficiently determine this, absent launching scrubs on all RAID devices associated with all disks within a storage tier. Launching scrubs on all RAID devices associated with all disks within a storage tier, however, is simply too slow and consumes too many resources. A specific example of this problem is provided in FIG. 1, which illustrates an example data storage system 100 showing 10 separate disks, with only “Disk X” 102, which is illustrated vertically in the figure for simplicity, being fully shown and labeled for purposes of discussion. As can be seen from FIG. 1, data has been distributed across the 10 shown disks in three RAID configurations: RAID 5 spread across 5 extents; RAID 10 spread across 2 extents; and RAID 6 spread across 6 extents. As will be appreciated by those skilled in the art, the actual physical configuration and layout of the extents and RAID stripe data will typically depend on several factors; accordingly, FIG. 1 serves only as a conceptual example for purposes of discussion. Consider Disk X to be failing or otherwise returning too many significant errors. In order to determine whether all the data on Disk X can be reconstructed utilizing the conventional scrubbing operation, the entirety of the RAID 5, RAID 10, and RAID 6 devices would need to be scrubbed. However, this determination could be made more efficiently if there were a system and method for reading or surveying only the information contained in the horizontal stripes shown in this diagram.
Now, consider that the data storage system 100 comprises a significantly larger number of disks than simply the 10 shown, for example, 90 additional disks with data similarly distributed; the efficiency gained from such a novel system and method would be significantly greater.
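The difference in work between scrubbing every RAID device and surveying only the stripes that touch a suspect disk can be made concrete with a rough sketch. The layout below is purely hypothetical and is not the layout of FIG. 1: it assumes each stripe occupies consecutive disks in a rotating pattern, that each RAID device holds 100 stripes, and that the stripe widths are the 5, 2, and 6 extents mentioned above. The names build_layout and stripes_touching are likewise illustrative only.

```python
from typing import Dict, List, Tuple

# A stripe is modeled as (RAID device name, disks holding that stripe's extents).
Stripe = Tuple[str, Tuple[int, ...]]

# Stripe widths taken from the example above; everything else is assumed.
WIDTHS: Dict[str, int] = {"RAID 5": 5, "RAID 10": 2, "RAID 6": 6}


def build_layout(num_disks: int, stripes_per_device: int) -> List[Stripe]:
    """Place each stripe on consecutive disks, rotating the starting disk.

    Assumes num_disks is at least as large as the widest stripe.
    """
    layout: List[Stripe] = []
    for device, width in WIDTHS.items():
        for s in range(stripes_per_device):
            disks = tuple((s + k) % num_disks for k in range(width))
            layout.append((device, disks))
    return layout


def stripes_touching(layout: List[Stripe], disk: int) -> List[Stripe]:
    """Return only the stripes that have an extent on the given disk."""
    return [stripe for stripe in layout if disk in stripe[1]]


suspect_disk = 0  # stands in for "Disk X"

for num_disks in (10, 100):
    layout = build_layout(num_disks, stripes_per_device=100)
    full_scrub = len(layout)                               # stripes read by scrubbing every device
    targeted = len(stripes_touching(layout, suspect_disk))  # stripes that involve the suspect disk
    print(f"{num_disks} disks: full scrub reads {full_scrub} stripes, "
          f"targeted survey reads {targeted}")
```

Under these assumptions the number of stripes that involve the suspect disk shrinks as the disk count grows, while a scrub of every RAID device still reads every stripe, which is the effect described for a system with many more disks.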
Accordingly, the conventional RAID scrub operation is insufficient to determine the desired information relating to associated RAID stripes of all extents on a failing disk. There is a need in the art for improved methods for determining this information at a disk or disk extent level. More generally, there is a need in the art for systems and methods for surveying a data storage system for latent errors, and particularly, systems and methods for surveying a data storage subsystem or other information handling system for latent errors prior to disk failure, thereby improving fault tolerance.