In general, data storage arrays (herein also referred to as “data storage systems”, “disk storage arrays”, “disk arrays”, or simply “arrays”) are called upon to store and manage increasingly large amounts of data, e.g., gigabytes, terabytes, petabytes, and beyond. As a result, it is increasingly common or necessary for this data to be distributed across multiple storage devices (e.g., hard disk drives, etc.) or other storage entities.
It will be known that some conventional data storage arrays treat a collection of storage devices as a unified pool of data storage space that is divided into equal-sized portions or slices. These data storage arrays can then allocate the slices to logical units. A logical unit can be a subset of a single storage device (e.g., a hard disk drive may contain multiple logical units), an entire storage device, or a span of multiple storage devices (e.g., a logical unit may be distributed across multiple storage devices organized into a redundant array of inexpensive disks (RAID)).
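By way of illustration only, the slice-based pooling described above can be sketched as follows. The class name `SlicePool`, the 1 GB slice size, and the device sizes are hypothetical and chosen purely for the example; they are not taken from any particular array.

```python
from dataclasses import dataclass, field

SLICE_SIZE_GB = 1  # hypothetical; the text only requires that slices be equal sized


@dataclass
class SlicePool:
    """Unified pool of storage space divided into equal-sized slices."""
    total_slices: int
    # Maps slice index -> name of the owning logical unit (absent means free).
    owner: dict = field(default_factory=dict)

    @classmethod
    def from_devices(cls, device_sizes_gb):
        # The pool treats all devices as one space, so only the total matters.
        return cls(total_slices=sum(device_sizes_gb) // SLICE_SIZE_GB)

    def free_slices(self):
        return [i for i in range(self.total_slices) if i not in self.owner]

    def allocate(self, lun, count):
        """Allocate `count` free slices to the logical unit named `lun`."""
        free = self.free_slices()
        if len(free) < count:
            raise RuntimeError("not enough free slices")
        taken = free[:count]
        for i in taken:
            self.owner[i] = lun
        return taken


# Two 4 GB devices form one pool of eight slices.
pool = SlicePool.from_devices([4, 4])
lun0_slices = pool.allocate("lun0", 3)  # fits within the first device
lun1_slices = pool.allocate("lun1", 5)  # spans both devices
```

In this sketch, the second logical unit receives slices that fall on both devices, and no free slices remain in the pool once both allocations complete.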
Some of these conventional data storage arrays are also equipped with a recovery program which enables the conventional data storage arrays to recover from corrupted metadata in connection with a storage object (e.g., storage pool, etc.). Along these lines, suppose that corrupted metadata is detected in connection with a storage pool. In this situation, the pool is taken offline and the recovery program is started. To run properly, the recovery program borrows slices from the pool of slices, and then uses the borrowed slices as scratch space to recover the metadata (e.g., the recovery program may apply error checking and error correction algorithms to remaining uncorrupted portions of file system metadata to recreate the metadata). Once the metadata is properly recovered, the recovery program terminates and the borrowed slices are released back to the pool.
It should be understood, though, that this approach to recovery is not without problems. For example, it is possible for a data storage array to allocate all of the slices of the pool to logical units. In such a situation, suppose that the data storage array then discovers a pool requiring recovery. Unfortunately, since there are no available slices left in the pool for the recovery program to borrow, the recovery program is unable to run, and data recovery fails. That is, the lack of available slices prevents (i.e., starves out) the recovery program from operating, with the result that what may initially have been only a DU situation (i.e., data unavailable situation) is escalated to a DL situation (i.e., data lost situation).
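The borrow-and-release behavior, and the starvation that arises when no slices are free, can be illustrated with a minimal sketch. The names `recover_metadata` and `RecoveryStarved`, and the `rebuild` callable that stands in for the actual metadata reconstruction, are assumptions made for the example only.

```python
class RecoveryStarved(Exception):
    """Raised when the pool has no free slices to lend as scratch space."""


def recover_metadata(free_slices, scratch_needed, rebuild):
    """Borrow scratch slices, run the rebuild, then return the slices.

    free_slices: mutable list of free slice indices (stands in for the pool).
    rebuild: hypothetical callable that recreates metadata using the scratch slices.
    """
    if len(free_slices) < scratch_needed:
        raise RecoveryStarved(
            f"need {scratch_needed} scratch slices, only {len(free_slices)} free"
        )
    borrowed = [free_slices.pop() for _ in range(scratch_needed)]
    try:
        return rebuild(borrowed)
    finally:
        # Release the borrowed slices back to the pool, even if rebuild fails.
        free_slices.extend(borrowed)


# Case 1: free slices exist, so recovery runs and the slices are returned.
free = [0, 1, 2]
result = recover_metadata(free, 2, lambda scratch: len(scratch))

# Case 2: every slice is already allocated, so recovery cannot start.
starved = False
try:
    recover_metadata([], 2, lambda scratch: None)
except RecoveryStarved:
    starved = True
```

The second case corresponds to the fully allocated pool described above: the recovery routine fails before doing any work because there is nothing to borrow.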
In order to deal with this problem, techniques were introduced in which slices were pre-allocated from the general pool of slices to support recovery. With such pre-allocation, there may be an adequate amount of storage to use as scratch space/work space when recovering metadata. However, the pre-allocation of slices does not completely eliminate the chance of a data loss situation. For example, a slice allocation table (SAT) that is used to record information about each slice (e.g., the logical unit that is using the slice, whether the slice is free or allocated, etc.) may become corrupted and allow a slice originally pre-allocated for pool recovery to be handed out to a logical unit.
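As a simplified illustration of this failure mode, the sketch below models the SAT as a list with one entry per slice. The list layout, the `RESERVED` marker, and the `allocate_from_sat` helper are hypothetical; an actual SAT records more information per slice and may be corrupted in other ways.

```python
RESERVED = "recovery-reserved"  # hypothetical marker for pre-allocated slices
FREE = None                     # hypothetical marker for an unallocated slice


def allocate_from_sat(sat, lun):
    """Hand the first slice that the SAT reports as free to `lun`."""
    for index, owner in enumerate(sat):
        if owner is FREE:
            sat[index] = lun
            return index
    raise RuntimeError("no free slices")


# Four slices: two owned by logical units, two pre-allocated for recovery.
sat = ["lun0", "lun1", RESERVED, RESERVED]

# Simulated corruption: the entry for slice 2 is reset to "free".
sat[2] = FREE

# The allocator trusts the SAT and hands the formerly reserved slice out.
handed_out = allocate_from_sat(sat, "lun2")
reserved_left = sat.count(RESERVED)
```

Because the allocator relies solely on the table, it cannot distinguish a genuinely free slice from a reserved slice whose entry has been corrupted, and the scratch space available for recovery shrinks from two slices to one.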
In light of the above, there is, therefore, a need for other approaches for dealing with recovery.