Hierarchical storage management (HSM) applications manage the efficient storage of data by migrating files among a hierarchy of storage devices in a storage repository, such as online storage and various forms of offline storage. HSM-managed files are paired objects: a local stub file is paired with an associated object on the server that contains the file contents. The local stub file remains in place of a migrated file after the file is migrated from its original storage location to a different storage location.
The stub file contains all of the required file attributes, together with the information used to locate the migrated file contents in the repository storage location, such as an index to a database that contains locator information. The HSM application is responsible for keeping the stub file updated in the event that the migrated file contents are moved to a new repository storage location.
The file system uses the stub file to access the original file contents, keeping subsequent migration and premigration of the file contents transparent to the users and applications accessing the file. It is the job of an HSM application to process requests to access migrated files. The HSM application uses the information and pointers stored in the stub file to recall the migrated file contents from the storage repository and deliver them to the original storage location.
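The stub and recall relationship described above can be illustrated with a minimal sketch. All names here (`StubFile`, `Repository`, `migrate`, `recall`, and the `object_id` field) are hypothetical and chosen only for illustration; they do not correspond to any particular HSM product, and a real stub would carry many more attributes.

```python
from dataclasses import dataclass


@dataclass
class StubFile:
    """Hypothetical local stub left in place of a migrated file."""
    path: str        # location of the stub in the local file system
    size: int        # one example of a retained file attribute
    object_id: str   # assumed unique identifier used to locate the server object


class Repository:
    """Toy in-memory stand-in for the server-side storage repository."""

    def __init__(self):
        self._objects = {}

    def store(self, object_id, data):
        self._objects[object_id] = data

    def fetch(self, object_id):
        # Raises KeyError if the server object no longer exists.
        return self._objects[object_id]


def migrate(path, data, object_id, repo):
    """Move the contents to the repository and return the stub that replaces them."""
    repo.store(object_id, data)
    return StubFile(path=path, size=len(data), object_id=object_id)


def recall(stub, repo):
    """Use the locator held in the stub to bring the contents back."""
    return repo.fetch(stub.object_id)
```

In this sketch the caller only ever holds the stub; `recall` resolves the contents through the identifier stored in it, which is what keeps migration transparent to the application.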
Over time, often as a result of data loss following system crashes or system configuration changes, the pairing of the local stub and its associated server object may be broken, resulting in file system client orphans and server object orphans. A server orphan is created by the deletion of the stub file in the file system on the HSM client side, which results in a server object no longer referenced by the file system HSM client. A file system client orphan is created by deletion of the corresponding object on the server side, which results in the stub file no longer having an associated server object.
Storage systems utilize data backup of the file system to reduce data loss, but restoring data from a backup does not eliminate file system corruption. The contents of the file system almost always change in the time period after a backup and before the restore, resulting in inconsistencies within the HSM file system.
HSM-controlled storage environments require periodic system checks and routine maintenance to ensure file system integrity. Following a system crash, or when any portion of the file system has been restored from a backup, the file system should be verified for data consistency. Each stub file must be verified to point to a valid server object, and each server object must be verified to have a corresponding stub file. Checking for both client and server orphans is called a two-way orphan check. Because the file system would not contain pointer information to access orphaned server objects, server orphans should be removed to avoid wasting storage space. Likewise, if there is no server object associated with the local stub file, the stub file should be flagged as a client orphan. The orphan checking process is very time-consuming and requires significant system resources.
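As a rough illustration of what a two-way orphan check computes, the following sketch compares two collections of identifiers. It assumes, purely for illustration, that every stub file and every server object can be reduced to a shared unique identifier; the function name `two_way_orphan_check` is hypothetical. This naive version holds both identifier sets in memory at once.

```python
def two_way_orphan_check(stub_ids, server_ids):
    """Return (client_orphans, server_orphans) for two collections of IDs.

    client_orphans: IDs referenced by a stub file with no server object.
    server_orphans: IDs of server objects with no stub file referencing them.
    """
    stub_set = set(stub_ids)
    server_set = set(server_ids)
    client_orphans = sorted(stub_set - server_set)  # stubs to be flagged
    server_orphans = sorted(server_set - stub_set)  # objects to be removed
    return client_orphans, server_orphans
```

For example, with stubs `a`, `b`, `c` and server objects `b`, `c`, `d`, the stub `a` would be flagged as a client orphan and the object `d` would be a candidate for removal.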
To avoid long-running file system scans, existing systems take various approaches to checking for both file system client orphans and server orphans. In some existing systems, client orphans and server orphans are identified serially, with each class of orphan identified in an independent process. In other existing systems, the independent orphan identification processes may be executed in parallel, but not in a single-pass process.
Another limitation in existing systems is that the methods used to identify orphans require caching of the entire list of file object pointers. As the number of files in the file system becomes larger, the size of the cache and the system resources required to manage the cache grow proportionally, rendering existing orphan checking systems practically infeasible for file systems containing billions of files.
Finally, existing systems are limited in their ability to identify orphans early in the orphan checking process. Early identification would allow parallel execution of tasks for potential recovery steps associated with the identified orphan. Existing systems require the entire list of file object pointers to be processed before it can be determined that an orphan exists.
What is needed in the art is a way to identify both server orphans and file system client orphans in extremely large file systems in a single pass. Further, the identification routine must be capable of matching a list of migrated files generated from the file system information with a list of migrated files generated from the storage repository information, where each list is sorted by a field such as a unique migrated file identifier.
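One way to picture the single-pass matching of two sorted lists is a merge-style comparison, sketched below. This is only an illustration under stated assumptions, not a description of any specific implementation: it assumes both inputs are iterables sorted in ascending order by the same unique migrated file identifier, and the name `find_orphans` and the result labels are hypothetical.

```python
def find_orphans(client_ids, server_ids):
    """Yield ('client_orphan', id) or ('server_orphan', id) in one pass.

    Both inputs must be sorted ascending by the same unique identifier.
    Only the current element of each input is held at any time.
    """
    end = object()  # sentinel marking an exhausted input
    client_iter, server_iter = iter(client_ids), iter(server_ids)
    c = next(client_iter, end)
    s = next(server_iter, end)
    while c is not end or s is not end:
        if s is end or (c is not end and c < s):
            # The stub's ID is absent from the server list.
            yield ("client_orphan", c)
            c = next(client_iter, end)
        elif c is end or s < c:
            # The server object's ID is absent from the file system list.
            yield ("server_orphan", s)
            s = next(server_iter, end)
        else:
            # IDs match: the pair is intact, advance both lists.
            c = next(client_iter, end)
            s = next(server_iter, end)
```

Because the function is a generator, each orphan is reported as soon as the comparison reaches it, so a consumer could begin handling it before the rest of either list has been read. Memory use does not depend on the length of the lists, since neither list is cached.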