1. Field of the Invention
Exemplary embodiments of the present invention relate to data storage systems, and, more specifically, to data storage systems that store snapshots indicating the status of stored data at particular points in time.
2. Description of Background
Many data storage systems organize stored data according to a file metaphor. In these storage systems, related data are stored in a file, and the data storage system stores multiple files. The data storage system then stores references to the multiple files to enable access to the data in those files. A single file may be stored in contiguous or disparate locations in the data storage device. Storage of data in disparate locations in a data storage device often results when a large data file is to be stored on a device that already stores many files and the large data file must be broken up into data blocks to fit in the free areas within the storage device. Data are also often stored in disparate locations when additional data are added to an existing file. The assembly of stored data into structured files on a data storage device is referred to as a file system.
Data storage systems often store point-in-time copies or images of the data of all files that are currently stored in the file system. These images are referred to as snapshots (or clones or flash-copies). The content of a snapshot is the data that is stored within the active file system at the time the snapshot was captured. Data storage systems can use snapshots to store the state of the file system on a secondary storage system such as another disk drive or magnetic tape storage system. Data storage systems can also use file system snapshots to enable recreation of data that has been deleted (that is, to access previous versions of files that have been deleted or updated).
To minimize the time to create a snapshot as well as the storage space for maintaining the snapshot, some methods for taking snapshots of a file system defer the actual copying of the data in the original file system to the snapshot until the data in the original file system is modified (for example, overwritten or deleted). Because the data is not copied to the snapshot until a write is performed on the original data, systems employing methods of this type are referred to as “copy-on-write” systems. Copy-on-write techniques are often used to implement file versioning, which provides for the concurrent existence of several versions of files in a file system by maintaining snapshots of individual files rather than the whole system.
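By way of illustration only, the deferred copying described above can be sketched as follows. The class name CowVolume, its methods, and the representation of a snapshot as a mapping from block index to preserved data are assumptions made for this sketch and do not describe any particular file system.

```python
class CowVolume:
    """Toy block store with copy-on-write snapshots (hypothetical, illustrative only)."""

    def __init__(self, blocks):
        self.blocks = list(blocks)  # active file system data, one entry per block
        self.snapshots = []         # each snapshot maps block index -> preserved data

    def take_snapshot(self):
        # Creating a snapshot copies no data; it starts as an empty mapping.
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, index, data):
        # Before overwriting, preserve the old contents in every snapshot
        # that has not already saved its own copy of this block.
        for snap in self.snapshots:
            if index not in snap:
                snap[index] = self.blocks[index]
        self.blocks[index] = data

    def read_snapshot(self, snap_id, index):
        # Blocks not modified since the snapshot are read from the active data.
        snap = self.snapshots[snap_id]
        return snap.get(index, self.blocks[index])
```

In this sketch, taking a snapshot is a constant-time operation, and the snapshot consumes storage only for blocks that are later overwritten in the active data.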
Copy-on-write systems can utilize metadata, which are control structures created by the file system software to describe the structure of a file and the use of the disks that contain the file system, so that non-modified data blocks of a modified file need not be copied to the snapshot. These systems create snapshot metadata sets that include file references that describe the locations of the original data file in the original file system so that the non-modified data blocks can be referenced from metadata within both the original file and the snapshot copy of the file. This creates multiple references to the same data block in the original file system: the reference in the metadata of the original file system and the references in each of the snapshot data sets.
The existence of multiple references to a single data block within the original file system impacts the requirements of the original file system. File systems that utilize snapshots that each store a reference to an original data block must maintain an indication or mapping of each reference to that data block in order to determine if the data block is in-use or free. Without multiple references, a single bit may be sufficient to indicate if a data block is in-use or free. With the multiple references, however, multiple bits may be required to track the multiple references and ensure that no references exist to the data block prior to declaring the data block “free.”
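One possible way to track multiple references to a data block is a per-block reference count, sketched below. The use of a counter is an assumption chosen for illustration (a per-snapshot bitmap would be another option), and the names BlockAllocator, add_reference, drop_reference, and is_free are hypothetical.

```python
from collections import Counter


class BlockAllocator:
    """Tracks how many references (active file plus snapshots) point at each block."""

    def __init__(self):
        self.refcount = Counter()  # block number -> number of references

    def add_reference(self, block):
        # Called when the active file system or a snapshot starts referencing a block.
        self.refcount[block] += 1

    def drop_reference(self, block):
        # Called when a file or snapshot that referenced the block is deleted or updated.
        if self.refcount[block] == 0:
            raise ValueError(f"block {block} has no references to drop")
        self.refcount[block] -= 1

    def is_free(self, block):
        # A block may be declared free only when no references to it remain.
        return self.refcount[block] == 0
```

With a single reference per block, the count degenerates to the one-bit in-use/free indication; with snapshots, the block remains in use until the last reference is dropped.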
Because higher-speed storage devices (such as hard disk drive arrays) are more expensive (per byte stored) than slower devices (such as optical discs and magnetic tape drives), some larger file systems employ a Hierarchical Storage Manager (HSM) to automatically move data between high-cost and low-cost storage media. In a file system using an HSM (such as, for example, IBM's ADSTAR Distributed Storage Manager, Tivoli's Storage Manager Extended Edition, or Legato's NetWorker), most of the file system data is stored on slower offline devices and copied to faster online disk drives as needed. An HSM monitors the use of data in a file system, identifies which files in a file system have not been accessed for long periods of time, and migrates all or some of their data to slower storage devices. This frees space in the faster online storage, thereby allowing additional files and more data to be stored. In effect, an HSM provides an economical solution for storing large amounts of data by turning faster disk drives into caches for the slower mass storage devices.
In a typical HSM scenario, data files that are frequently used are stored on hard disk drives, while data files that are not used for a certain period of time are migrated to magnetic tape drives. When a user attempts to access a data file that has been migrated to tape, the file is automatically and transparently restored to online hard disk drives, allowing the operation to complete as if the data had never been migrated. The advantage is that the total amount of stored data can be much larger than the capacity of the available disk storage, yet, because only rarely used files are on tape, users will typically not notice any slowdown.
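The migration and transparent recall behavior described above can be sketched, under simplifying assumptions, as follows. The class ToyHsm, its idle_threshold parameter, and the use of dictionaries to stand in for disk and tape are hypothetical and are not drawn from any of the HSM products named above; in particular, this sketch moves the data on recall rather than retaining a tape copy.

```python
class ToyHsm:
    """Minimal model of an HSM that migrates idle files and recalls them on access."""

    def __init__(self, idle_threshold):
        self.idle_threshold = idle_threshold  # idle time after which a file is migrated
        self.disk = {}         # file name -> data held in fast online storage
        self.tape = {}         # file name -> data held in slow offline storage
        self.last_access = {}  # file name -> time of most recent access

    def store(self, name, data, now):
        self.disk[name] = data
        self.last_access[name] = now

    def migrate_idle(self, now):
        # Move every file that has been idle for at least the threshold to tape.
        migrated = []
        for name in list(self.disk):
            if now - self.last_access[name] >= self.idle_threshold:
                self.tape[name] = self.disk.pop(name)
                migrated.append(name)
        return migrated

    def read(self, name, now):
        # Transparent recall: a migrated file is restored to disk before the read.
        if name not in self.disk:
            self.disk[name] = self.tape.pop(name)
        self.last_access[name] = now
        return self.disk[name]
```

The caller of read does not need to know whether the file was resident on disk or had been migrated, which mirrors the transparency described above.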
The inventors herein have recognized that, in file systems that utilize snapshots, the need to track multiple references to a single data block can significantly complicate the operation of the file system, particularly if the file system also employs an HSM. For instance, when an HSM migrates a file to tape, it expects to be able to reclaim the disk space occupied by the file's data blocks. In the presence of snapshots, however, these data blocks may still be referenced by snapshots from older versions of the file and, therefore, cannot be freed until all other versions of the file have been migrated to tape as well. Moreover, while the HSM can reclaim all disk space occupied by the file once all versions of a file have been migrated, data blocks that had been stored only once on disk and shared by snapshots of different file versions will be stored redundantly as separate copies on tape. That is, snapshots that occupy very little space on disk can occupy as much space on tape as the entire file system. Additionally, when the HSM returns the migrated file to online storage, new data blocks will be allocated for the returned data and the other online references to the original blocks cannot be located. As a result, restoring a migrated file may result in unnecessary copying of the data as well as require more online storage than is required for files that have never been migrated.
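The redundancy described above can be illustrated with a small calculation. The function names, block labels, and the two-version example are hypothetical and assume that each version of a file is migrated to tape as an independent, complete copy.

```python
def disk_blocks_used(versions):
    """Distinct blocks on disk: a block shared by several versions is stored once."""
    return len({block for blocks in versions.values() for block in blocks})


def tape_blocks_used(versions):
    """Blocks written to tape when every version is migrated as a separate copy."""
    return sum(len(blocks) for blocks in versions.values())


# Hypothetical file with four blocks; only the last block differs between
# the active version and the snapshot version.
versions = {
    "active": ["b0", "b1", "b2", "b3_new"],
    "snapshot": ["b0", "b1", "b2", "b3_old"],
}

print(disk_blocks_used(versions))  # 5
print(tape_blocks_used(versions))  # 8
```

In this example the snapshot costs one additional block on disk but a full additional copy of the file on tape, because the sharing of unmodified blocks is lost on migration.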
Accordingly, the inventors herein have recognized a need to provide for efficient hierarchical storage management within a file system that utilizes snapshots.