1. Field of the Invention
The present invention relates in general to the field of data processing systems and, more particularly, to managing data in a networked data processing system environment incorporating a single-instance-storage volume.
2. Description of the Related Art
An ever-increasing reliance on information, and on the computing systems that produce, process, distribute, and maintain such information in its various forms, continues to put great demands on techniques for providing data storage and access to that storage. Business organizations can produce and retain large amounts of data. While data growth is not new, the pace of data growth has become more rapid, the location of data more dispersed, and linkages between data sets more complex.
Generally, a data deduplication system provides a mechanism for storing a piece of information only one time. Thus, in a backup scenario, if a piece of information is stored in multiple locations within an enterprise, that piece of information will be stored only one time in a deduplicated backup storage area. Similarly, if the piece of information does not change during a subsequent backup, that piece of information will not be duplicated in storage as long as it continues to be stored in the deduplicated backup storage area. Data deduplication can also be employed outside of the backup context, thereby reducing the amount of active storage occupied by duplicate files.
The storage area of a data deduplication system is called a single-instance-storage (SIS) volume. Sets of data segments are stored in the SIS volume. As each new set of data segments is stored, the previously stored sets are checked for duplicate data segments. Duplicate data segments in a new set are not stored again in the SIS volume, but are instead represented by pointers to the previously stored data segments. As more sets of data segments are stored, the physical locations of the stored data representing those segments tend to become scattered across the SIS volume, because later sets point back to segments written at earlier times. While this use of pointers saves space by eliminating duplicate data segments, the resultant physical scattering of stored data creates inefficiencies for data retrieval, because access to later-stored sets of data segments from the SIS volume involves a large number of random disk accesses. Thus, there is a need for a mechanism to enable more efficient data retrieval from a SIS volume.
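By way of illustration only, the scattering effect described above can be sketched with a toy model of a SIS volume. The class name `SISVolume`, the helper `count_seeks`, the use of SHA-256 fingerprints to detect duplicate segments, and the append-only list standing in for physical storage are all assumptions made for this sketch; the description above does not specify how duplicates are detected or how segments are laid out on disk.

```python
import hashlib


class SISVolume:
    """Toy single-instance store (hypothetical): each unique segment is kept once."""

    def __init__(self):
        self.segments = []  # stand-in for physical storage; list index = physical location
        self.index = {}     # fingerprint -> physical location of the stored segment

    def store(self, data_segments):
        """Store a set of segments and return its pointers (physical locations)."""
        pointers = []
        for seg in data_segments:
            # Assumption: duplicates are detected by a SHA-256 fingerprint.
            fp = hashlib.sha256(seg).hexdigest()
            loc = self.index.get(fp)
            if loc is None:
                # New segment: append it and remember where it lives.
                loc = len(self.segments)
                self.segments.append(seg)
                self.index[fp] = loc
            # Duplicate segments contribute only a pointer, not new data.
            pointers.append(loc)
        return pointers

    def retrieve(self, pointers):
        """Read a set of segments back by following its pointers."""
        return [self.segments[p] for p in pointers]


def count_seeks(pointers):
    """Count non-sequential jumps between consecutive physical locations."""
    return sum(1 for a, b in zip(pointers, pointers[1:]) if b != a + 1)


vol = SISVolume()
first = vol.store([b"A", b"B", b"C", b"D"])
second = vol.store([b"D", b"X", b"B", b"Y", b"A"])
print(first, count_seeks(first))
print(second, count_seeks(second))
```

In this sketch the first set is laid out contiguously, while the second set, which reuses three previously stored segments, is read back through pointers that jump between old and new locations. The count of non-sequential jumps is only a rough proxy for the random disk accesses mentioned above.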