1. Field of the Invention
The present invention relates generally to backup storage systems, and in particular to reference lists used to facilitate resource reclamation in deduplication based storage systems.
2. Description of the Related Art
Organizations are accumulating and storing immense amounts of electronic data. As a result, backup storage systems are increasing in size and consuming large quantities of resources. To cope with storing ever-increasing amounts of data, deduplication has become an important feature for maximizing storage utilization in backup storage systems. In a typical deduplication system, files are partitioned into data segments and redundant data segments are deleted from the system. Then, the unique data segments are stored as segment objects in the backup storage medium. As the number of stored segment objects increases, the management of the segment objects requires an increasing share of system resources, which can impact the overall efficiency and performance of the deduplication system.
A deduplication based system aims to reduce the amount of storage capacity required to store large amounts of data. Deduplication techniques have matured to the point where they can achieve significant reductions in the quantity of data stored. However, while such techniques may reduce the required storage space, the number of segment objects stored in the system may nevertheless continue to increase. As deduplication systems scale up to handle higher data loads, the management and indexing of the segment objects may become an important factor that affects performance of the systems.
Typically, segment objects have a small size, as small as 4 Kilobytes (KB) in some systems. For a system storing 400 Terabytes (TB) of data, with all segment objects of size 4 KB, 100 billion segment objects would be maintained. As storage requirements grow, the increase in the number of segment objects may create unacceptable management overhead. Therefore, a highly scalable management system is needed to efficiently store and manage large quantities of segment objects.
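The arithmetic behind the figure above can be checked with a short calculation. The following is a minimal sketch that assumes decimal units (1 KB = 10^3 bytes, 1 TB = 10^12 bytes); the function name `segment_count` is hypothetical and chosen only for illustration.

```python
# Assumption: decimal units, as implied by the round figure of 100 billion.
KB = 10**3
TB = 10**12


def segment_count(total_bytes, segment_bytes):
    """Return how many fixed-size segment objects are needed for the data."""
    return total_bytes // segment_bytes


count = segment_count(400 * TB, 4 * KB)
print(f"{count:,}")  # prints 100,000,000,000
```

With binary units (1 KB = 2^10 bytes, 1 TB = 2^40 bytes) the same calculation gives roughly 107 billion segment objects, so the order of magnitude is unchanged.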
A particularly challenging issue involves reclaiming resources after a file is deleted from the system. When a file is deleted, the segment objects that make up the file cannot simply be deleted as there is the possibility that some other file stored by the system references one or more of those same segment objects. Only if no other files use those segment objects can they be deleted. Some form of management is needed to keep track of the segment objects and all of the files that use the segment objects. There are a variety of techniques used to manage the segment objects and the files that point to them, most of which may work reasonably well when operating on a small scale. However, many of these approaches may not be efficient when dealing with a large number of segment objects.
One technique used to facilitate resource reclamation is reference counting for segment objects. A reference count stores a value indicating how many files point to, or use, a given segment object. A segment object's reference count is incremented every time it is used by a file, and decremented when a file using the segment object is deleted. When the count drops to zero, the segment object may be reclaimed.
Reference counting has several limitations which make it unsuitable for deduplication. One limitation is that any lost or repeated update will incorrectly change the count. If the count is accidentally reduced, the segment may be deleted while it is still being used by at least one file. If the count is accidentally increased, then the segment may never be deleted even after all of the files using it are deleted from the system.
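The failure mode described above can be illustrated with a small example. The following is a minimal sketch, not an implementation from any particular system; the class `RefCountStore`, its methods, and the identifier `"s1"` are hypothetical names introduced for illustration.

```python
class RefCountStore:
    """Hypothetical per-segment reference counter (illustration only)."""

    def __init__(self):
        self.counts = {}  # segment id -> number of files using the segment

    def add_reference(self, segment_id):
        self.counts[segment_id] = self.counts.get(segment_id, 0) + 1

    def remove_reference(self, segment_id):
        """Decrement the count; return True if the segment is reclaimed."""
        self.counts[segment_id] -= 1
        if self.counts[segment_id] == 0:
            del self.counts[segment_id]
            return True
        return False


store = RefCountStore()
store.add_reference("s1")  # file A uses segment s1
store.add_reference("s1")  # file B uses segment s1

first = store.remove_reference("s1")     # file A is deleted; count is now 1
replayed = store.remove_reference("s1")  # the same delete is applied again

# The counter cannot tell that the second update is a repeat, so the
# segment is reclaimed even though file B still uses it.
print(first, replayed)  # prints False True
```

Because the counter records only a number, it has no way to detect that the second decrement repeats an update that was already applied.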
A further shortcoming of reference counting is that it does not allow for identifying which files use a given segment object. If a segment object gets corrupted, the backup system would need to know which files are using it, so that those files can be requested again to recover the corrupted data. However, reference counting does not maintain a listing of which files are using each particular segment object, making recovery of corrupted data more difficult.
Another tool that can be used to facilitate resource reclamation is a reference list. Maintaining a reference list does not suffer from the inherent shortcomings of reference counting. A reference list may have greater immunity to mistaken updates, since the list can be searched to see if an add or remove operation has already been performed. Also, reference lists have the capability to identify which files are using each segment object. However, a reference list is not readily scalable to handle a large number of segment objects. Traditionally, a reference list is managed at a fine level according to each segment object that is stored. As the number of segment objects increases, updating the reference list may take a longer period of time, which may slow down system performance. What is needed is a new method for maintaining a reference list that can efficiently manage large numbers of segment objects.
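The properties of a traditional, per-segment reference list can be illustrated as follows. This is a minimal sketch under the assumption that each segment object keeps a set of identifiers of the files using it; the class `RefListStore`, its methods, and the identifiers shown are hypothetical.

```python
class RefListStore:
    """Hypothetical fine-grained reference list (illustration only)."""

    def __init__(self):
        self.refs = {}  # segment id -> set of ids of files using the segment

    def add_reference(self, segment_id, file_id):
        # Adding the same file twice has no further effect.
        self.refs.setdefault(segment_id, set()).add(file_id)

    def remove_reference(self, segment_id, file_id):
        """Remove a file's reference; return True if the segment is reclaimed."""
        files = self.refs.get(segment_id)
        if files is None:
            return False  # segment already reclaimed or never stored
        files.discard(file_id)  # a repeated remove is harmless
        if not files:
            del self.refs[segment_id]
            return True
        return False

    def files_using(self, segment_id):
        """Return the ids of the files that use the given segment."""
        return set(self.refs.get(segment_id, ()))


store = RefListStore()
store.add_reference("s1", "fileA")
store.add_reference("s1", "fileB")

first = store.remove_reference("s1", "fileA")     # file A is deleted
replayed = store.remove_reference("s1", "fileA")  # repeated update is ignored

print(first, replayed)          # prints False False
print(store.files_using("s1"))  # prints {'fileB'}
```

The sketch keeps one entry per segment object, which is the fine-grained arrangement whose update cost grows with the number of segment objects stored.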
In view of the above, improved methods and mechanisms for managing reference lists in a deduplication system are desired.