Some storage systems, such as backup or shared storage service systems, may store large amounts of data from many different computers. For example, a backup storage system may regularly receive data from many different end-point computer systems (e.g., desktops in an organization) and store backup copies of that data in a data store. Each source computer may insert files (or file segments) into the storage system by request and later remove any of the files or segments by subsequent request. As used herein, the term file may refer to files, file segments, or any other data that a source computer system may insert into a storage system.
Files normally include respective file attributes (e.g., filename, path, size, owner, modification/access history, permissions, etc.) and file content (i.e., file data). While each file may have unique file attributes that uniquely identify that file, multiple files inserted into a storage system may have the same content portion (or overlapping content portions). Storing multiple files with the same file content is redundant and is an inefficient use of limited storage resources.
To reduce the amount of storage space required when storing multiple files with the same or overlapping content, a storage system may be configured to employ data deduplication techniques. When multiple files with the same content are inserted into a deduplicated storage system (DSS), the DSS may store one or more file attributes for each file but only a single copy of the file content that is common to the multiple files. The single copy of the file content stored by the DSS may be referred to herein as a data object. The DSS may maintain metadata indicating that the multiple files each correspond to the same stored data object.
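The insert path described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that identical content is recognized by comparing SHA-256 digests and that both the data store and the metadata are in-memory mappings; the class name `DedupStore` and its fields are hypothetical and are not taken from any particular DSS.

```python
import hashlib


class DedupStore:
    """Toy in-memory deduplicated storage system (illustrative only)."""

    def __init__(self):
        # fingerprint -> content bytes; one data object per distinct content
        self.objects = {}
        # (source, path) -> fingerprint of the data object holding the content
        self.metadata = {}

    def insert(self, source, path, content):
        # Assumption: equal SHA-256 digests are treated as equal content.
        fingerprint = hashlib.sha256(content).hexdigest()
        # Store the content only if no identical data object exists yet.
        self.objects.setdefault(fingerprint, content)
        # Always record the per-file attributes (here only source and path).
        self.metadata[(source, path)] = fingerprint
        return fingerprint


store = DedupStore()
a = store.insert("desktop-1", "/docs/report.txt", b"quarterly numbers")
b = store.insert("desktop-2", "/home/copy.txt", b"quarterly numbers")
```

In this sketch, the two inserted files keep separate metadata entries but map to the same fingerprint, so only one data object is held in `store.objects`.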
When a source computer requests that a given file be removed from the DSS, the DSS may update the metadata to remove the relationship between the source computer and the stored data object. When the metadata indicates that no source computer has a relationship to the stored data object (i.e., every source computer that has requested that the file be inserted has subsequently requested that the file be removed), the stored data object becomes expired (i.e., no longer needed by any source computer) and the DSS may delete the expired data object from the data store.
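The expiry condition described above may be sketched as follows. This is only one possible way to track the relationships, assuming the metadata is kept as a set of source identifiers per data object; the names `ReferenceTable`, `insert`, and `remove` are hypothetical.

```python
class ReferenceTable:
    """Tracks which sources still need each stored data object (illustrative)."""

    def __init__(self):
        # object_id -> set of source ids that have a relationship to the object
        self.refs = {}

    def insert(self, source, object_id):
        # Record that this source needs the data object.
        self.refs.setdefault(object_id, set()).add(source)

    def remove(self, source, object_id):
        """Drop one relationship; return True if the object is now expired."""
        holders = self.refs.get(object_id)
        if holders is None:
            return False  # unknown object: nothing to expire
        holders.discard(source)
        if not holders:
            # No source has a relationship to the object any more.
            del self.refs[object_id]
            return True
        return False


table = ReferenceTable()
table.insert("desktop-1", "obj-42")
table.insert("desktop-2", "obj-42")
first = table.remove("desktop-1", "obj-42")   # desktop-2 still needs the object
second = table.remove("desktop-2", "obj-42")  # last relationship removed
```

Here the data object becomes expired only on the second removal, once every source that inserted it has also asked for it to be removed.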
A DSS therefore requires an efficient and correct method for detecting expired data objects. Traditional approaches to detecting expired data objects often rely on mark-and-sweep techniques. In a mark phase, a traditional DSS traverses the metadata records and, for each inserted file, finds and marks the corresponding stored data object. In a subsequent sweep phase, the DSS traverses the stored data objects and deletes each one that is not marked. However, mark-and-sweep techniques are time-consuming and result in expired data objects remaining in the data store for considerable periods between sweeps.
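The two phases of a traditional mark-and-sweep pass can be sketched as below. The representation is an assumption made for illustration: metadata records are `(file_id, object_id)` pairs and the data store is a dictionary keyed by object identifier; the function name `mark_and_sweep` is hypothetical.

```python
def mark_and_sweep(metadata, objects):
    """Delete unreferenced data objects in place and return their ids.

    metadata: iterable of (file_id, object_id) records for inserted files.
    objects:  dict mapping object_id -> stored content.
    """
    # Mark phase: visit every metadata record and mark the object it uses.
    marked = set()
    for _file_id, object_id in metadata:
        marked.add(object_id)

    # Sweep phase: visit every stored object and delete the unmarked ones.
    expired = [object_id for object_id in objects if object_id not in marked]
    for object_id in expired:
        del objects[object_id]
    return expired


objects = {"o1": b"x", "o2": b"y", "o3": b"z"}
metadata = [("f1", "o1"), ("f2", "o1"), ("f3", "o3")]
deleted = mark_and_sweep(metadata, objects)
```

Each pass in this sketch visits every metadata record and every stored data object, regardless of how few objects have actually expired, and an expired object such as `o2` stays in the data store until the next pass runs.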