If a software error corrupts a data object, or if erroneous data updates the data object, a data protection administrator may restore the data object to a previous state that does not include the corrupted or erroneous data. A backup/restore application executes a backup operation either occasionally or continuously to enable this restoration, storing a copy of each desired data object state (such as the values of data and these values' embedding in a database's data structures) within dedicated backup files. When the data protection administrator decides to return the data object to a previous state, the data protection administrator specifies the desired previous state by identifying a desired point in time when the data object was in this state, and then instructs the backup/restore application to execute a restore operation to restore a copy of the corresponding backup file(s) for that state to the data object.
An object that is stored in a computer system may be represented by a data structure, such as the tree structure 100 depicted by FIG. 1. A computer system can divide an object into smaller objects, such as dividing a file into file segments. Examples of file segments include a super segment 102, or level 6 (L6) segment, which may be at the root of the tree structure 100; metadata segments 104, or level 5 (L5) segments to level 1 (L1) segments, which may be intermediate nodes in the tree structure 100; and data segments 106, or level 0 (L0) segments, which are the leaf nodes of the tree structure 100. The level 6 (L6) segments to level 1 (L1) segments may be referred to as level P (Lp) segments. Although this example describes the tree structure 100 as having 7 (L0-L6) levels, the tree structure 100 may have any number of levels.
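The segment tree described above can be illustrated with a minimal sketch. The `Segment` class, the `leaves` helper, and the three-level toy tree are assumptions made for illustration only; the example in the text uses seven levels (L0-L6).

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    level: int          # 0 for an L0 data segment, 1 or higher for an Lp metadata segment
    name: str
    children: list = field(default_factory=list)

def leaves(segment):
    """Return the names of the L0 data segments reachable from this segment."""
    if segment.level == 0:
        return [segment.name]
    found = []
    for child in segment.children:
        found.extend(leaves(child))
    return found

# Hypothetical three-level tree: one root, two metadata segments, three data segments.
root = Segment(2, "super", [
    Segment(1, "meta-1", [Segment(0, "A"), Segment(0, "B")]),
    Segment(1, "meta-2", [Segment(0, "C")]),
])
```

Calling `leaves(root)` walks from the root through the intermediate nodes down to the data segments at the leaves.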
Each object may be referenced by its fingerprint, which is a relatively short bit string that uniquely identifies the object. For example, FIG. 2A depicts the file segments 200 that can be referenced by the fingerprints 202, such as the fingerprint G 204 that uniquely identifies the file segment G 206.
A garbage collector generally refers to an organizer of storage for retrievable data in a computer system. A garbage collector automatically identifies a computer system's objects, identifies which of these objects are live objects, which are the objects that are in use by at least one of the computer system's programs, and reclaims the storage occupied by dead objects, which are the objects that are no longer in use by any of the computer system's programs. A garbage collector can begin by executing what may be referred to as a merge phase, which includes storing an index of unique identifiers of a computer system's objects, such as by storing an index of fingerprints for file segments to a disk. The fingerprint index can map each fingerprint to the object storage, which may be referred to as a container, that stores the file segment which is uniquely identified by the fingerprint. For example, the fingerprint index that includes the fingerprint G 204 also maps the fingerprint G 204 to the container that stores the file segment G 206.
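A fingerprint index of the kind built in the merge phase might be sketched as follows. SHA-1 is an assumption standing in for whatever fingerprint function a system actually uses, and the container contents are hypothetical.

```python
import hashlib

def fingerprint(segment: bytes) -> str:
    # Assumption: SHA-1 stands in for the system's actual fingerprint function.
    return hashlib.sha1(segment).hexdigest()

# Hypothetical container contents: container ID -> file segments stored there.
containers = {
    130: [b"segment X"],
    140: [b"segment G", b"segment H"],
}

# Merge phase (sketch): map every fingerprint to the container that stores its segment.
fingerprint_index = {
    fingerprint(segment): container_id
    for container_id, segments in containers.items()
    for segment in segments
}
```

Looking up `fingerprint_index[fingerprint(b"segment G")]` then yields the container that stores that file segment.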
The garbage collector can continue by executing what may be referred to as an analysis phase, which includes applying a hash function to each fingerprint in the fingerprint index to generate a one-dimensional array that may be referred to as a hash vector, such that the positions in the hash vector correspond to the fingerprints that uniquely identify their file segments. For example, FIG. 2B depicts that the garbage collector applies the hash function 208 to the fingerprints 210 to generate the hash vector 212. Consequently, the bit 214 in the hash vector 212 corresponds to the hash, which is the value returned by a hash function, of the fingerprint G 216, which is the fingerprint G 204 that uniquely identifies the file segment G 206 in FIG. 2A. Although the example describes a computer system as having 7 file segments, fingerprints, and corresponding bits in the hash vector, a computer system may have any number of file segments, fingerprints, and corresponding bits in the hash vector.
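The analysis phase can be illustrated with a toy sketch in which each known fingerprint receives its own position in the hash vector. The dictionary-based `build_perfect_hash` is an assumption chosen for readability; a production system would be expected to compute a compact hash function rather than store a lookup table, and the fingerprint strings are placeholders.

```python
def build_perfect_hash(fingerprints):
    """Assign each known fingerprint its own position (a trivially collision-free mapping)."""
    return {fp: position for position, fp in enumerate(sorted(fingerprints))}

# Seven placeholder fingerprints, matching the seven file segments in the example.
fingerprints = ["fp-A", "fp-B", "fp-C", "fp-D", "fp-E", "fp-F", "fp-G"]

position_of = build_perfect_hash(fingerprints)
hash_vector = [0] * len(position_of)   # one bit per fingerprint, all initially clear
```

In this sketch `position_of["fp-G"]` plays the role of the hash of the fingerprint G, and `hash_vector[position_of["fp-G"]]` plays the role of the corresponding bit.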
The garbage collector can continue by executing what may be referred to as an enumeration phase, which includes identifying the live objects, and then indicating these identifications in the bits of the hash vector that correspond to the objects' unique identifiers. For example, the garbage collector conducts a level-by-level review of the metadata segments 104 to identify their L0 and Lp references, which include the fingerprints of the live L0 data segments 106 and the live Lp metadata segments 104, each of which is in use by at least one of the computer system's programs. Then the garbage collector can continue the enumeration phase by applying the hash function 208 to these identified fingerprints to create hashes, and then setting the bits in the hash vector that correspond to these hashes, such as setting some of the bits 302 to 1 in the perfect hash vector 304 depicted by FIG. 3.
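A level-by-level enumeration of this kind might look like the following sketch. The `references` table, the fingerprint names, and the dictionary that stands in for the hash function are all hypothetical.

```python
from collections import deque

# Hypothetical metadata: each Lp fingerprint lists the fingerprints it references.
references = {
    "fp-L2": ["fp-L1a", "fp-L1b"],
    "fp-L1a": ["fp-A", "fp-B"],
    "fp-L1b": ["fp-C"],
}
all_fingerprints = ["fp-L2", "fp-L1a", "fp-L1b", "fp-A", "fp-B", "fp-C", "fp-D"]
position_of = {fp: i for i, fp in enumerate(all_fingerprints)}  # stand-in hash function
hash_vector = [0] * len(all_fingerprints)

def enumerate_live(roots):
    """Walk the references breadth first and set the bit of every fingerprint reached."""
    queue = deque(roots)
    while queue:
        fp = queue.popleft()
        if hash_vector[position_of[fp]]:
            continue                       # already marked through another reference
        hash_vector[position_of[fp]] = 1
        queue.extend(references.get(fp, []))

enumerate_live(["fp-L2"])
```

After the walk, every fingerprint reachable from the root has its bit set, while the unreferenced `fp-D` keeps a clear bit and is therefore treated as dead.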
The garbage collector can continue by executing what is referred to as a selection phase, which includes estimating how much of the data storage in each container is for live objects. For example, the garbage collector identifies the fingerprints for the L0 data segments in the container 140, applies the hash function 208 to these identified fingerprints to create hashes, and then checks the bits in the perfect hash vector 304 that correspond to these hashes. If the bit for a fingerprint's hash is set to 1 in the perfect hash vector 304, then the bit corresponds to a fingerprint of a live object. If the bit for a fingerprint's hash is not set to 1 in the perfect hash vector 304, then the bit corresponds to a fingerprint of a dead object.
As part of the selection phase, the garbage collector can continue by selecting a container for garbage collection, which may be referred to as cleaning, based on the number of the objects in the container that are live objects. For example, if the garbage collector has determined that only 10% of the file segments in the container 140 are dead file segments, which are not in use by any of the computer system's programs, then the garbage collector bypasses selection of the container 140 for garbage collection or cleaning, and therefore retains the container 140 as it is. Continuing this example, the garbage collector resets the bits in the perfect hash vector 304 that correspond to the hashes of the fingerprints for the file segments in the container 140, which enables the subsequent processing of containers to not require retention of these file segments, which may be referenced as duplicates in other containers.
In an alternative example, if the garbage collector has determined that 40% of the file segments in the container 140 are dead file segments, then the garbage collector selects the container 140 for garbage collection or cleaning. The garbage collector may evaluate multiple containers in the cleaning range 306 to select any combination of these containers in the cleaning range 306 for garbage collection or cleaning. Although the example describes 40% of a container's file segments being dead file segments as satisfying a cleaning criterion or container selection threshold, any cleaning criterion or container selection threshold may be used.
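The selection decision in the two examples above can be sketched as a dead-fraction check. The 25% default threshold is an assumption picked only so that the 10% case is bypassed and the 40% case is selected; the text states that any threshold may be used. The fingerprint names and the dictionary stand-in for the hash function are likewise hypothetical.

```python
def dead_fraction(container_fps, position_of, hash_vector):
    """Fraction of a container's segments whose bit is not set (dead segments)."""
    dead = sum(1 for fp in container_fps if not hash_vector[position_of[fp]])
    return dead / len(container_fps)

def should_clean(container_fps, position_of, hash_vector, threshold=0.25):
    # threshold is a hypothetical container selection threshold.
    return dead_fraction(container_fps, position_of, hash_vector) >= threshold

fps = [f"fp-{i}" for i in range(10)]            # ten segments in one container
position_of = {fp: i for i, fp in enumerate(fps)}

mostly_live = [0] + [1] * 9                     # 1 of 10 segments is dead
partly_dead = [0] * 4 + [1] * 6                 # 4 of 10 segments are dead
```

With these inputs, `should_clean(fps, position_of, mostly_live)` bypasses the container and `should_clean(fps, position_of, partly_dead)` selects it.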
The garbage collector might complete by executing what may be referred to as a copy phase, which includes copying live objects from a selected container that will be reclaimed into another container that will be retained. Continuing the alternative example, the garbage collector creates the new container 250, copies the live file segments in the container 140 into the new container 250, and resets the bits in the perfect hash vector 304 that correspond to the hashes for the fingerprints of the file segments in the new container 250, which enables the subsequent processing of containers to not require retention of these file segments. Possibly completing the copy phase for the alternative example, the garbage collector deletes the container 140, which is a cleaning or a garbage collection that reclaims unused storage space for subsequent reuse. Although examples use numbers such as 140 and 250 to reference containers, any other type of sequential referencing of containers may be used.
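The copy phase can be sketched as copying the live segments forward, resetting their bits, and deleting the old container. The `copy_forward` helper, the in-memory `containers` dictionary, and the fingerprint names are assumptions for illustration.

```python
def copy_forward(containers, old_id, new_id, position_of, hash_vector):
    """Copy live segments from a selected container into a new one, then reclaim the old one."""
    live = [fp for fp in containers[old_id] if hash_vector[position_of[fp]]]
    containers[new_id] = live
    for fp in live:
        hash_vector[position_of[fp]] = 0   # later containers need not retain duplicates
    del containers[old_id]                 # reclaim the storage of the old container
    return live

fps = ["fp-A", "fp-B", "fp-C", "fp-D", "fp-E"]
position_of = {fp: i for i, fp in enumerate(fps)}   # stand-in hash function
hash_vector = [1, 0, 1, 0, 1]                        # fp-B and fp-D are dead
containers = {140: list(fps)}

copied = copy_forward(containers, 140, 250, position_of, hash_vector)
```

After the call, only the container 250 remains, holding the three live segments, and their bits are reset.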
If the garbage collector executes the merge, analysis, enumeration, selection, and copy phases while no additional objects are being written to a computer system's containers, then the garbage collector can complete the execution of its phases as described above. However, garbage collection may require significantly more time to complete than the duration of time that a computer system can temporarily suspend the writing of objects to containers. Therefore, the garbage collector can accommodate the writing of objects to containers, which may be referred to as an ingest, while the garbage collector is concurrently executing its phases. Furthermore, the garbage collector can accommodate the writing of deduplicated objects to containers while the garbage collector is concurrently executing its phases. Data deduplication generally refers to a specialized data compression technique for eliminating redundant copies of repeating data.
The garbage collector can identify the writing of objects to containers which occur after the garbage collector started the merge phase until the garbage collector started the enumeration phase, which is depicted as the all live range 308 in FIG. 3 because all of the objects written during this time period are live objects since they have just been written. The garbage collector disables the deduplication of all Lp metadata from the start of the merge phase through the start of the enumeration phase so that metadata is written to new containers, such that the garbage collector can review these new containers for references to data segments during the enumeration phase. For example, between the times that the garbage collector started the merge and enumeration phases, a backup/restore application duplicated data from the container 140 and the corresponding metadata to the new container 180, and also wrote new data to the new container 190 and the corresponding metadata to the new container 200. After starting the enumeration phase, the garbage collector reviews the new containers 180-210, identifies the L0 and Lp references in the metadata segments in the containers 180 and 200, which identify the fingerprints of the live L0 data segments and the live Lp metadata segments in the containers 140 and 190, applies the hash function 208 to these identified fingerprints to create hashes, and sets the bits in the hash vector that represent these hashes. When subsequently processing the containers in the cleaning range, the garbage collector will reference the bits for the hashes of the fingerprints for the L0 data segments in the container 140 as indicating live file segments.
However, since the new container 190 is not in the cleaning range, the garbage collector will not reference the bits for the hashes of the fingerprints for the L0 data segments in the container 190 as indicating live file segments, such that the garbage collector might not reset the bits for the hashes of the fingerprints for the L0 data segments in the container 190.
Writing an object to a container can resume the use of a dead object. For example, a program in the computer system created the file 60 that included the file segment Z, the backup/restore application wrote the file 60 to the container 160, the program deleted the file segment Z, and the backup/restore application wrote metadata that indicates the deletion of the file segment Z to the container 160. Since the garbage collector has yet to delete the file segment Z from the container 160, the file segment Z is a dead file segment, and the fingerprint index still includes the fingerprint Z for the file segment Z, and still maps the fingerprint Z to the container 160. Then a user of the program instructed the backup/restore application to restore the file segment Z from a backup file, and the program is currently using the restored file segment Z.
The backup/restore application may create a notification to write file segments which include the revived file segment Z when the garbage collector is not executing its phases. Since the fingerprint index still includes the fingerprint Z for the file segment Z, and still maps the fingerprint Z to the container 160, the backup/restore application writes the file segment Z and the corresponding metadata to the container 160 as deduplicated data.
Alternatively, the backup/restore application may create a notification to write file segments which include the revived file segment Z between the times that the garbage collector started the merge and enumeration phases. The garbage collector tracks the resumption of use, or revival, of all dead objects by disabling the deduplication of all Lp metadata from the start of the merge phase through the start of the enumeration phase. Therefore, since the fingerprint index still includes the fingerprint Z for the file segment Z, and still maps the fingerprint Z to the container 160, the garbage collector permits the backup/restore application to write the file segment Z to the container 160 as deduplicated data and write the corresponding metadata to the new container 240. When the garbage collector reviews the metadata in the new containers, which include the new container 240, the garbage collector identifies the fingerprint Z of the file segment Z written to the container 160, applies the hash function 208 to the fingerprint Z, and then sets the bit in the perfect hash vector 304 that corresponds to the hash for the fingerprint Z of the previously dead file segment Z written to the container 160. When subsequently processing the container 160, the garbage collector will reference this bit as indicating a live file segment, thereby retaining the revival of the previously dead file segment Z.
Since the garbage collector has the capability to track the revival of dead objects, the garbage collector may process the writing of a new object as the revival of a dead object. For example, between the times that the garbage collector started the merge and enumeration phases, a backup/restore application creates a notification to write a new file segment D, and during the enumeration phase the garbage collector applies the hash function 208 to the new fingerprint D for the new file segment D, and then sets the bit in the perfect hash vector 304 that corresponds to the hash for the new fingerprint D. Coincidentally, the hash for the new fingerprint D is the same as the hash for the old fingerprint X of the old file segment X that is a dead segment which is stored by the container 130. Consequently, when the garbage collector processes the container 130 in the cleaning range 306, and reviews the bit set in the perfect hash vector 304 that corresponds to the hash for the new fingerprint D and the old fingerprint X, the garbage collector will process the dead file segment X as a live segment. This collision of bits for the hash of the new fingerprint D and the old fingerprint X may result in the garbage collector not selecting the container 130 for cleaning when the container 130 should have been selected for cleaning, or result in the garbage collector creating a new container for the live file segments of the container 130 and copying the dead file segment X to the new container.
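The collision between the new fingerprint D and the old fingerprint X can be reproduced with a deliberately weak toy hash. The `toy_hash` function, the vector size of 10, and the use of the single letters "D" and "X" as fingerprints are assumptions chosen so that the collision is easy to see; they are not the hash function 208.

```python
VECTOR_SIZE = 10

def toy_hash(fingerprint: str) -> int:
    # Deliberately weak hash (sum of bytes modulo the vector size), for illustration only.
    return sum(fingerprint.encode()) % VECTOR_SIZE

hash_vector = [0] * VECTOR_SIZE

# Enumeration marks the newly written segment D as live.
hash_vector[toy_hash("D")] = 1

# The dead segment X hashes to the same position, so its bit also reads as set.
x_looks_live = bool(hash_vector[toy_hash("X")])
```

Because both fingerprints map to the same position, checking the bit for X reports a live segment even though only D was marked.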
After the enumeration phase starts, the garbage collector can identify additional notifications to write objects to containers. If the garbage collector identifies a notification to write to a container that is in the range of containers that the garbage collector has already cleaned, the garbage collector permits this writing of objects to this container. For example, the garbage collector has already cleaned the containers 180-210, is in the process of cleaning the containers 140-170 in the current batch 402, and then identifies a notification from a backup/restore application to write objects to the container 190 as deduplicated data, as depicted by FIG. 4. Since the garbage collector has already cleaned the containers 180-210, the garbage collector permits the backup/restore application to write the objects to the container 190 as deduplicated data. The garbage collector does not need to apply the hash function 208 to the fingerprints for the file segments written to the container 190 or need to set the bits in the perfect hash vector 304 that correspond to the hashes for the fingerprints of the file segments written to the container 190 because the current processing of containers will not reference these file segments that are only written to a container that is already cleaned.
If the garbage collector identifies a notification to write to a container that is in the range of containers that the garbage collector is currently cleaning, the garbage collector may modify at least some of the writing of objects to this container. For example, the garbage collector has already cleaned the containers 180-210, is in the process of cleaning the containers 140-170 in the current batch 402, and then identifies a notification from a backup/restore application to write objects to the container 150 as deduplicated data. Since the garbage collector is currently in the process of cleaning the containers 140-170, the garbage collector instructs the backup/restore application to write the objects to the container 150 as data that has not been deduplicated.
The data that is written to containers in the current batch 402 is written as data that has not been deduplicated to enable the tracking of dead objects that are being revived. For example, if the backup/restore application wrote the file segment Y to the container 150 as deduplicated data, and the container 150 previously stored the file segment Y as a dead object, the deduplication of data would result in writing metadata, which indicates the revival of the file segment Y, to the container 150 instead of resulting in actually writing the file segment Y again to the container 150. The garbage collector processes the file segments that are actually written to the containers being cleaned as live file segments, such that a file segment that is actually written to a container that will be retained is also retained, and a file segment that is actually written to a container which will have its live file segments copied forward to a new container is also copied forward to the new container. For example, the backup/restore application writes the file segment Y to the container 150 as data that has not been deduplicated, and the garbage collector copies the live file segments in the container 150 into the new container 250, and also copies the revived file segment Y in the container 150 into the new container 250. If the backup/restore application had not actually written the file segment Y to the container 150 as data that has not been deduplicated, then the garbage collector would have failed to retain the revival of the previously dead file segment Y.
If the garbage collector identifies a notification to write to a container that is below the range of containers that the garbage collector is currently cleaning, the garbage collector permits deduplicating data with objects in this container. For example, the garbage collector has already cleaned the containers 180-210, is in the process of cleaning the containers 140-170 in the current batch 402, and then identifies a notification from a backup/restore application to write objects to the container 110 as deduplicated data. Since the garbage collector has not yet begun the process of cleaning the containers 100-130, the garbage collector permits the backup/restore application to write the objects to the container 110 as deduplicated data. The garbage collector applies the hash function 208 to the fingerprints of these file segments written to the container 110, and then sets the bits in the perfect hash vector 304 that correspond to the hashes for the fingerprints of the file segments written to the container 110 because the subsequent processing of container 110 will reference the bits for these file segments.
If the backup/restore application wrote file segment V to the container 110 as deduplicated data, and the container 110 already stored file segment V as a dead object, the deduplication of data would result in not writing the file segment V again to the container 110. However, the garbage collector would identify the L0 references in the write notification for the container 110, which identify the fingerprints of the live L0 data segments in the container 110, apply the hash function 208 to these identified fingerprints to create hashes, and set the bits in the hash vector that correspond to these hashes, thereby retaining the revival of the previously dead file segment V.
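The three cases for a write notification that arrives after the enumeration phase has started can be summarized in a small decision function. The `WriteMode` names and the assumption that already cleaned containers carry higher numbers than the current batch follow the numeric examples in the text (180-210 cleaned, 140-170 in the current batch, 100-130 not yet cleaned) and are otherwise hypothetical.

```python
from enum import Enum

class WriteMode(Enum):
    DEDUP = "deduplicate; no bits need to be set"
    NO_DEDUP = "write the segments without deduplication"
    DEDUP_AND_MARK = "deduplicate and set the bits for the written fingerprints"

def handle_write(container_id, batch_low, batch_high):
    """Choose how to handle a write, based on where the target container sits."""
    if container_id > batch_high:
        return WriteMode.DEDUP            # container was already cleaned
    if container_id >= batch_low:
        return WriteMode.NO_DEDUP         # container is in the batch being cleaned
    return WriteMode.DEDUP_AND_MARK       # container has not been cleaned yet
```

For a current batch of 140-170, `handle_write(190, 140, 170)`, `handle_write(150, 140, 170)`, and `handle_write(110, 140, 170)` return the three modes in that order.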
FIG. 4 depicts that the garbage collector cleans containers from the log head, which are the higher numbered and more recently created containers in the cleaning range, to the log tail, which are the lower numbered and less recently created containers in the cleaning range. For example, the relatively old container 100 and the relatively new container 200 both store the file segment W, and the garbage collector processes the newer container 200 first, either by retaining the newer container 200 which stores the file segment W or by creating an additional container 220 that stores the file segment W, and then resetting the bit in the perfect hash vector 304 that corresponds to the hash for the fingerprint W of the file segment W. Since the older container 100 was created before the newer container 200 was created, the older container 100 is more likely to store dead segments than the newer container 200, such that the percentage of live segments in the old container 100 is more likely to satisfy the container selection threshold for cleaning than the percentage of live segments in the new container 200. Having reset the bit corresponding to the hash for the fingerprint W after processing the newer container 200, the garbage collector processes the file segment W as a dead segment when determining whether to select the older container 100 for cleaning, which may result in the percentage of live segments in the older container 100 satisfying the container selection threshold for cleaning. In contrast, the garbage collector processing the file segment W as a dead segment for the newer container 200 would be less likely to result in the percentage of live segments in the newer container 200 satisfying the container selection threshold for cleaning. 
Consequently, the garbage collector cleaning the newer containers in the cleaning range before cleaning the older containers in the cleaning range is more likely to reclaim some of the storage space occupied by older containers, which would otherwise have remained inefficiently allocated.
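The effect of cleaning from the log head to the log tail can be sketched by counting live segments per container while resetting each bit after its first use. For brevity this sketch keys the bits directly by fingerprint instead of by hash position, and the container contents are hypothetical.

```python
def live_counts_head_to_tail(containers, live_bits):
    """Visit containers from highest to lowest ID, crediting each live segment only once."""
    bits = dict(live_bits)                 # work on a copy of the live bits
    counts = {}
    for container_id in sorted(containers, reverse=True):
        live = [fp for fp in containers[container_id] if bits.get(fp)]
        counts[container_id] = len(live)
        for fp in live:
            bits[fp] = 0                   # reset so that older duplicates count as dead
    return counts

# The segment W is stored in both the older container 100 and the newer container 200.
containers = {100: ["fp-W", "fp-P"], 200: ["fp-W", "fp-Q"]}
live_bits = {"fp-W": 1, "fp-P": 0, "fp-Q": 1}

counts = live_counts_head_to_tail(containers, live_bits)
```

The newer container 200 is credited with the live segment W, so the older container 100 ends up with no live segments and becomes a strong candidate for cleaning.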