Data de-duplication (dedupe) systems can experience situations where a unique block of data that was stored is no longer needed because no entity is referencing that unique block of data. The referencing entity (e.g., file, object) may, for example, have been deleted. Dedupe systems can also experience situations where a reference to a block of unique data that should be present in a set of stored unique data can go unsatisfied because the block of unique data is no longer available. The block may, for example, have been damaged or inadvertently deleted. The first situation results in wasted memory. Conventional garbage collection systems may be employed to reclaim this memory. The second situation results in potentially corrupted files and file read failures. Conventional reference checkers may be employed to locate, report on, and/or repair unresolved references.
Unfortunately, conventional garbage collection systems and conventional reference checkers may consume unacceptable amounts of memory and/or time to perform their functions. Unacceptable time and/or memory may be consumed because conventional systems and methods may do list-to-list comparisons after building and sorting lists. Lists may be created in memory to avoid disk I/O for each lookup of an item on the list. For example, a conventional garbage collection system may acquire a list of all the blocks of data stored by the dedupe system and may also acquire a list of all the blocks of data referenced by a referencing system (e.g., file system). Comparing unsorted lists may be computationally infeasible. Therefore, conventional garbage collection systems may sort these lists so that list-to-list comparisons can be performed in a relevant timeframe. Similarly, conventional reference health checkers may acquire a list of all the blocks of data stored by the dedupe system and may also acquire a list of all the blocks of data referenced by a referencing system (e.g., file system). Conventional reference health checkers may then sort these lists so that list-to-list comparisons can be performed in a relevant timeframe.
In a dedupe system, data is processed to identify and store the unique data present in a source (e.g., file, data stream). Redundant occurrences of data may be replaced with references to a single stored copy of the unique data. When the references are smaller than the unique data, then savings in storage space may be achieved. When the references are substantially smaller than the unique data blocks, and when there are multiple references to a unique data block, then significant savings in storage space may be achieved.
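The storage saving described above can be illustrated with a minimal sketch. It assumes fixed-size blocks and SHA-256 digests used as block tags; both choices, along with the names `dedupe` and `rebuild`, are hypothetical and are not taken from any particular dedupe system.

```python
import hashlib


def dedupe(data: bytes, block_size: int = 4):
    """Split data into fixed-size blocks and store each unique block once."""
    store = {}       # block tag -> unique block (assumed in-memory block pool)
    references = []  # ordered block tags that reconstruct the source
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        tag = hashlib.sha256(block).hexdigest()  # assumed tag scheme
        store.setdefault(tag, block)             # keep only the first copy
        references.append(tag)                   # redundant data becomes a reference
    return store, references


def rebuild(store, references):
    """Reassemble the source by resolving every reference."""
    return b"".join(store[tag] for tag in references)


source = b"ABCDABCDABCDWXYZ"
store, references = dedupe(source)
```

Here four references resolve to only two stored blocks. With blocks this small the tags are larger than the data, which is why savings depend on the references being smaller than the blocks they replace.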
However, as dedupe systems are used, some of the unique data may no longer be needed. For example, the last file that references a piece of unique data may be removed from a referencing source (e.g., file system) and thus the piece of unique data may no longer be needed. Undesirably, the piece of unique data may remain in the data store where unique data is stored, even though it is unreferenced. This is undesirable because unused data is consuming storage space that could be used for other unique data that is actually referenced. Conventional garbage collection systems may attempt to reclaim the space associated with the unreferenced unique data. To reclaim this space, the unreferenced unique data must first be identified. In one example, a list of all references is compared to a list of all unique data. Unique data for which there is no reference may be deleted. While conceptually simple, this task may be resource-intensive and/or computationally complex.
In one example dedupe system, unique pieces of data may be referred to as blocks. In this example, a block pool may store unique blocks that may be individually accessible. A block may be assigned a unique identifier (e.g., a block tag). The block pool may maintain a list of unique blocks. In another example dedupe system, although files or data streams may ultimately be subdivided into blocks, files or data streams may first be subdivided into binary large objects (BLOBs). A BLOB may then be subdivided into a number of blocks. BLOBs may be employed to refer to sets of blocks because the number of blocks may become too large to track individually in a practical manner. In this example, blocks may not be stored individually but may be stored along with other blocks found in the same BLOB. A BLOB may be assigned a unique identifier (e.g., a BLOB tag). A BLOB may include a list of blocks stored in the BLOB. In this example, while a block pool may be said to store blocks, it actually stores BLOBs in which the blocks are stored. In the first example, a list of unique blocks may be directly available. In the second example, a list of BLOBs may be directly available and a list of unique blocks may be accessible through the list of BLOBs.
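The second arrangement can be sketched as a small data structure. The class names `Blob` and `BlobPool` and the tag strings are hypothetical, chosen only to show that the BLOB list is directly available while the block list is reached by walking the BLOBs.

```python
from dataclasses import dataclass, field


@dataclass
class Blob:
    tag: str                                     # unique BLOB identifier
    blocks: dict = field(default_factory=dict)   # block tag -> block bytes


class BlobPool:
    """Hypothetical pool that stores BLOBs rather than individual blocks."""

    def __init__(self):
        self.blobs = {}  # BLOB tag -> Blob

    def add(self, blob):
        self.blobs[blob.tag] = blob

    def blob_tags(self):
        # The list of BLOBs is directly available.
        return list(self.blobs)

    def block_tags(self):
        # The list of blocks is only reachable through the list of BLOBs.
        return [tag for blob in self.blobs.values() for tag in blob.blocks]


pool = BlobPool()
pool.add(Blob("blob-1", {"b1": b"aaaa", "b2": b"bbbb"}))
pool.add(Blob("blob-2", {"b3": b"cccc"}))
```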
Conventional garbage collection processing may compare the entire list of blocks associated with a storing entity (e.g., block pool) to the entire list of block tags referenced by a referencing entity (e.g., file system). Conventional garbage collection processing may, additionally or alternatively, compare the entire list of BLOB tags associated with a storing entity (e.g., BLOB pool, block pool) to the entire list of BLOB tags referenced by a referencing entity (e.g., file system). BLOBs for which there are no references may be reclaimed. At a finer granularity, blocks for which there are no references may be reclaimed. References that cannot be satisfied may trigger error processing including, for example, reporting an error and/or attempting to correct the error.
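A naive version of this comparison might look like the following sketch. The function name `compare_lists` and the sample tags are hypothetical. Each membership test is a linear scan of an unsorted list, which is the costly behavior conventional systems try to avoid.

```python
def compare_lists(stored_tags, referenced_tags):
    """Compare two unsorted tag lists using linear scans."""
    # Stored items that nothing references are candidates for reclamation.
    unreferenced = [tag for tag in stored_tags if tag not in referenced_tags]
    # References with no stored item cannot be satisfied and need error handling.
    unsatisfied = [tag for tag in referenced_tags if tag not in stored_tags]
    return unreferenced, unsatisfied


stored = ["b3", "b1", "b4", "b2"]
referenced = ["b2", "b5", "b1", "b1"]
unreferenced, unsatisfied = compare_lists(stored, referenced)
```

In this sample, `b3` and `b4` are unreferenced and could be reclaimed, while the reference to `b5` cannot be satisfied.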
Conventional systems may perform either garbage collection or reference checking using list-to-list comparisons. Consider a list of referencing entities associated with a file system. This list may include hundreds of millions (1×10^8) of entries. Let α=the number of entries in the referencing list. Consider also a list of referenced items. In a large enterprise, this may include hundreds of millions (1×10^8) of items. Let β=the number of referenced entities. When the lists are unsorted, finding any given entry from the referencing list in the referenced items list would consume O(β/2) time on average. This comparison would need to be performed α times. O(α*β/2) may be an unacceptable amount of time.
Therefore, conventional garbage collection or reference checking systems and methods typically sort the lists before performing the comparisons. In one example, the amount of time required to sort the two lists would be O(α log α)+O(β log β). While sorting the lists reduces the amount of time required to compare the lists (e.g., O(α log α)+O(β log β) << O(α*β/2)), sorting the lists may produce an additional issue of unacceptable or undesirable memory usage.
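The sort-then-compare approach can be sketched as a single merge pass over two sorted lists. The function name `sorted_compare` and the sample tags are hypothetical, and the operation counts at the end are rough estimates that ignore constant factors and assume base-2 logarithms. Both lists are sorted fully in memory, which is the source of the memory concern.

```python
import math


def sorted_compare(stored_tags, referenced_tags):
    """Sort both lists, then find mismatches in one linear merge pass."""
    stored = sorted(stored_tags)          # O(beta log beta), held in memory
    referenced = sorted(referenced_tags)  # O(alpha log alpha), held in memory
    unreferenced, unsatisfied = [], []
    i = j = 0
    while i < len(stored) and j < len(referenced):
        s, r = stored[i], referenced[j]
        if s == r:
            # Matched: skip every copy of this tag in both lists.
            while i < len(stored) and stored[i] == s:
                i += 1
            while j < len(referenced) and referenced[j] == r:
                j += 1
        elif s < r:
            unreferenced.append(s)  # stored but never referenced
            i += 1
        else:
            unsatisfied.append(r)   # referenced but not stored
            j += 1
    unreferenced.extend(stored[i:])
    unsatisfied.extend(referenced[j:])
    return unreferenced, unsatisfied


result = sorted_compare(["b3", "b1", "b4", "b2"], ["b2", "b5", "b1", "b1"])

# Rough operation counts for alpha = beta = 1e8 (illustrative only).
alpha = beta = 10**8
naive_cost = alpha * beta / 2
sort_cost = alpha * math.log2(alpha) + beta * math.log2(beta)
```

Under these assumptions the unsorted comparison costs on the order of 10^15 operations, while sorting costs on the order of 10^9.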
For garbage collection and/or reference health checking, the size of the referencing information to be sorted and the size of the block information to be sorted may produce both complexity and memory issues. However, in practice, memory issues tend to be larger concerns. When the amount of memory available for sorting is less than the amount of memory required to hold the entire set of data to be sorted, then sorting may experience slowdowns due, for example, to disk I/O.