A Bloom filter is a probabilistic data structure used to test whether a test element T is a member of a set S. Conventionally, lookups in Bloom filters have been treated as independent events (e.g., dice rolls) rather than dependent events (e.g., cards drawn from a blackjack shoe). Probability and statistics make clear that a previous roll of a die has no effect on the likelihood of a particular result on a current or subsequent roll. Probability and statistics also make clear that removing one card from a deck does have an effect on the likelihood of a particular result when a subsequent card is drawn.
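A minimal sketch of the structure described above, assuming an m-bit array and k salted hash functions (the names BloomFilter, m, and k are illustrative, not from any particular implementation):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: an m-bit array and k salted SHA-256
    hashes. Membership tests may report false positives but never false
    negatives for items that were actually added."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = 0  # a Python int used as an m-bit array

    def _positions(self, item):
        # Derive k bit positions by hashing the item with k different salts.
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # The item is "possibly present" only if all k bits are set.
        return all(self.bits & (1 << pos) for pos in self._positions(item))

bf = BloomFilter(m=1024, k=3)
bf.add("sub-block-A")
print("sub-block-A" in bf)  # True: an inserted item is always reported
print("sub-block-B" in bf)  # usually False; True here would be a false positive
```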
A Bloom filter may generate a false positive that incorrectly asserts that T is a member of S. However, a Bloom filter will not generate a false negative that incorrectly asserts that T is not a member of S. The rate of false positives produced by a Bloom filter varies directly with the number of elements in S and inversely with the size of the Bloom filter. Therefore, conventional approaches to limiting false positives have involved increasing the size of the Bloom filter and limiting the number of items for which entries are placed in the Bloom filter. Both approaches have significant drawbacks: the former increases the amount of memory required, and the latter decreases the usefulness of the filter.
A traditional Bloom filter uses 1.44 log2(1/e) bits of space per inserted key, where e is the false positive rate. A hypothetically optimal probabilistic data structure would require only log2(1/e) bits. Regardless of whether a hypothetical or traditional filter is used, for a filter of fixed size the false positive rate varies directly with the number of entries in the filter.
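Working those two space bounds through for a couple of representative false positive rates:

```python
import math

# Space per inserted key: a traditional Bloom filter needs 1.44*log2(1/e)
# bits, while a hypothetically optimal structure needs only log2(1/e) bits.
for e in (0.01, 0.001):
    optimal = math.log2(1 / e)
    bloom = 1.44 * optimal
    print(f"e={e}: Bloom ~ {bloom:.2f} bits/key, optimal = {optimal:.2f} bits/key")
```

For a 1% false positive rate this works out to roughly 9.57 bits per key for a traditional Bloom filter versus about 6.64 bits for the hypothetical optimum.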
Bloom filters have been employed in data deduplication (dedupe) to quickly ascertain whether a data sub-block currently being processed is already stored by a dedupe application. Rather than doing an index lookup, which may involve disk access, a first step may be to consult a Bloom filter to determine whether to bother doing the index lookup. A Bloom filter can give a definite "no, the entry is not in the index" answer but cannot give a definite "yes, the entry is in the index" answer. If the data sub-block is definitely not stored, then the index will not be accessed. But if it is possible that the data sub-block is stored, then the index may be accessed. While a Bloom filter may be small enough to fit in memory, the index may be too large to fit in memory.
One of the goals of data deduplication (dedupe) is to reduce data storage. Dedupe applications typically store data sub-blocks in one location and store information (e.g., hash, key, location) about the stored data sub-blocks in another location (e.g., an index). Rather than search through stored sub-blocks, a dedupe application may instead search an index. As deduped data sets become very large, an index used to locate and/or identify the presence of data sub-blocks may become too large to store in memory. Thus, at least a portion of a dedupe index may be stored on a secondary storage device (e.g., disk, tape). However, these secondary storage devices may be unacceptably slow for sub-block indexing. Therefore, yet another layer of data structures may be added to a dedupe application. The additional data structures may include, for example, a Bloom filter. The Bloom filter may store information that identifies entries stored in the index. The Bloom filter would be small enough to reside in memory, rather than on a secondary storage device, and therefore performing a lookup in the Bloom filter may be faster than performing the corresponding lookup in the actual index.