A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. The membership test can yield false positives but not false negatives. The more elements are added to the set represented by the Bloom filter, the higher the probability of false positives. Bloom filters have a strong space advantage over other data structures for representing sets, such as self-balancing search trees, tries, hash tables, or simple arrays or linked lists of the entries.
A Bloom filter is an approximate encoding of a set of items or keys using a bit vector of b bits. During encoding, each item is hashed to a number between 1 and b, and the corresponding bit in the bit vector is set. To check whether an item is a member of the set, the item is hashed and the corresponding bit is checked. If the bit is not set, then the item is definitely not in the set. If the bit is set, then either the item is in the set or the hash value of this item collided with the hash value of some other item that is in the set. Because of hash collisions, a Bloom filter can produce false positives (the item is reported as in the set, but it is not), but it never produces false negatives (the item is in the set, but is not reported).
Conventional approaches improve the effectiveness of a Bloom filter by hashing each item several times with independent hash functions. Suppose, for example, that k hash functions h_1, ..., h_k are used. To encode an item x, the k bits in the bit vector that correspond to h_i(x) for 1 ≤ i ≤ k are set. (The same bit may be picked any number of times.) To check whether item y is a member of the set, item y is hashed k times using the same hash functions. The bit corresponding to h_i(y) is examined for each 1 ≤ i ≤ k to determine whether it is set. If any of the k bits is not set, then y cannot be a member of the set; otherwise, all k bits are set and item y is either in the set or a false positive.
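By way of illustration only, the encoding and membership check described above may be sketched in Python as follows. The class name BloomFilter, the method names, and the use of index-salted SHA-256 digests as the k hash functions are assumptions made for this sketch and are not taken from any particular implementation. Bit positions run from 0 to b-1 rather than 1 to b.

```python
import hashlib


class BloomFilter:
    """Minimal sketch of a conventional Bloom filter: b bits, k hash functions."""

    def __init__(self, b: int, k: int) -> None:
        self.b = b
        self.k = k
        self.bits = bytearray((b + 7) // 8)  # bit vector, initially all zero

    def _positions(self, item: str):
        # Assumed hash family: h_i(item) = SHA-256("i:item") reduced modulo b.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.b

    def add(self, item: str) -> None:
        # Encoding: set the k bits selected by h_1(item) .. h_k(item).
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # Any unset bit proves absence; all bits set means "member or false positive".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


bf = BloomFilter(b=10_000, k=4)
bf.add("apple")
print(bf.might_contain("apple"))  # True: an added item is never reported absent
```

A query for an item that was never added usually returns False, but may return True if all k of its bits happen to have been set by other items.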
Conventional Bloom filters have control points comprising the number of items in the input (n), the amount of memory (b), the number of hash functions (k), and the probability of a false positive (i.e., the false positive rate or fpr). Fixing the size of the input allows two of the other three control points to be chosen; the remaining one is then determined. Memory and the number of hash functions are related. If the number of hashes is fixed and memory is increased, the false positive rate continually decreases. However, if the memory is fixed and the number of hash functions is increased, the false positive rate exhibits a minimum when the expected density (i.e., the percentage of bits set to 1) of the conventional Bloom filter is approximately 50%.
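The relationship among these control points may be illustrated with the standard approximation for a Bloom filter, under which the expected density is 1 - e^(-kn/b) and the false positive rate is that density raised to the power k. The Python sketch below assumes this approximation; the function names and the example values of n and b are hypothetical.

```python
import math


def expected_density(n: int, b: int, k: int) -> float:
    """Approximate fraction of bits set to 1 after encoding n items."""
    return 1.0 - math.exp(-k * n / b)


def false_positive_rate(n: int, b: int, k: int) -> float:
    """Approximate probability that all k probed bits are set for a non-member."""
    return expected_density(n, b, k) ** k


def best_k(n: int, b: int, max_k: int = 64) -> int:
    """Number of hashes (1..max_k) that minimizes the approximate fpr for fixed n and b."""
    return min(range(1, max_k + 1), key=lambda k: false_positive_rate(n, b, k))


n, b = 1000, 11542  # example values: roughly 11.5 bits of memory per item
k = best_k(n, b)
print(k, expected_density(n, b, k), false_positive_rate(n, b, k))
```

For these example values the minimum falls at k = 8, where the expected density is close to 50% and the false positive rate is close to 1/256. Fewer or more hash functions both give a higher rate for the same memory.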
Although conventional Bloom filter technology has proven to be useful, it would be desirable to present additional improvements. A conventional Bloom filter is built and then populated with a set of items or keys. To build a conventional Bloom filter, a user has to know approximately how many keys will populate the filter in order to know how much memory to allocate to it. However, in many applications the number of keys is not known before the filter is built. Consequently, a user is forced to overestimate the number of keys anticipated for the conventional Bloom filter, leading to inefficient use of memory. Furthermore, inefficient use of memory may lead to a false positive rate that is not optimal for the memory allocated.
Conventional Bloom filters require an accurate estimate of the cardinality of the initial input set. The cardinality is the number of distinct values in a multi-set. The size of the initial input, along with the target false positive rate, determines the amount of memory allocated to encode the set. If the cardinality estimate is too low, the false positive rate can be much higher than expected.
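As a hedged illustration of how the cardinality estimate and the target false positive rate together fix the memory, the sketch below uses the standard sizing rule b = -n ln(fpr) / (ln 2)^2 with k = (b/n) ln 2. The rule assumes the number of hashes is chosen to minimize the false positive rate; the function name size_filter is hypothetical.

```python
import math


def size_filter(n: int, target_fpr: float) -> tuple[int, int]:
    """Return (b, k): bits of memory and number of hashes for an estimated n items."""
    # Memory needed when k is chosen optimally for the target rate.
    b = math.ceil(-n * math.log(target_fpr) / (math.log(2) ** 2))
    # Optimal number of hashes for that memory, at least one.
    k = max(1, round((b / n) * math.log(2)))
    return b, k


print(size_filter(1000, 1 / 256))  # (11542, 8)
```

Because b is computed from the estimate of n alone, the allocation cannot adapt if the estimate later proves wrong.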
For example, a conventional Bloom filter may have a target false positive rate of 1/256, which corresponds to an optimal 8 hashes and a target filter density of 50%. If the actual cardinality is as little as 2 times the cardinality estimate, the false positive rate can be about 25 times what was expected. If the actual cardinality is 4 times the cardinality estimate, the false positive rate jumps to about 150 times the expected value. In this case, over half of the queries for items that are not in the set return false positives, and the Bloom filter is not particularly useful.
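These figures can be checked with a short calculation, sketched here under the standard approximation for Bloom filter density. A filter designed for 50% density leaves half of its bits unset at the estimated cardinality; if the actual cardinality is c times the estimate, the fraction of unset bits falls to (1/2)^c, so the false positive rate becomes (1 - (1/2)^c)^k. The function name fpr_inflation is hypothetical.

```python
def fpr_inflation(c: float, k: int = 8) -> float:
    """Ratio of actual to designed false positive rate when the actual
    cardinality is c times the estimate (filter designed for 50% density)."""
    designed = 0.5 ** k             # 1/256 for k = 8
    actual = (1.0 - 0.5 ** c) ** k  # density rises to 1 - (1/2)^c
    return actual / designed


print(fpr_inflation(2))            # about 25.6
print(fpr_inflation(4))            # about 152.8
print((1.0 - 0.5 ** 4) ** 8)       # about 0.597, i.e. over half
```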
What is therefore needed is a system, a computer program product, and an associated method for generating and using a dynamic Bloom filter that self-sizes as more keys are entered in the Bloom filter. The need for such a solution has heretofore remained unsatisfied.