1. Field of the Invention
Embodiments of the present invention relate to Bloom filters.
2. Background Art
A Bloom filter is a compact data structure that probabilistically represents a data set in order to support membership queries, that is, queries that check whether an element is a member of the data set. The cost of this compact representation is a small probability of false positives, which occur when the Bloom filter incorrectly recognizes an element as a member of the set. In most cases, however, such false positives are an acceptable tradeoff for the reduced data storage size and faster lookup time that a Bloom filter provides.
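By way of illustration only, a Bloom filter of the kind described above can be sketched as a bit array together with several hash-derived bit positions per element. The class name BloomFilter, the method names add and might_contain, and the use of a SHA-256 digest with double hashing are assumptions made for this sketch and are not drawn from any particular implementation.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: num_bits bits, num_hashes positions per element."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One bit per slot, packed eight to a byte; all bits start cleared.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Assumption: derive the bit positions by double hashing two
        # 64-bit halves of a SHA-256 digest of the element.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        # Set every bit position derived from the element.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # True means "possibly a member" (false positives can occur);
        # False means "definitely not a member".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

In this sketch an element that has been added is always reported as possibly present, while an element that was never added may occasionally be reported as present when all of its bit positions happen to have been set by other elements.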
Bloom filters have been used in database applications to store large amounts of static data and to reduce lookup time by allowing lookups to be served from faster main memory rather than from a slow storage device. They have been found to be particularly useful in data management for modeling, storing, indexing, and querying data and services hosted by numerous, heterogeneous computing nodes.
Bloom filters may be compressed to improve performance. Compression may be especially useful when the Bloom filter is passed as a message between computing nodes, particularly when the information must be transmitted repeatedly and transmission size is a limiting factor. The cost associated with compressed Bloom filters includes processing time for compression and decompression.
Compressed Bloom filters must be read into memory in their entirety. This can make them unsuitable for certain applications with memory constraints. Various data compression techniques can take advantage of sparsely populated Bloom filters. However, such techniques cannot be applied to non-sparse Bloom filters, because non-sparse Bloom filters are incompressible. Furthermore, Bloom filters, even if empty or unpopulated with data, may use large amounts of disk space, increasing not only storage costs but also disk seek time when the Bloom filters are accessed.
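The difference between sparse and non-sparse bit arrays can be illustrated with a general-purpose compressor. The following is a rough sketch only: zlib is used as a stand-in for whatever compression technique a given system applies, the 64 KiB array size is hypothetical, and random bytes are assumed to approximate a well-populated filter in which roughly half of the bits are set.

```python
import os
import zlib

NUM_BYTES = 1 << 16  # hypothetical 64 KiB bit array

# Sparse filter: only a handful of bits are set.
sparse = bytearray(NUM_BYTES)
for i in range(0, NUM_BYTES, 512):
    sparse[i] = 0x01

# Non-sparse filter: random bytes stand in for a densely populated array.
dense = os.urandom(NUM_BYTES)

sparse_compressed = zlib.compress(bytes(sparse))
dense_compressed = zlib.compress(dense)

# The sparse array shrinks substantially; the dense array barely shrinks at all.
print(len(sparse_compressed), len(dense_compressed))

# To query the compressed filter, the whole array is decompressed first.
restored = zlib.decompress(sparse_compressed)
```

Note that in this sketch the entire bit array is restored into memory before any single bit can be tested, which reflects the memory constraint described above.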
In addition, user corpora or databases storing user data may vary dramatically in size and composition, and it is not possible to estimate, before a Bloom filter is created for a given user, the size that the Bloom filter will need in order to store that user's data. Creating a Bloom filter for each user would traditionally require building a second Bloom filter and an additional step of re-reading the entire corpus of user data into the second Bloom filter.
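The dependence of filter size on corpus size can be made concrete with the standard sizing relations for a Bloom filter, in which the number of bits m and the number of hash functions k follow from the expected number of elements n and the target false positive probability p as m = -n ln p / (ln 2)^2 and k = (m / n) ln 2. The function name bloom_parameters and its parameter names below are hypothetical and are used only for this sketch.

```python
import math
from typing import Tuple


def bloom_parameters(expected_items: int, false_positive_rate: float) -> Tuple[int, int]:
    """Return (num_bits, num_hashes) from the standard Bloom filter sizing relations."""
    # m = -n * ln(p) / (ln 2)^2, rounded up to a whole number of bits.
    num_bits = math.ceil(
        -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    # k = (m / n) * ln 2, rounded to the nearest integer and at least 1.
    num_hashes = max(1, round(num_bits / expected_items * math.log(2)))
    return num_bits, num_hashes


# Two hypothetical users whose corpora differ greatly in size.
small_user = bloom_parameters(1_000, 0.01)
large_user = bloom_parameters(1_000_000, 0.01)
```

Because the required number of bits grows in proportion to the number of elements, a filter sized for one user's corpus may be far too large or far too small for another's, which is why the element count would need to be known before the filter is built.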
Present approaches to compressing, decompressing, and creating separate Bloom filters increase not only the amount of processing involved but also the time taken for such processing.