Probabilistic data structures in general, and Bloom Filters (BFs) in particular, are nowadays used in a wide variety of important network applications, thanks to their ability to summarize large amounts of information in a compact way while still allowing fast queries and updates. BFs (see for reference Bloom, B. H. “Space/time trade-offs in hash coding with allowable errors”, in Communications of the ACM, vol. 13, no. 7, July 1970, pp. 422-426) are used both to store local information which needs fast lookups (e.g. for routing, filtering, monitoring, deep packet inspection (DPI), intrusion detection systems (IDS), etc.) and to export data. In distributed databases or peer-to-peer systems, BFs are often used to efficiently export summaries of the resources available on each node.
However, standard BFs only support membership queries, and are therefore not expressive enough for many applications. An extension to BFs called Counting Bloom Filters (henceforth CBFs), as described for instance in L. Fan, P. Cao, J. Almeida, and A. Z. Broder, “Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol”, in IEEE/ACM Transactions on Networking, 8 (3):281-293, 2000, provides a more flexible data structure that can support item deletion and approximate counting. In C. Estan and G. Varghese, “New Directions in Traffic Measurement and Accounting”, in Proceedings of the 1st ACM SIGCOMM Workshop on Internet Measurement, a similar data structure is used for detecting flows that exceed a certain threshold. However, while BF summaries generated by multiple sources can easily be aggregated with no information loss by performing a bit-wise OR, CBFs are not linear with respect to aggregation, which makes them less appealing for many network applications.
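The difference between the two structures can be illustrated with a minimal sketch. It is not taken from any of the cited works: the class names, the salted SHA-256 hashing scheme and the default sizes m and k are assumptions chosen only for illustration. The sketch shows a BF whose summaries are merged by a bit-wise OR, and a CBF that replaces each bit by a counter in order to support deletion and approximate counting.

```python
import hashlib
from typing import List


def _positions(item: str, m: int, k: int) -> List[int]:
    """Derive k positions in [0, m) from salted SHA-256 digests (assumed scheme)."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % m
        for i in range(k)
    ]


class BloomFilter:
    """Standard Bloom Filter: membership queries only."""

    def __init__(self, m: int = 1024, k: int = 4) -> None:
        self.m, self.k = m, k
        self.bits = 0  # the m-bit array, stored in a Python int

    def add(self, item: str) -> None:
        for p in _positions(item, self.m, self.k):
            self.bits |= 1 << p

    def __contains__(self, item: str) -> bool:
        return all((self.bits >> p) & 1 for p in _positions(item, self.m, self.k))

    def union(self, other: "BloomFilter") -> "BloomFilter":
        """Aggregate two summaries built with the same m and k by a bit-wise OR."""
        if (self.m, self.k) != (other.m, other.k):
            raise ValueError("filters must share m and k")
        merged = BloomFilter(self.m, self.k)
        merged.bits = self.bits | other.bits
        return merged


class CountingBloomFilter:
    """Counting Bloom Filter: one counter per position instead of one bit."""

    def __init__(self, m: int = 1024, k: int = 4) -> None:
        self.m, self.k = m, k
        self.counters = [0] * m

    def add(self, item: str) -> None:
        for p in _positions(item, self.m, self.k):
            self.counters[p] += 1

    def remove(self, item: str) -> None:
        """Delete one occurrence; only meaningful for items previously added."""
        for p in _positions(item, self.m, self.k):
            self.counters[p] -= 1

    def count(self, item: str) -> int:
        """Approximate count: the smallest counter the item maps to."""
        return min(self.counters[p] for p in _positions(item, self.m, self.k))
```

In this sketch, a filter obtained through union() answers positively for every item added to either source filter, exactly as if all items had been inserted into a single filter. The count() method may overestimate when different items share counters, which is why the counting is only approximate.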
Other solutions have been proposed to enhance the expressiveness of BFs in order to essentially use them as packet counters (see for instance M. Durand and P. Flajolet, “Loglog counting of large cardinalities,” in ESA03, volume 2832 of LNCS, 2003, pp. 605-617). However, these solutions are still based on a “flat” one-dimensional key space and cannot be used, for example, to keep track of relationships among tuples (e.g., correlating packets belonging to different flows but to the same application). Further, they do not support distinct counting, in that they cannot prevent the same packet from being counted several times; the same holds true for other data structures, such as sketches, which are commonly used for various network applications, or, more particularly, counting sketches, which are used to summarize large vectors of data.
Finally, in Muhammad Mukarram Bin Tariq, “Tuple Set Bloom Filter”, Georgia Tech., presentation Apr. 26, 2006, the author proposes a BF-based data structure which supports approximate tuple queries with undefined attributes. The approach uses a bit matrix in which each row is associated with one of the attributes of the tuple. Upon each element insertion, a different set of K independent hash functions is selected out of a total set H and used to set the bits of the map on each row, as in a standard Bloom Filter. Upon a membership query, a particular lookup matrix is used and the hash value produced by every function in H is computed over the input attribute values. The query returns a positive result if there exist K hash functions that each address a set bit on every row. In order to perform wildcard queries, the rows associated with the undefined attributes can simply be skipped. This data structure, however, supports neither a cardinality estimation query nor a threshold trespassing query. Moreover, it can return only a Boolean value as a response and is thus not suitable for counting.
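The insertion and query procedure just described can be sketched as follows. This is only an illustrative reading of the description above, not the implementation of the cited presentation: the class name, the parameters (width, total, k), the salted SHA-256 hash family and, in particular, the way the K functions are selected for each inserted tuple (here, pseudo-randomly from a seed derived from the tuple itself) are all assumptions.

```python
import hashlib
import random
from typing import List, Optional, Sequence


def _h(index: int, value: str, width: int) -> int:
    """Hash function number `index` of the total set H (assumed: salted SHA-256)."""
    digest = hashlib.sha256(f"{index}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % width


class TupleSetBloomFilter:
    """Bit matrix with one row per tuple attribute (hypothetical sketch)."""

    def __init__(self, n_attrs: int, width: int = 512, total: int = 16, k: int = 4) -> None:
        self.n_attrs, self.width, self.total, self.k = n_attrs, width, total, k
        self.rows: List[List[bool]] = [[False] * width for _ in range(n_attrs)]

    def add(self, tup: Sequence[str]) -> None:
        if len(tup) != self.n_attrs:
            raise ValueError("tuple has the wrong number of attributes")
        # Assumption: the K functions used for this tuple are drawn
        # pseudo-randomly out of H, seeded by the tuple contents.
        seed = int.from_bytes(hashlib.sha256("|".join(tup).encode()).digest()[:8], "big")
        chosen = random.Random(seed).sample(range(self.total), self.k)
        # The same K functions set bits on every row, one attribute per row.
        for row, value in zip(self.rows, tup):
            for i in chosen:
                row[_h(i, value, self.width)] = True

    def query(self, pattern: Sequence[Optional[str]]) -> bool:
        """Membership query; None marks an undefined (wildcard) attribute."""
        matching = 0
        for i in range(self.total):
            # Function i matches if it addresses a set bit on every row
            # whose attribute is defined; wildcard rows are skipped.
            if all(
                row[_h(i, value, self.width)]
                for row, value in zip(self.rows, pattern)
                if value is not None
            ):
                matching += 1
        return matching >= self.k
```

Under these assumptions, a query for an inserted tuple, with or without wildcards, always returns True, since the K functions chosen at insertion address set bits on every row. The only possible answer is a Boolean, which reflects the limitation noted above: the structure provides no cardinality estimate and no counter that could be compared against a threshold.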
It is therefore an object of the present invention to improve and further develop a method and a system for probabilistic processing of data of the initially described type in such a way that an efficient summary of data is realized while at the same time a high expressiveness with regard to the kinds of queries that can be performed on the data is provided.