1. Field of the Invention
The present invention relates to techniques for monitoring a data stream. More specifically, the present invention relates to a method and apparatus for determining whether a data element was observed in a data stream.
2. Related Art
Advances in semiconductor technology have led to increases in computing power, which in turn have led to rapid growth in the rate at which data is generated. The increased rate at which data can be generated has made it more difficult to process data in real time. One such real-time data-processing task is to determine whether a data element was observed within a specified time period in a stream of data. For example, a network operator may desire to query a network router to determine whether a source address was observed within a specified time period. Unfortunately, a router has a high throughput and therefore processes network packets at a fast rate. Hence, a fast and efficient pattern-matching technique is desirable.
One such technique uses Bloom filters to monitor the data stream. A Bloom filter is a bit array of m bits into which n keys, {a1, a2, ..., an} ∈ A, are mapped by k hashing functions, h1, h2, ..., hk. For example, FIG. 1A illustrates bit vector 102 used in a Bloom filter, wherein bit vector 102 includes m elements.
For each element in set A, the k hashing functions generate k bit positions within the bit vector. For example, if there are five hashing functions, five bit positions will be generated for each element in set A. Next, the elements of the bit vector at each of these bit positions are set to 1 (i.e., Boolean true) to indicate that the element in set A was observed.
FIG. 1B illustrates an exemplary Bloom filter 104 with 16 elements (i.e., m=16) after a first data element is observed. Furthermore, three hashing functions (i.e., k=3) are used to generate bit positions for bit vector 104. Note that prior to recording data into bit vector 104, all elements in bit vector 104 are initialized to a Boolean false (i.e., 0). For the sake of clarity, only the elements in bit vector 104 that are of interest are filled in with a value; the blank elements in bit vector 104 are set to 0 (i.e., false).
In this example, a first data element is received and the three hashing functions generate the three bit positions 2, 7, and 15. The elements of bit vector 104 that correspond to bit positions 2, 7, and 15 are then marked with a Boolean true (i.e., 1) to indicate that the first data element was observed.
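By way of illustration only, the recording step described above can be sketched in Python as follows. The hashing functions here are an assumption: each is SHA-256 applied to a one-byte index followed by the element, which is not specified in the description above, so the generated bit positions will generally differ from positions 2, 7, and 15 of FIG. 1B. The names bit_positions and record, and the sample address, are hypothetical.

```python
import hashlib

def bit_positions(element: bytes, m: int, k: int) -> list:
    """Return k bit positions in the range [0, m) for the element.

    Assumption: the i-th hashing function is SHA-256 over a one-byte
    index i followed by the element (valid for k <= 256).
    """
    positions = []
    for i in range(k):
        digest = hashlib.sha256(bytes([i]) + element).digest()
        positions.append(int.from_bytes(digest[:8], "big") % m)
    return positions

def record(bits: list, element: bytes, k: int) -> None:
    """Set the bit at each generated position to 1 (Boolean true)."""
    for pos in bit_positions(element, len(bits), k):
        bits[pos] = 1

bits = [0] * 16                  # m = 16, all elements initialized to 0
record(bits, b"10.0.0.1", k=3)   # hypothetical source address, k = 3
```

Because two of the k hashing functions may generate the same position for one element, between one and k bits are set per recorded element.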
To determine whether an element exists in set A, the hashing functions are used to generate bit positions for the bit vector. If all of the elements in the bit vector corresponding to these bit positions are set to 1, then the element is reported as existing in set A, subject to some probability that the match is a false positive. However, if any element corresponding to these bit positions is set to 0, the element does not exist in set A.
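By way of illustration only, the membership test described above can be sketched in Python as follows. The bit positions are supplied directly rather than computed, standing in for the outputs of the hashing functions. Positions 2, 7, and 15 are those of the first data element; the probe positions 2, 7, and 9 and the function names are hypothetical.

```python
def record(bits, positions):
    """Set the bit at each generated position to 1."""
    for pos in positions:
        bits[pos] = 1

def maybe_observed(bits, positions):
    """Return True only if every generated position holds a 1.

    A True result may be a false positive; a False result is definitive.
    """
    return all(bits[pos] == 1 for pos in positions)

bits = [0] * 16                          # m = 16, all elements initially 0
record(bits, (2, 7, 15))                 # positions of the first data element
hit = maybe_observed(bits, (2, 7, 15))   # every position holds a 1
miss = maybe_observed(bits, (2, 7, 9))   # hypothetical probe; position 9 holds a 0
```

In this sketch, hit is true because all three positions were set, while miss is false because a single 0 at position 9 is sufficient to rule the probe out.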
FIG. 1C illustrates the exemplary Bloom filter 104 of FIG. 1B after a second data element is observed. When the second data element is observed, the three hashing functions generate the three bit positions 6, 11, and 12. Next, the elements of bit vector 104 corresponding to bit positions 6, 11, and 12 are marked to indicate that the second data element was observed.
Unfortunately, the bit positions that are generated by the hashing functions for one element in set A can overlap a subset of the bit positions generated by the hashing functions for another element in set A. Hence, a given bit position can be set multiple times. As a result, when a query is made on the Bloom filter to determine whether an element exists in set A, the Bloom filter can produce a “false positive.” Note that the Bloom filter can be tuned to reduce the possibility of generating a false positive. This is typically achieved by increasing the size of the bit vector.
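By way of illustration only, the effect of the bit vector size on false positives can be estimated with the following Python sketch. It uses the standard textbook estimate (1 - (1 - 1/m)^(kn))^k, which assumes that the hashing functions distribute positions uniformly and independently; this formula is not stated in the description above, and the function name is hypothetical.

```python
def false_positive_rate(m: int, n: int, k: int) -> float:
    """Estimate the false positive probability of a Bloom filter.

    m: number of bits, n: number of recorded elements, k: number of
    hashing functions. Assumes uniform, independent hashing.
    """
    # Probability that a given bit is still 0 after n elements are recorded.
    still_zero = (1.0 - 1.0 / m) ** (k * n)
    # A false positive requires all k probed bits to hold a 1.
    return (1.0 - still_zero) ** k

small = false_positive_rate(m=16, n=2, k=3)    # the 16-element example
large = false_positive_rate(m=32, n=2, k=3)    # same data, doubled bit vector
fuller = false_positive_rate(m=16, n=8, k=3)   # same bit vector, more data
```

Under these assumptions, large is lower than small, consistent with tuning by increasing the size of the bit vector, and fuller is higher than small, since recording more elements into a fixed-size bit vector raises the estimate.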
Unfortunately, as more data is recorded into bit vector 104, the Bloom filter starts to fill up, and the number of false positives increases until the theoretical maximum false positive rate is reached based on the properties of the Bloom filter (i.e., m, n, and k). At this point, it is desirable to remove old data from the Bloom filter. However, the possibility of generating overlapping bit positions for different elements also makes it undesirable to remove a single element from the Bloom filter. For example, if one of the hashing functions generates bit position 5 for element a1 and one of the hashing functions also generates bit position 5 for element a4, there is no way to remove element a1 from the Bloom filter without also removing element a4, because clearing bit position 5 causes a subsequent query for element a4 to fail. Hence, removing elements from the Bloom filter in this manner increases the false negative rate.
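By way of illustration only, the false negative caused by removing a single element can be sketched in Python as follows. Only the shared bit position 5 comes from the example above; the remaining positions for elements a1 and a4, and the function names, are hypothetical.

```python
def record(bits, positions):
    """Set the bit at each generated position to 1."""
    for pos in positions:
        bits[pos] = 1

def naive_remove(bits, positions):
    """Clear the bit at each generated position (the problematic approach)."""
    for pos in positions:
        bits[pos] = 0

def maybe_observed(bits, positions):
    """Return True only if every generated position holds a 1."""
    return all(bits[pos] == 1 for pos in positions)

A1 = (5, 9, 13)   # hypothetical positions for a1; position 5 is shared
A4 = (1, 5, 10)   # hypothetical positions for a4; position 5 is shared

bits = [0] * 16
record(bits, A1)
record(bits, A4)
before = maybe_observed(bits, A4)   # a4 is found while both are recorded
naive_remove(bits, A1)              # clears positions 5, 9, and 13
after = maybe_observed(bits, A4)    # position 5 now holds a 0
```

In this sketch, before is true and after is false even though a4 was never removed, which is the false negative described above.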
The only reliable technique for removing elements from the Bloom filter is to clear the entire Bloom filter (i.e., set all elements of the Bloom filter to 0). Unfortunately, if the entire Bloom filter is cleared periodically, a gap in the data arises. For example, if a network operator determines that a particular source address was used in an attack, the network operator may desire to query the router to determine the source of packets used in the attack. However, if these packets were recorded just before the Bloom filter on the router was cleared, and the network operator queries the router after the Bloom filter on the router was cleared, the information about these packets is lost, and the system generates an incorrect response.
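By way of illustration only, the gap in the data that arises from periodically clearing the entire Bloom filter can be sketched in Python as follows. The class name, the method names, and the bit positions used for the attack packet are hypothetical; the positions stand in for the outputs of the hashing functions.

```python
class PeriodicallyClearedFilter:
    """A bit vector that is reset in its entirety to remove old data."""

    def __init__(self, m: int) -> None:
        self.bits = [0] * m

    def record(self, positions) -> None:
        """Set the bit at each generated position to 1."""
        for pos in positions:
            self.bits[pos] = 1

    def clear(self) -> None:
        """Set all elements of the bit vector to 0."""
        self.bits = [0] * len(self.bits)

    def maybe_observed(self, positions) -> bool:
        """Return True only if every generated position holds a 1."""
        return all(self.bits[pos] == 1 for pos in positions)

ATTACK_PACKET = (3, 8, 14)   # hypothetical positions for an attack packet

router_filter = PeriodicallyClearedFilter(m=16)
router_filter.record(ATTACK_PACKET)                             # recorded just before the clear
seen_before_clear = router_filter.maybe_observed(ATTACK_PACKET)
router_filter.clear()                                           # periodic clear
seen_after_clear = router_filter.maybe_observed(ATTACK_PACKET)  # operator queries too late
```

In this sketch, seen_before_clear is true and seen_after_clear is false, so a query made after the clear reports that the packet was never observed.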
Hence, what is needed is a method and an apparatus for determining whether a data element was observed in a data stream without the problems described above.