The need to determine set membership is encountered in many computer applications. For example, in many cases it is desired to determine whether an element, x, is a member of a set, Y, wherein x and Y are each provided by a domain, D. If the elements of D have a simple representation and if Y is small, a simple approach to set membership testing can be taken. Namely, list all representations of elements of Y in an array, A, having length |Y| and then, given an element x∈D, compare the representation of x against every entry of A. Unfortunately, testing set membership using this simple method is inefficient in a variety of situations, particularly when Y is very large. For this reason, the set membership problem is often solved by first querying a filter. A filter is a mathematical object that can be queried with an element, returning an indication of either maybe or no. Maybe is interpreted by the user as a possible presence of the element x in the set Y and no is interpreted as definite absence of the element x in the set Y. Observe that by contrast, in the simple approach, the indication returned is either positive or negative. With the simple approach, a positive indication is interpreted as a definite presence of the element x in the set Y and a negative indication is interpreted as a definite absence of the element x in the set Y. Unlike the simple approach, however, a filter admits false positives. As a result, when using a filter, a secondary test (such as the simple approach, for example) is sometimes used to further investigate the elements which returned an indication of maybe.
The purpose of a filter, therefore, is to provide an efficient primary test for set membership. The amount of space used to store a filter for Y is ideally far less than the space necessary to store Y, and the time required to query a filter is ideally far less than the time required to query Y, even in the case where Y has some natural order and intelligent search methods can be used. The trade-off for the decrease in time and space is that, as mentioned previously, the answers returned by the filter are imprecise. In some instances, an element that passes the filter may require a costly secondary test, but in many instances this secondary test is unnecessary.
Bloom Filter
An example of a well-known filter for probabilistic set membership testing is the Bloom filter. Use of a Bloom filter generally works as follows. Let D be any set (the domain), let Y⊂D with m=|Y|, and let the memory available for the Bloom filter BY be n bits. Next, a hash function, h, is selected that maps the elements of D uniformly at random into the range [0, n−1], i.e. onto the n bit indices of BY. All of the bits of BY are initialized to 0. Next, each element y∈Y is stored into BY. To store an element y, set the bit at index h(y) to 1, i.e., BY[h(y)]=1. To store all of Y, store each of the elements of Y in turn.
Once the filter is built, the filter may be queried. To query the filter BY with an element x∈D, check if the bit at index h(x) is set to 1. If so, the filter provides a maybe indication. If the bit is set to 0, then x∉Y and the filter produces an indication no.
An example of a more typical Bloom filter provides several hash functions h1 . . . hk. To store an element y∈Y, each hash function h1(y) . . . hk(y) is computed, thereby converting the element y to an array of positions. The bits of BY are then set to 1 at each of the positions in the filter associated with the array. To query the typical Bloom filter with x∈D, each hash function h1(x) . . . hk(x) is computed, thereby converting the element x to an array of positions. The bits of BY at all of the positions in the filter associated with the array are checked. If any of the bits at the associated positions is set to 0, the filter provides an indication of no and x is rejected; i.e., it is determined that x∉Y. Otherwise, if all of the bits at the associated positions are set to 1, the filter provides an indication of maybe.
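The store and query steps described above can be sketched in Python. This is a minimal illustration, not a definitive implementation: the class name BloomFilter, the use of salted SHA-256 digests as a stand-in for the hash functions h1 . . . hk, and the one-byte-per-bit storage are all assumptions made for readability.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter with n bits and k hash functions."""

    def __init__(self, n, k):
        self.n = n
        self.k = k
        # One byte per bit, chosen for clarity rather than compactness.
        self.bits = bytearray(n)

    def _positions(self, item):
        # Assumption: salting SHA-256 with the hash index i stands in for
        # k independent hash functions that map uniformly into 0..n-1.
        positions = []
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.n)
        return positions

    def add(self, item):
        # Store: set the bit at every position computed for the element.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def query(self, item):
        # "no" is definite; "maybe" can be a false positive.
        if all(self.bits[pos] for pos in self._positions(item)):
            return "maybe"
        return "no"


Y = ["alice", "bob", "carol"]
bloom = BloomFilter(n=64, k=3)
for y in Y:
    bloom.add(y)
```

Every stored element is guaranteed to return maybe, because storing it set all of its positions to 1. An element outside Y returns no unless all k of its positions happen to have been set by other elements.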
In order for this typical Bloom filter to work effectively, n and k must be chosen appropriately for a given m. If n is too small or k is too large, most or all of the bits of BY could become set to 1. This means the filter will rarely reject when queried, i.e. the filter will almost always provide an indication of maybe, resulting in a high number of false positives. If the number of false positives becomes too high, the filter BY will be rendered useless.
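The effect of a poorly chosen k can be illustrated numerically. The sketch below relies on a standard approximation that is not stated in the text above: if the hash functions behave independently and uniformly, the expected fraction of bits set after storing m elements is 1 − (1 − 1/n)^(km), and the false positive rate is roughly that fraction raised to the power k. The function names are hypothetical.

```python
def expected_fill(n, m, k):
    """Expected fraction of the n bits set to 1 after storing m elements
    with k hash functions (assumes independent, uniform hashing)."""
    return 1.0 - (1.0 - 1.0 / n) ** (k * m)


def approx_false_positive_rate(n, m, k):
    """Approximate probability that all k probed bits are 1 for a non-member."""
    return expected_fill(n, m, k) ** k


# Fixed memory (n = 1000 bits) and load (m = 100 elements), varying k.
for k in (1, 7, 50):
    fill = expected_fill(1000, 100, k)
    rate = approx_false_positive_rate(1000, 100, k)
    print(f"k={k:2d}  fill={fill:.3f}  approx. false positive rate={rate:.4f}")
```

Under this approximation a moderate k gives a much lower false positive rate than either k=1 or a very large k, for which nearly every bit ends up set to 1.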
When measuring how well a particular filter construction performs, a distinction is drawn between the filter construction algorithm and a particular filter instance output by the algorithm. For example, given a filter instance F, the false positive rate is
$$p(F) = P[\,F(x) = \text{maybe} \mid x \in D \setminus Y\,].$$
In other words, the false positive rate p(F) of F is the probability that F passes an element erroneously. For certain inputs, a filter construction algorithm might output a specific filter instance F with a higher or lower false positive rate. To measure the quality of a filter construction, it is necessary to compute an appropriate average of the false positive rate of filter instances. For a given input set Y, let F(Y) be the filter instance, and define the false positive rate of the filter construction under load m to be
$$p = E[\,p(F(Y)) \mid Y \subset D,\ |Y| = m\,].$$
That is, the false positive rate of a filter construction is the probability that a filter instance F(Y), built from a uniform random input of size m from the domain D, erroneously accepts an element from D chosen uniformly at random.
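The difference between the false positive rate of one instance and that of a construction can be made concrete with a small simulation. The sketch below is an illustration under stated assumptions: it uses the single-hash filter described earlier, a tiny integer domain, a freshly drawn random lookup table as the hash function in each trial, and hypothetical function names.

```python
import random


def build_filter(Y, n, h):
    """Build a single-hash filter instance: set bit h(y) for each y in Y."""
    bits = [0] * n
    for y in Y:
        bits[h(y)] = 1
    return bits


def instance_fpr(bits, Y, domain, h):
    """p(F): fraction of non-members of Y that this instance passes."""
    non_members = [x for x in domain if x not in Y]
    passed = sum(bits[h(x)] for x in non_members)
    return passed / len(non_members)


def construction_fpr(domain_size, m, n, trials, seed=0):
    """Average p(F(Y)) over random input sets Y of size m."""
    rng = random.Random(seed)
    domain = range(domain_size)
    total = 0.0
    for _ in range(trials):
        # Assumption: a random table plays the role of a uniform hash function.
        table = [rng.randrange(n) for _ in domain]
        h = table.__getitem__
        Y = set(rng.sample(domain, m))
        total += instance_fpr(build_filter(Y, n, h), Y, domain, h)
    return total / trials


estimate = construction_fpr(domain_size=2000, m=50, n=400, trials=20)
print(f"estimated construction false positive rate: {estimate:.3f}")
```

Individual trials yield different values of p(F), and averaging them estimates the rate attributed to the construction as a whole.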
When comparing filter constructions, it is standard to assume that |Y|<<|D|; that is, that the number of input elements |Y| is insignificant compared to the total number of elements |D|. Further, it is standard to assume that elements are queried from D uniformly at random. This is the case in essentially all applications, and it simplifies the computation of the false positive rate, since the false positive rate becomes simply the positive rate. In addition, it is assumed that the memory available for each element is far less than the memory needed to represent that element perfectly, which is typically assumed to be infinite. This avoids spurious degenerate situations that complicate an analysis, such as having enough memory to simply store Y, resulting in a false positive rate of 0.
If the filter is given a lot of memory, it should have a low false positive rate. So, it is necessary to measure not just the false positive rate of a filter construction, but also its efficiency, i.e. how well a filter uses the memory available to it. Information theory provides bounds on how effectively a filter can perform, in terms of how much memory the filter uses for a desired false positive rate on a given input set. Given a filter with false positive rate p, n bits of memory, and m=|Y|, the well-known measure of efficiency, ε, is provided by the following formula:
$$\varepsilon = \frac{-\log_2 p}{n/m},$$
where ε≦1, which represents the information-theoretic limit.
The numerator of ε measures the bits of cut-down. For example, if the filter has a false positive rate of ⅛ then it has 3 bits of cut-down. The denominator, in turn, is the number of bits of memory available to represent each item in the filter. Intuitively, some number of bits are used to specify each element y∈Y. For example, if there are n=3 bits available to the filter and there are m=6 elements of Y, the filter has half a bit of information available to store each y. From an information-theoretic perspective, it is reasonable to conjecture that the maximum possible cut-down of such a filter is half a bit, for a false positive rate of 2^(−1/2).
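The efficiency measure and the two numerical examples above can be checked with a few lines of Python. The function names are hypothetical and the snippet is only a worked illustration of the formula.

```python
import math


def cut_down_bits(p):
    """Numerator of the efficiency: bits of cut-down for false positive rate p."""
    return -math.log2(p)


def efficiency(p, n, m):
    """Efficiency = -log2(p) / (n / m) for n bits of memory and m stored elements."""
    return cut_down_bits(p) / (n / m)


# A false positive rate of 1/8 corresponds to 3 bits of cut-down.
print(cut_down_bits(1 / 8))

# n = 3 bits and m = 6 elements leave half a bit per element; a filter with
# false positive rate 2^(-1/2) would then sit at the information-theoretic limit.
print(efficiency(2 ** -0.5, n=3, m=6))
```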
In practice, there are several important factors to consider when choosing a filter construction. In some cases, the one-time work of filter construction may need to be done quickly. In other cases, the every-time work of querying the filter may need to be done quickly. In still other cases, there is a need to provide a very efficient filter; i.e. one which minimizes the size of the filter for a given false positive rate. For these situations, it is noted that even if the memory required for such filters (for example, filters used for virus definitions or malicious website blacklists) is relatively small, a small reduction in memory (e.g. a few megabytes) can save millions of megabytes of bandwidth when the filter is provided to millions of users. In yet other cases, there is a need to minimize the false positive rate for a given amount of memory.
Bloom filters can achieve an efficiency of at most ln 2 (ln 2≈0.693). Thus, Bloom filters are not very close to achieving the information-theoretic limit of 1 and are therefore not very memory efficient. Although Bloom filters require little time to construct and little time to query, their achieved efficiency is limited. Compressed Bloom filters have therefore been utilized to improve the efficiency of a traditional Bloom filter. These compressed Bloom filters function similarly to traditional Bloom filters; however, once built, the filter is compressed in order to reduce the memory necessary for storage of the filter. Although these compressed Bloom filters come closer to achieving the information-theoretic limit than traditional Bloom filters, because the filter must be compressed after building and must be decompressed during the query phase, the improved efficiency achieved by these compressed Bloom filters comes at the expense of both more one-time work and more every-time work. Other methods, such as the method proposed by Pagh in “An Optimal Bloom Filter Replacement” or the method proposed in “An Optimal Bloom Filter Replacement Based on Matrix Solving”, have been devised in an attempt to provide a more efficient filter. Each of these methods, however, saves memory by sacrificing query time.
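The ln 2 figure can be traced with a short derivation. It relies on the standard approximation of the Bloom filter false positive rate, which assumes that the k hash functions set bits independently and uniformly; that approximation is not stated in the text above and is used here only as a sketch of where the constant comes from.

```latex
\begin{align*}
p &\approx \left(1 - e^{-km/n}\right)^{k}
  && \text{(approximate false positive rate)} \\
k^{*} &= \frac{n}{m}\,\ln 2
  && \text{(value of $k$ that minimizes $p$)} \\
p^{*} &\approx \left(\tfrac{1}{2}\right)^{k^{*}} = 2^{-(n/m)\ln 2}
  && \text{(half of the bits are set at the optimum)} \\
\varepsilon &= \frac{-\log_2 p^{*}}{n/m}
  = \frac{(n/m)\ln 2}{n/m} = \ln 2 \approx 0.693
\end{align*}
```

Any other choice of k gives a larger p for the same n and m, and hence a lower efficiency.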
Satisfiability
Finite domain constraint satisfaction problems have been used in various applications. Constraint satisfaction problems (aka SAT instances) are encoded as conjunctions of Boolean equations. For example, a SAT instance may be expressed as follows:
χ = C1 ∧ . . . ∧ Cm,
where the symbol ∧ represents logical conjunction (AND) and each Ci, 1≦i≦m, is described by a Boolean function B, i.e. an expression of the form
B(li,1, . . . , li,ki),
where each l is a literal, i.e., a Boolean variable or its negation (NOT). The width of the equation Ci is k if Ci has exactly k distinct literals and no pair of literals is complementary. A pair of literals is said to be complementary if both are the same variable but have different signs, i.e., xi and its negation x̄i are complementary literals. Specifically,
$$\overline{l_i} = \begin{cases} \overline{x_i} & \text{if } l_i = x_i \\ x_i & \text{if } l_i = \overline{x_i} \end{cases}$$
An assignment v is a function from the set of variables Vars (|Vars|=n) into the set Bool, i.e. {0, 1}. An assignment v satisfies a positive literal xi if v(xi)=1 and v satisfies a negated literal x̄i if v(xi)=0. An assignment v satisfies an equation Ci if B(v(li,1), . . . , v(li,ki))=1 and satisfies a SAT instance, χ = C1 ∧ . . . ∧ Cm, if v satisfies all Ci, 1≦i≦m. A satisfying assignment for χ is also called a solution. To give an example using a concrete Boolean function, if B is an expression of the form
li,1 ∨ . . . ∨ li,ki,
where the symbol ∨ represents logical disjunction (OR), then an assignment v satisfies Ci if for some j, 1≦j≦ki, v(li,j)=1. A random k-SAT instance is a conjunction of equations drawn uniformly, independently, and with replacement from the set of all width k equations.
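The definitions of assignment and satisfaction can be illustrated for the disjunctive case in Python. The encoding is an assumption made for this sketch: a literal is a nonzero integer, with +i standing for the variable xi and −i for its negation, an equation is a tuple of literals, and an assignment is a dictionary from variable index to 0 or 1. The function names are hypothetical.

```python
def literal_value(lit, assignment):
    """Value of a literal under an assignment: v(xi) for +i, 1 - v(xi) for -i."""
    value = assignment[abs(lit)]
    return value if lit > 0 else 1 - value


def satisfies_equation(equation, assignment):
    """Disjunction: satisfied if at least one literal evaluates to 1."""
    return any(literal_value(lit, assignment) == 1 for lit in equation)


def satisfies_instance(instance, assignment):
    """Conjunction: satisfied only if every equation is satisfied."""
    return all(satisfies_equation(eq, assignment) for eq in instance)


# Instance: (x1 OR NOT x2 OR x3) AND (NOT x1 OR x2 OR x3)
chi = [(1, -2, 3), (-1, 2, 3)]

v_solution = {1: 1, 2: 1, 3: 0}      # satisfies both equations
v_non_solution = {1: 1, 2: 0, 3: 0}  # falsifies the second equation

print(satisfies_instance(chi, v_solution))
print(satisfies_instance(chi, v_non_solution))
```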
The equation Ci can be thought of as a constraint on a putative solution. Therefore, a collection of equations, i.e. a SAT instance χ, can be thought of as a conjunction of constraints on a putative solution. Given a random k-SAT instance χ, the strength of χ (as a conjunction of constraints) can be measured in terms of the ratio αχ=m/n. The strength of each constraint indicates how easy or difficult the constraint is to satisfy. The strength of each constraint depends only on its length, k. Thus, intuitively, constraints of equal length represent the same “strength”. Random k-SAT instances exhibit quite regular behavior in terms of the equations-to-variables ratio (i.e. m/n). This ratio determines with high probability the satisfiability of the set of equations drawn. Specifically, given a fixed k there exists a number αk such that whenever αχ<αk then χ is almost certainly satisfiable, and whenever αχ>αk then χ is almost certainly unsatisfiable. Thus, αk provides a threshold which defines the boundary between satisfiable and unsatisfiable instances. The ratio m/n can be selected, therefore, to ensure that the instance is satisfiable.
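The dependence of satisfiability on the ratio m/n can be observed on small instances. The following Python sketch draws random k-SAT instances with disjunctive equations and checks them by brute force. It is an illustration only: the integer literal encoding (+i for xi, −i for its negation), the function names, and the exhaustive search are assumptions of the sketch, and with so few variables the transition is far less sharp than the asymptotic threshold behavior described above.

```python
import itertools
import random


def random_ksat(n, m, k, rng):
    """Draw m width-k equations over variables 1..n, independently and
    with replacement."""
    instance = []
    for _ in range(m):
        # k distinct variables guarantee k distinct literals and no
        # complementary pair; each variable gets a random sign.
        variables = rng.sample(range(1, n + 1), k)
        instance.append(tuple(v if rng.random() < 0.5 else -v for v in variables))
    return instance


def is_satisfiable(instance, n):
    """Exhaustive search over all 2**n assignments (small n only)."""
    for bits in itertools.product((0, 1), repeat=n):
        if all(any((bits[abs(l) - 1] == 1) == (l > 0) for l in eq)
               for eq in instance):
            return True
    return False


def satisfiable_fraction(n, m, k, trials, seed=0):
    """Fraction of random instances with ratio m/n that have a solution."""
    rng = random.Random(seed)
    hits = sum(is_satisfiable(random_ksat(n, m, k, rng), n) for _ in range(trials))
    return hits / trials


for ratio in (1, 3, 5, 10):
    fraction = satisfiable_fraction(n=10, m=10 * ratio, k=3, trials=10)
    print(f"m/n={ratio:2d}  satisfiable fraction={fraction:.2f}")
```

The brute-force check costs on the order of 2^n evaluations per instance, so it is suitable only for demonstration; it is not a practical SAT solver.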
FIG. 1 provides a graph illustrating the relationship between the ratio m/n and the probability of solving a SAT instance. The ratio αχ is provided on the horizontal axis of the graph and the probability of solving a SAT instance is plotted on the vertical axis of the graph. As illustrated, with m/n near 0, the probability of solving a SAT instance is high. As m/n increases, the probability of solving a SAT instance remains high until a threshold, αk, is reached. Once m/n exceeds this threshold, the probability of solving a SAT instance is greatly reduced, i.e. SAT instances transition from satisfiable to unsatisfiable. Thus, the threshold αk defines a satisfiable region for those instances where αχ <αk, an unsatisfiable region where αχ>αk, and a transition region which lies between the satisfiable region and the unsatisfiable region.
In addition to these theoretical results that prove the bound on the growth of αk but do not provide its closed form, experimental results have established values of αk for some specific B and small values of k. For example, when the Boolean function B is disjunction, as in the example above, the following values for αk provided in Table 1 have been determined for each of the following values of k.
TABLE 1
Random k-SAT phase transition for various k.

k     1    2    3      4      5       6       7
αk    0    1    4.26   9.93   21.11   43.37   87.79
Thus, as k increases, the threshold αk increases, so that a larger number of equations per variable can be satisfied. As noted in the table above, for example, for k=3 the threshold is reached when the ratio m/n reaches approximately 4.26.
There is a need, therefore, for a filter that can be queried quickly like the Bloom filter but which provides greater efficiency than the Bloom filter or the compressed Bloom filter. Although finite domain constraint satisfaction problems provide methods for determining whether the variables of a given Boolean formula can be assigned in such a way as to satisfy the formula, they do not provide a filter which allows set membership to be determined.