1. Field of the Invention
The embodiments of the invention provide an adaptive greedy method for fast list intersection via sampling.
2. Description of the Related Art
Within this application several publications are referenced by Arabic numerals within brackets. Full citations for these, and other, publications may be found at the end of the specification immediately preceding the claims. The disclosures of all these publications in their entireties are hereby expressly incorporated by reference into the present application for the purposes of indicating the background of the present invention and illustrating the state of the art.
Correlation is a persistent problem for the query processors of database systems. Over the years, many have observed that the standard System-R assumption of independent attribute-value selections does not hold in practice, and have proposed various techniques towards addressing this (e.g., [8]).
Nevertheless, query optimization is still an unsolved problem when the data is correlated, for two reasons. First, the multidimensional histograms and other synopsis structures used to store correlation statistics have a combinatorial explosion with the number of columns, and so are very expensive to construct as well as maintain. Second, even if the correlation statistics were available, using correlation information in a correct way requires the optimizer to do an expensive numerical procedure that optimizes for maximum entropy [10]. As a result, most database implementations still rely heavily on independence assumptions.
One area where correlation is particularly problematic is for semijoin operations that are used to answer conjunctive queries over large databases. In these operations, one separately computes the set of objects matching each predicate, and then intersects these sets to find the objects matching the conjunction. Examples of star joins, scans in column stores, and a key word search are provided.
In regards to star joins, the following query analyzes coffee sales in California by joining a fact table Orders with multiple dimension tables:                SELECT S.city, SUM(O.quantity), COUNT(E.name)        FROM orders O, cust C, store S, product P, employee E        WHERE O.cId=C.id and O.sId=S.id and O.pId=P.id and O.empId=e.id C.age=65, S.state=CA, P.type=COFFEE, E.type=TEMP        GROUP BY S.city        
Many DBMSs would answer this query by first intersecting 4 lists of row ids (RIDs), each built using a corresponding index:L1={Orders.id|Orders.cId=Cust.id,Cust.age=65},L2={Orders.id|Orders.sId=Store.id,Store.state=CA}, . . . ;and then fetching and aggregating the rows corresponding to the RIDs in L1∩L2∩ . . . .
In regards to scans in column stores, recently there has been a spurt of interest in column stores (e.g, [15]). These would store a schema like the above as a denormalized “universal relation”, decomposed into separate columns for type, state, age, quantity, and so on. A column store does not store a RID with these decomposed columns; the columns are all sorted by RID, so the RID for a value is indicated by its position in the column. To answer the previous example query, a column store will use its columns to find the list of matching RIDs for each predicate, and then intersect the RID-lists.
In regards to the keyword search, a query for (“query” and (“optimisation” or “optimization”)) is provided against a search engine. It is typically processed as follows. First, each keyword is separately looked up in an (inverted list) index to find 3 lists Lquery, Loptimisation, and Loptimization of matching document ids, and the second and third lists are merged into one sorted list. Next, the two remaining lists are intersected and the ids are used to fetch URLs and document summaries for display.
The intersection is often done via an AND-tree, a binary tree whose leaves are the input lists and whose internal nodes represent intersection operators. The performance of this intersection depends on the ordering of the lists within the tree. Intuitively, it is more efficient to form smaller intersections early in the tree, by intersecting together smaller lists or lists that have fewer elements in common.
Correlation is problematic for this intersection because the intersection sizes can no longer be estimated by multiplying together the selectivities of individual predicates.
The most common implementation of list intersection in data warehouses, column stores, and search engines, uses left-deep AND-trees where the k input lists L1, L2, . . . Lk are arranged by increasing (estimated) size from bottom to top (in the tree). The intuition is to form smaller intersections earlier in the tree. However, this method may perform poorly when the predicates are correlated, because a pair of large lists may have a smaller intersection than a pair of small lists. Correlation is a well known problem in databases and there is empirical evidence that correlation can result in cardinality estimates being wrong by many orders of magnitude, see e.g. [14, 8].
An alternative implementation proposed by Demaine et al [5] is a round-robin intersection that works on sorted lists. It starts with an element from one list, and looks for a match in the next list. If none is found, it continues in a roundrobin fashion, with the next higher element from this second list. This is an extension to k lists of a comparison-based process that computes the intersection of two lists via an alternating sequence of doubling searches.
Neither of these two solutions is really satisfying. The first is obviously vulnerable to correlations. The second is guaranteed to be no worse than a factor of k from the best possible intersection (informally, because the method operates in round-robin fashion, once in k tries it has to find a good list). But in many common inputs it actually performs a factor k worse than a naive left-deep AND-tree. For example, suppose the predicates were completely independent and selected rows with probabilities p1≦p2≦ . . . ≦pk, and suppose further that {pj} forms (or is dominated by) a geometric sequence bounded by say ½. For a domain with N elements, an AND-tree that orders the lists by increasing size would take time O(N(p1+p1p2+ . . . +p1p2 . . . pk−1))=o(p1n), while the round-robin intersection would take time proportional to
      N    ⁢                  ⁢          k      /              (                              1                          p              1                                +          ⋯          +                      1                          p              k                                      )              =      Ω    ⁢                  ⁢                  (                  k          ⁢                                          ⁢                      p            1                    ⁢          N                )            .      This behavior was also experimentally observed in [6].
The round-robin method also has two practical limitations. First, it performs simultaneous random accesses to k lists. Second, these accesses are inherently serial and thus have to be low-latency operations. In contrast, a left-deep AND-tree accesses only two lists at a time, and a straightforward implementation of it requires random accesses to only one list. Even here, a considerable speedup is possible by dispatching a large batch of random accesses in parallel. This is especially useful when the lists are stored on a disk-array, or at remote data sources.
Tree-based RID-list intersection has been used in query processors for a long time. Among the earliest to use the greedy method of ordering by list size was [11], who proposed the use of an AND-tree for accessing a single table using multiple indexes.
Round-robin intersection methods first arose in the context of AND queries in search engines. Demaine et al [5] introduced and analyzed a round-robin set-intersection method that is based on a sequence of doubling searches. Subsequently, Barbay et al [2] have generalized the analysis of this method to a different cost-model. Heuristic improvements of this method were studied experimentally on Google query logs in [6, 3]. A probabilistic version of this round-robin method was used by Raman et al [13] for RID-list intersection.
In XML databases, RID-list intersection is used in finding all the matching occurrences for a twig pattern, whose selection predicates apply to multiple elements related by an XML tree structure. [4] proposed a holistic twig join method, TwigStack, for matching an XML twig pattern. IBM's DB2 XML has implemented a similar method for its XANDOR operator [1]. TwigStack is similar to round-robin intersection, navigating around the legs for results matching a pattern.
A related but different problem, studied by Munagala et al [12], is the pipelined filters problem, which is a variant of the Min-Sum Set-Cover (MSSC). In this variant, a single list L0 is given as the “stream” from which tuples are being generated. Predicates are evaluated by scanning this stream, so they can be treated as lists that support only a contains( ) interface that runs in O(1) time. The job of the pipelined filters method is to choose an ordering of these other lists. [12] apply previous work by Feige, Lovász and Tetali [7] on MSSC, by treating the complements of these lists as sets in a set covering. They show that the greedy set cover heuristic is a 4-approximation for this problem, and also study the online case (where L0 is a stream of unknown tuples).
The difference between this problem and the general list intersection problem is that a method for pipelined filters is restricted to use a particular L0, and apply the other predicates via contains( ) only. Hence, every method has to inspect every element in the universe at least once. In this context, this would be no better than doing a table scan on the entire fact table, and applying the predicates on each row. Another difference is in the access to the lists—this setting accommodates sampling and hence estimation of (certain) conditional selectivities, which is not possible in the online (streaming) scenario of [12], where it would correspond to sampling from future tuples. Finally, the pipeline of filters corresponds to a left-deep AND-tree, while this model allows arbitrary AND-trees; for example, one can form separate lists for say age=65 and type=COFFEE, and intersect them, rather than applying each of these predicates one by one on a possibly much larger list.