1. Field
The invention relates to the field of data processing, and more particularly to a program product, method, and system for computing the set intersection of a first and a second unordered set of discrete members utilizing an acceleration unit. The present invention may be used for performing query processing in a database.
2. General Background
This invention focuses on the problem of computing set intersection. That is, given two sets of numeric identifiers, the task is to compute a third set that contains all identifiers that occur in both the first and the second set.
Set intersection is a basic problem in computer science and has a variety of applications. For example, Index-ANDing in database query evaluation calculates the set intersection of property index tables. Hit list joining in email archiving is another instance of the set intersection problem. It is used by IBM® Enterprise Content Management (ECM) software, for example, to check access rights of users against access rights of documents. (IBM is a registered trademark of International Business Machines Corporation.) The long run times of the current software implementation limit the scalability of the software in terms of the number of concurrent users.
There are two established classes of methods for computing set intersection. For this discussion, no order among the members of the first, second, or third set is assumed. It is further assumed that n denotes the combined number of elements in the first and second input sets (S1 and S2, respectively), i.e., n=|S1|+|S2|.
Sort-merge based methods generally proceed in two phases. The first and second sets are sorted, and then a linear pass is made over both the sorted first set and the sorted second set, and the sets are merged into a third set that contains the (sorted) set intersection result. The runtime performance of sort-merge based methods is in the order of O(n log n), due to the sorting phase. The method can be implemented in-place, i.e., no additional memory is required for auxiliary data structures. The third set can be output as it is being computed, so no additional memory is required for buffering the output. In database query processing, the sort-merge-join is an operator from this class of methods.
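The two phases described above can be illustrated with a minimal sketch in Python. The function name intersect_sort_merge and the use of Python lists to hold the sets are assumptions made for illustration only. The sketch further assumes that neither input contains duplicate members.

```python
def intersect_sort_merge(s1, s2):
    """Yield the sorted intersection of two duplicate-free lists.

    Illustrative sketch only; the name and the list representation
    are assumptions, not part of the described methods.
    """
    # Phase 1: sort both inputs. list.sort() reorders the list itself,
    # although the interpreter may use a temporary buffer internally;
    # a strictly in-place sort (e.g. heapsort) could be substituted.
    s1.sort()
    s2.sort()

    # Phase 2: one linear pass over both sorted inputs.
    i = j = 0
    while i < len(s1) and j < len(s2):
        if s1[i] < s2[j]:
            i += 1          # s1[i] cannot occur in the rest of s2
        elif s1[i] > s2[j]:
            j += 1          # s2[j] cannot occur in the rest of s1
        else:
            yield s1[i]     # common member: emit without buffering
            i += 1
            j += 1
```

Because the function is a generator, each common member is emitted as soon as it is found, which mirrors the observation that the third set need not be buffered.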
Hash based methods or Bloom filter based methods employ a hash function that maps input values to a fixed interval of output values. The values computed by the hash function are treated as addresses to slots in a hash table. With Bloom filter based methods, the output values of the hash function address bits in a bit vector. Since both approaches are similar, only the case of hash based methods is discussed.
This class of methods generally proceeds in two phases. In the build phase, the hash function is applied to all members of the first set, and entries are made into the respective slots of the hash table. In the probe phase, the hash function is applied to all members of the second set, and the respective slots in the hash table are looked up, which requires a case distinction. If the respective slot in the hash table is empty, then the member is only part of the second set, but not the first set. It is therefore not part of the set intersection. If the respective slot of the hash table is not empty, then the member may be part of the set intersection output. Since hash functions can produce collisions, i.e., different input values may be mapped onto the same output value, collision resolution is required to determine whether the member from the second set is indeed identical to a member of the first set. This requires maintaining a mapping between entries in the hash table and members of the first set. Collision resolution thus determines whether the member of the second set is indeed part of the intersection output, i.e., is a true positive, or not, i.e., is a false positive. The average runtime performance of this class of methods is linear in the size of the input, yet the runtime may deteriorate to O(n*n) in the worst case, when the hash function mainly produces false positives. Regarding memory consumption, hash-based methods cannot be implemented in-place, as they require additional data structures. In particular, a data structure representing the hash table and means for providing the mapping from entries in the hash table back to the original values for collision resolution are required. In database query processing, hash-join operators or Bloom filters belong to this class of methods.
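The build phase, the probe phase, and collision resolution can be illustrated with a minimal sketch in Python. The function name intersect_hash, the parameter num_slots, the modulo hash function, and the use of per-slot lists (chaining) for collision resolution are all assumptions chosen for illustration; other hash functions and collision resolution schemes are equally possible.

```python
def intersect_hash(s1, s2, num_slots=8):
    """Return the members of s2 that also occur in s1.

    Illustrative sketch only; the hash function (x % num_slots) and
    chaining per slot are assumptions, not a prescribed design.
    """
    # Build phase: hash every member of the first set into a slot.
    # Each slot keeps the original values so that collisions can be
    # resolved later (the mapping back to members of the first set).
    table = [[] for _ in range(num_slots)]
    for x in s1:
        table[x % num_slots].append(x)

    # Probe phase: hash every member of the second set and look up
    # the respective slot.
    result = []
    for y in s2:
        slot = table[y % num_slots]
        if not slot:
            continue            # empty slot: y is not in the first set
        if y in slot:           # collision resolution: compare values
            result.append(y)    # true positive
        # otherwise y only shares a slot with other values (false positive)
    return result
```

With num_slots=8, the values 1, 9, and 17 all map to slot 1, so probing with 17 against a table built from 1 and 9 finds a non-empty slot and is rejected only by the value comparison. The auxiliary list of slots is the additional data structure that prevents an in-place implementation.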
The discussed solutions are established software solutions for computing set intersection. Since they are CPU-intensive, performance may be improved by offloading this computation onto an accelerator board, such as an FPGA board attached to a PCI Express bus. In a basic scenario, an accelerator board is attached to a host computer via a communication bus. The accelerator board contains accelerator hardware, and typically some on-board memory such as banks of SRAM and DRAM. Typically, the on-board memory forms a memory hierarchy, for instance where SRAM is smaller in capacity but faster for random access, and DRAM is larger in capacity but has longer random access times.
The optimization goals in this scenario are runtime efficiency and memory efficiency. Optimizing runtime efficiency reduces the runtime of the system. Optimizing memory efficiency makes parsimonious use of the available memory resources, as this directly affects the size of the input that can be offloaded to the accelerator board. Memory efficiency comes into play since offloading scenarios operate in a three-phase approach: first sending all required data onto the accelerator board, then performing the computation on the board, and finally sending data back from the board to the host computer.
The on-board memory is limited, and fetching additional data during the computation phase can be prohibitively expensive in terms of communication latency. This implies that the capacity of the on-board memory is to be leveraged to the fullest, to maximize the size of the input that can be handled.
With these goals in mind, naively transferring the existing software solutions to accelerator boards has the following drawbacks. The sort-merge based approach has a high runtime complexity when compared to the average runtime of hash-based approaches. The hash-based approaches, on the other hand, cannot be implemented in-place, and are thus not as efficient in their memory usage as the sort-merge based approach.
In the Patent Publication U.S. Pat. No. 7,720,806 B2, “SYSTEMS AND METHODS FOR DATA MANIPULATION USING MULTIPLE STORAGE FORMATS” by Piedmonte, systems and methods for storing and accessing data are disclosed. Algebraic relations may be composed that each defines a result equal to a requested data set. The algebraic relations may reference other data sets in storage. Some of the data sets may contain the same logical data stored in different physical formats and/or in different locations in the data store. One of the algebraic relations may be selected for use in providing the requested data set based, at least in part, on the physical format and/or locations of the data sets referenced in the algebraic relations. In other examples, algebraic relations may be selected based, at least in part, on the speed and available bandwidth of the channel(s) used to retrieve data sets referenced in the algebraic relation. Functions may be used to calculate the algebraic relation using the data sets retrieved from storage. The functions may be specifically formatted based on the physical formats of the data sets provided as operands to the functions. Example embodiments may include a data store for storing data sets, a data set information store for storing information regarding the data sets, an algebraic relation store for storing algebraic relations between data sets, an optimizer for using the algebraic relations to optimize storage and access of data sets from the data store and a set processor for calculating algebraic relations to provide data sets. In example embodiments, modules may be provided by a combination of hardware, firmware and/or software and may use parallel processing and distributed storage in some example embodiments.