Input and output (I/O) limitations can be a critical factor in achieving acceptable system performance in many large-scale applications, such as those arising in VLSI and CAD design, spatial databases, and geographic information systems. Even though CPU speeds have, over the past decade or so, increased at an annual rate of 40-60%, disk transfer rates have increased only 7-10% annually. Increases in internal memory sizes have not kept pace with these applications, which generate enormous amounts of data.
The data such applications produce will therefore not always fit in internal memory, and secondary (external) memory is needed to accommodate it. The successful realization of any data model in a large-scale database requires supporting its language features with efficient secondary-storage manipulation. Along with retrieval, the join is one of the most I/O-intensive operations in database systems. Because the join is also one of the fundamental relational database query operations, a considerable amount of research has been undertaken in an attempt to develop I/O-efficient join techniques in the relational database model.
The join operation facilitates the retrieval of information from two different sets based on the Cartesian product of the two sets. Specifically, the join operation is used to combine tuples (rows of information) from two or more sets based on common information. Efficient implementation of the join operation is difficult since no predefined links between the sets are required to exist.
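By way of illustration, the join as a filtered Cartesian product can be sketched as follows. This is a minimal nested-loop sketch; the relation contents and the attribute positions are illustrative, not taken from the source.

```python
# A minimal sketch of a join as a filtered Cartesian product.
# Each relation is a list of tuples; key_r and key_s give the
# positions of the common (join) attribute in each relation.
def nested_loop_join(r, s, key_r, key_s):
    """Combine tuples from r and s whose join attributes match."""
    return [a + b for a in r for b in s if a[key_r] == b[key_s]]

# Hypothetical relations: employees joined to departments on dept id.
employees = [("alice", 10), ("bob", 20)]
departments = [(10, "sales"), (30, "hr")]
print(nested_loop_join(employees, departments, 1, 0))
# [('alice', 10, 10, 'sales')]
```

The nested loop makes the point of the passage concrete: with no predefined links between the sets, every pair of tuples is a candidate, which is why efficient join implementation is difficult.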
In this context, the standard measure of efficiency for a join operation is the number of I/O operations performed by the technique in question. Moreover, it is important to develop techniques that have provably good worst-case bounds and that are also efficient in practice. Any technique for the join problem requires Ω(n log_m n + t) I/O operations in the worst case. This is implied by the well-known lower bound for sorting in external memory, where B is the number of units of data per disk block, M is the number of units of data that fit in main memory, N is the number of units of data in the problem, T is the number of units of data in the solution, n = N/B, t = T/B, m = M/B, and log_m n = max{1, (log n)/(log m)}.
See A. Aggarwal & J. S. Vitter, "The Input/Output Complexity of Sorting and Related Problems," Communications of the ACM 31(9) (1988), pp. 1116-1127.
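To make the parameters of the bound concrete, the following small calculation uses assumed values (1 GiB of data, 64 MiB of main memory, 8 KiB blocks, all measured in single-byte units); the specific numbers are illustrative only.

```python
import math

# Illustrative values, not from the source:
N = 2**30   # units of data in the problem
M = 2**26   # units of data that fit in main memory
B = 2**13   # units of data per disk block
T = 2**20   # units of data in the solution

# The derived parameters from the external-memory sorting bound.
n, t, m = N // B, T // B, M // B

# log_m n is taken as max(1, log n / log m).
log_m_n = max(1.0, math.log(n) / math.log(m))
print(n, t, m, log_m_n)
```

With these values n = 2^17, t = 2^7, m = 2^13, so log_m n = 17/13 ≈ 1.31: for realistic block and memory sizes the logarithmic factor in the Ω(n log_m n + t) bound is a small constant.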
The general data structure problem in many data models is the storage and manipulation of d-dimensional rectangles. The term rectangle is used in a general sense to denote intervals in one dimension, rectangles in two dimensions and hyper-rectangles in d dimensions. It is also assumed that the sides of the rectangles are parallel to the coordinate axes. As such, indexing or retrieval in many data models reduces to d-dimensional range searching over d-dimensional rectangles. The join in many data models can be defined as the intersection between two sets of rectangles in d dimensions. Although other definitions of the join are possible, based on inequalities, dominance, or proximity, the intersection-based join problems are good representatives of join problems.
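The intersection test underlying such a join is simple to state: two axis-parallel rectangles intersect exactly when their projections overlap on every coordinate axis. A minimal sketch of this predicate, with illustrative rectangles:

```python
# A d-dimensional axis-parallel rectangle is represented as a list of
# (lo, hi) intervals, one per coordinate axis.  Two such rectangles
# intersect iff their projections overlap on every axis.
def rects_intersect(r1, r2):
    return all(lo1 <= hi2 and lo2 <= hi1
               for (lo1, hi1), (lo2, hi2) in zip(r1, r2))

a = [(0, 2), (0, 2)]   # a square in two dimensions
b = [(1, 3), (1, 3)]   # overlaps a
c = [(5, 6), (0, 2)]   # disjoint from a on the first axis
print(rects_intersect(a, b), rects_intersect(a, c))  # True False
```

The same representation covers intervals (d = 1), rectangles (d = 2), and hyper-rectangles (general d), matching the general sense in which the term rectangle is used above.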
In one dimension, the join is simply the set of intersections between two sets of intervals. This problem is the prototypical join problem in temporal and constraint data models. The prior art techniques used to solve the interval join problem suffer various drawbacks.
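An in-memory sketch of the one-dimensional join, assuming closed intervals, can be written as a sweep over sorted endpoints: each set's still-open intervals are kept in a heap ordered by right endpoint, and each arriving interval is paired with every surviving interval of the other set. This is an illustration of the problem, not any particular prior-art technique.

```python
import heapq

def interval_join(reds, blues):
    """Report all pairs (r, b), r in reds and b in blues, that overlap.

    Sweeps left endpoints in sorted order; 'active' holds, per color,
    a min-heap of (hi, lo) for intervals whose extent is still open.
    """
    events = sorted([(lo, hi, 'r') for lo, hi in reds] +
                    [(lo, hi, 'b') for lo, hi in blues])
    active = {'r': [], 'b': []}
    out = []
    for lo, hi, color in events:
        other = 'b' if color == 'r' else 'r'
        # Drop intervals of the other color that end before this one starts.
        while active[other] and active[other][0][0] < lo:
            heapq.heappop(active[other])
        # Every surviving interval of the other color overlaps [lo, hi].
        for o_hi, o_lo in active[other]:
            pair = ((lo, hi), (o_lo, o_hi))
            out.append(pair if color == 'r' else pair[::-1])
        heapq.heappush(active[color], (hi, lo))
    return out

print(interval_join([(1, 5), (8, 10)], [(4, 6), (9, 12)]))
# [((1, 5), (4, 6)), ((8, 10), (9, 12))]
```

Sorting dominates, so this runs in O(N log N + T) time in main memory; the difficulty discussed in the surrounding text is achieving the analogous O(n log_m n + t) bound in I/O operations.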
On-line interval intersection occurs where a data structure is built on an input set of intervals and then queried (or updated) in an on-line fashion. That is, the results of a query have to be returned before the next query is processed. This problem has been extensively studied both in main memory and secondary storage. In particular, the open problem of whether it is possible to build a dynamic, worst-case optimal data structure for this problem was recently resolved. See L. Arge & J. S. Vitter, "Optimal Dynamic Interval Management in External Memory," IEEE Symp. on Foundations of Comp. Sci. (1996). However, directly applying the on-line intersection technique to the one-dimensional join problem results in a running time of O(N log_B n + t) I/O operations, which is a factor of Θ(B (log m)/(log B)) away from optimal.
Off-line interval intersection, where a stream of queries is submitted to a data structure which processes them in a "lazy" fashion, has also been studied. Directly applying the off-line data structure to the one-dimensional join problem results in a technique that is optimal with respect to the running time, but uses a non-optimal O(n log_m n) disk blocks of storage.
An asymptotically optimal (but somewhat impractical) method for the one-dimensional join problem can be obtained by using the reduction of interval intersection to two-dimensional range searching and by using the batch two-dimensional range query techniques. On the other hand, the approaches in the temporal database literature for interval join are based on heuristics and do not offer good asymptotic worst-case bounds.
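The reduction mentioned above rests on a simple observation: mapping each interval [lo, hi] to the point (lo, hi) turns "which intervals intersect [c, d]?" into a two-sided range query. A minimal sketch of the predicate (the filtering here is a linear scan standing in for a range-searching structure):

```python
# Map each interval [lo, hi] to the planar point (lo, hi).  An interval
# intersects a query interval [c, d] exactly when lo <= d and hi >= c,
# i.e. when its point lies in the range (-inf, d] x [c, +inf).
def intersecting_via_range_query(points, c, d):
    return [(lo, hi) for lo, hi in points if lo <= d and hi >= c]

points = [(1, 3), (2, 8), (10, 12)]   # intervals, as points (lo, hi)
print(intersecting_via_range_query(points, 4, 9))  # [(2, 8)]
```

Replacing the scan with a batched two-dimensional range query technique yields the asymptotically optimal (but, as noted, somewhat impractical) method.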
In two dimensions, the join between two relations is the intersection between rectangles in the plane. This join problem has a very elegant solution in main memory that uses priority search trees and plane sweeping to achieve an optimal running time. In secondary storage, this problem can be solved in an asymptotically optimal way by reduction to the problems of line segment intersection and batched range searching. In practice, however, this solution is much less efficient than the methodology of the invention described below.
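The plane-sweep idea can be sketched as follows: sweep rectangles by left x-edge, keep per-color heaps of rectangles whose x-extent is still open, and report pairs whose y-extents overlap. This simplified sketch tests y-overlap by linear scan; the optimal main-memory solution referenced above would instead use a priority search tree for that step.

```python
import heapq

# Rectangles are ((x_lo, x_hi), (y_lo, y_hi)).  Sweep by left x-edge;
# 'active' holds, per color, a min-heap of (x_hi, rect) for rectangles
# whose x-extent is still open at the sweep line.
def rectangle_join(reds, blues):
    events = sorted([(r[0][0], 'r', r) for r in reds] +
                    [(b[0][0], 'b', b) for b in blues])
    active = {'r': [], 'b': []}
    out = []
    for x_lo, color, rect in events:
        other = 'b' if color == 'r' else 'r'
        # Retire rectangles whose x-extent ended before the sweep line.
        for side in ('r', 'b'):
            while active[side] and active[side][0][0] < x_lo:
                heapq.heappop(active[side])
        (_, x_hi), (y_lo, y_hi) = rect
        # Open rectangles of the other color overlap in x; test y-overlap.
        for _, cand in active[other]:
            (_, _), (cy_lo, cy_hi) = cand
            if y_lo <= cy_hi and cy_lo <= y_hi:
                pair = (rect, cand)
                out.append(pair if color == 'r' else pair[::-1])
        heapq.heappush(active[color], (x_hi, rect))
    return out

print(rectangle_join([((0, 2), (0, 2))],
                     [((1, 3), (1, 3)), ((5, 6), (0, 2))]))
# [(((0, 2), (0, 2)), ((1, 3), (1, 3)))]
```

The linear scan over the active set makes this sketch quadratic in the worst case; substituting a priority search tree is what yields the optimal in-memory running time.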
The general problem of finding intersections between two sets of d-dimensional hyper-rectangles has also been studied. Specifically, the prior art techniques focus on efficient internal-memory methods for reporting intersections between rectangles in d-dimensional space. The fastest currently known internal-memory method runs in O(N log_2^(d-1) N + T) time and linear space, and several earlier results exhibited additional logarithmic factors in time as well as in space.