Spatial data describes the shape and location of objects within a space. It is estimated that spatial data is associated with a large percentage of an organization's information. For example, organizations have spent billions of dollars over the years amassing spatial data in the form of street addresses, zip codes, maps, and satellite imagery. Based on the economic value of spatial data, there is a long-felt need for the efficient processing and storage of spatial data.
Typical spatial data systems, such as geographic information systems (GIS), represent the spatial characteristics of objects with geometric primitives. For example, the location of a mailbox, automated teller machine, fire hydrant, or oil well can be represented by a point. As another example, roads, railroad tracks, power lines, and oil pipelines can be represented by one-dimensional primitives such as curves, lines, and line strings, and areas such as parks, lakes, political districts, and oil fields can be represented by two-dimensional primitives such as circles and polygons.
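As a minimal sketch of how such primitives might be held in memory, the following Python fragment models points, line strings, and polygons as tuples of coordinates. The class names, field names, and coordinates are illustrative assumptions and are not drawn from any particular GIS.

```python
from dataclasses import dataclass
from typing import Tuple

Coord = Tuple[float, float]  # (x, y) in arbitrary map units (assumed)


@dataclass(frozen=True)
class Point:
    """Zero-dimensional primitive, e.g., a fire hydrant or an oil well."""
    xy: Coord


@dataclass(frozen=True)
class LineString:
    """One-dimensional primitive, e.g., a road or a pipeline."""
    vertices: Tuple[Coord, ...]


@dataclass(frozen=True)
class Polygon:
    """Two-dimensional primitive, e.g., a park; the ring is implicitly closed."""
    ring: Tuple[Coord, ...]


# Hypothetical objects, one per kind of primitive.
hydrant = Point((3.0, 4.0))
road = LineString(((0.0, 2.0), (5.0, 2.0)))
park = Polygon(((1.0, 1.0), (4.0, 1.0), (4.0, 4.0), (1.0, 4.0)))
```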
FIG. 2 depicts an exemplary space 200 illustrating a map of a geographic area consisting of roads and parks. In the exemplary space 200, there are four roads 202, 204, 206 and 208, represented by lines, and three parks 210, 212, and 214, represented by polygons. In space 200, roads and parks may intersect. More specifically, road 202 intersects park 210, which is also intersected by road 204. Road 206 does not intersect any park, and park 212 is not intersected by any road. Finally, road 208 is the only road that intersects park 214.
A spatial join is a type of spatial query that combines two sets of objects based on their relative locations, i.e., the geometric attributes of the objects must satisfy some spatial predicate. In particular, a spatial join answers a query about which spatial objects from one set interact, according to a spatial predicate, with spatial objects from another set. Examples of spatial predicates include such relationships as whether two objects touch, whether one object overlaps another, and whether one object is inside another. For example, determining which of roads 202, 204, 206, and 208 intersect which of parks 210, 212, and 214 in space 200 involves a spatial join. In this example, data is combined to form pairs, each consisting of a road from a set of roads and a park from a set of parks, based on a spatial predicate of intersection. Another example is a query that requests how many fire hydrants are within five blocks of a school.
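The definition above can be sketched in Python as a function that pairs objects from two sets whenever a caller-supplied predicate holds. The sketch applies it to the fire hydrant query. The function names, the coordinates, and the choice to measure "blocks" as straight-line distance in block-sized units are all assumptions made for illustration.

```python
import math


def spatial_join(left, right, predicate):
    """Return every (left_id, right_id) pair whose geometries satisfy predicate."""
    return [(lid, rid)
            for lid, lgeom in left.items()
            for rid, rgeom in right.items()
            if predicate(lgeom, rgeom)]


def within_distance(limit):
    """Build a predicate that holds when two points are at most `limit` apart."""
    def predicate(p, q):
        return math.dist(p, q) <= limit
    return predicate


# Hypothetical point data; one unit is assumed to equal one block.
hydrants = {"h1": (0, 0), "h2": (3, 4), "h3": (10, 10)}
schools = {"s1": (0, 1), "s2": (20, 20)}

pairs = spatial_join(hydrants, schools, within_distance(5))
hydrants_near_school = {hydrant_id for hydrant_id, _ in pairs}
print(len(hydrants_near_school))
```

The join itself returns pairs; the count requested by the query is obtained afterward by collecting the distinct hydrants that appear in at least one pair.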
One conventional approach to calculating a spatial join is to perform a nested spatial operation such as a nested range query, which uses the first set of objects to drive a series of range queries on the second set of objects. A range query, such as a "locate fence" query or a window query, is a spatial operation that determines whether an object interacts with, e.g., overlaps, a query range such as the interior of a polygon. In the exemplary space 200, a nested range query to find the roads that intersect the parks involves first performing range queries for road 202 to determine which of the parks 210, 212, and 214 overlap road 202. In this example, park 210 succeeds, while parks 212 and 214 fail. This process is repeated for road 204 (successful only for park 210), then for road 206 (fails for all parks 210, 212, and 214), and finally for road 208, which is successful only for park 214.
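The nested range query can be sketched in Python as two nested loops around an exact intersection test. FIG. 2 supplies no coordinates, so the coordinates below are hypothetical values chosen only to reproduce the relationships described for space 200. The helper names and the use of straight-segment roads are likewise assumptions.

```python
def orient(a, b, c):
    """Cross product (b - a) x (c - a); its sign gives the turn direction."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def in_box(a, b, c):
    """True if c lies within the bounding box of segment a-b."""
    return (min(a[0], b[0]) <= c[0] <= max(a[0], b[0])
            and min(a[1], b[1]) <= c[1] <= max(a[1], b[1]))


def segments_intersect(p1, p2, q1, q2):
    d1, d2 = orient(q1, q2, p1), orient(q1, q2, p2)
    d3, d4 = orient(p1, p2, q1), orient(p1, p2, q2)
    if d1 * d2 < 0 and d3 * d4 < 0:
        return True  # the segments properly cross
    # Degenerate cases: an endpoint of one segment lies on the other.
    return ((d1 == 0 and in_box(q1, q2, p1)) or (d2 == 0 and in_box(q1, q2, p2))
            or (d3 == 0 and in_box(p1, p2, q1)) or (d4 == 0 and in_box(p1, p2, q2)))


def point_in_polygon(pt, poly):
    """Ray-casting test: count polygon edges crossed by a ray going right."""
    x, y = pt
    inside = False
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        if (y1 > y) != (y2 > y):
            if x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                inside = not inside
    return inside


def road_intersects_park(road, park):
    edges = list(zip(park, park[1:] + park[:1]))
    for a, b in zip(road, road[1:]):
        if any(segments_intersect(a, b, c, d) for c, d in edges):
            return True
    # No boundary crossing: the road is wholly inside or wholly outside.
    return point_in_polygon(road[0], park)


def nested_range_join(roads, parks):
    result = []
    for road_id, road in roads.items():        # outer loop over the first set
        for park_id, park in parks.items():    # range query over the second set
            if road_intersects_park(road, park):
                result.append((road_id, park_id))
    return result


# Hypothetical coordinates standing in for space 200.
roads = {202: [(0, 2), (5, 2)], 204: [(2, 0), (2, 5)],
         206: [(7, 9), (9, 7)], 208: [(9, 2), (14, 2)]}
parks = {210: [(1, 1), (4, 1), (4, 4), (1, 4)],
         212: [(6, 6), (8, 6), (6, 8)],
         214: [(10, 1), (13, 1), (13, 3), (10, 3)]}

print(nested_range_join(roads, parks))
```

Under these assumed coordinates, the join pairs roads 202 and 204 with park 210 and road 208 with park 214, while road 206 and park 212 appear in no pair, consistent with the description of space 200.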
A major drawback of the nested range query approach is its lack of scalability in computation time. Processing each object in the first set has some cost, and the aggregate cost of performing all the range queries is roughly proportional to the product of this cost and the number of objects in the first set. If the cost of performing the range query for one object in the first set against the objects in the second set is approximately linear in the size of the second set, then the total running time is O(mn), where m is the number of objects in the first set and n is the number of objects in the second set. Accordingly, conventional systems typically employ ancillary data structures such as indexes to reduce each iteration to an O(log n) running time, aggregating to a log-linear O(m log n) running time. However, even log-linear running times quickly become unacceptable for very large data sets.
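To give a rough sense of the two growth rates, the following Python sketch evaluates mn and m log n for sets of equal size. It is a simplified cost model only: constant factors are ignored, the base-2 logarithm is an assumption, and the function names are hypothetical.

```python
import math


def nested_cost(m, n):
    """Exact tests performed when each of m range queries scans all n objects."""
    return m * n


def indexed_cost(m, n):
    """Approximate index probes when each range query costs about log2(n)."""
    return m * math.log2(n)


for size in (1_000, 1_000_000):
    print(size, nested_cost(size, size), round(indexed_cost(size, size)))
```

For two sets of one million objects each, the unindexed model yields 10^12 exact tests, whereas the indexed model yields on the order of 2 x 10^7 probes. The latter is far smaller, yet it still grows faster than the input itself.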
Thus, one approach to reducing the computational expense of a spatial join is to use a computationally inexpensive preprocessing step to eliminate at least some of the answers that do not satisfy the spatial join query. Only those answers that pass this preprocessing step, called a "primary filter," are submitted to the more expensive range queries and other exact spatial operations in a phase called the "secondary filter." For example, road 202 is quite distant from and does not intersect park 214. Thus, one useful primary filter would exclude the combination of road 202 and park 214 as an answer to the exemplary spatial join. A primary filter therefore is an "inexact" spatial join. By "inexact," it is meant that the result of the primary filter may contain answers that do not satisfy the spatial predicate according to the exact method. However, a valid primary filter must produce all the correct answers to the spatial join query.
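The two-phase scheme can be sketched in Python as follows. The text above does not prescribe a particular approximation, so the sketch assumes one common choice, axis-aligned bounding boxes, as the primary filter, and an exact segment-versus-polygon test as the secondary filter. The coordinates are hypothetical stand-ins for space 200, and all function names are illustrative.

```python
def bbox(coords):
    """Axis-aligned bounding box as (min_x, min_y, max_x, max_y)."""
    xs, ys = zip(*coords)
    return min(xs), min(ys), max(xs), max(ys)


def boxes_overlap(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def orient(a, b, c):
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def in_box(a, b, c):
    return (min(a[0], b[0]) <= c[0] <= max(a[0], b[0])
            and min(a[1], b[1]) <= c[1] <= max(a[1], b[1]))


def segments_intersect(p1, p2, q1, q2):
    d1, d2 = orient(q1, q2, p1), orient(q1, q2, p2)
    d3, d4 = orient(p1, p2, q1), orient(p1, p2, q2)
    if d1 * d2 < 0 and d3 * d4 < 0:
        return True
    return ((d1 == 0 and in_box(q1, q2, p1)) or (d2 == 0 and in_box(q1, q2, p2))
            or (d3 == 0 and in_box(p1, p2, q1)) or (d4 == 0 and in_box(p1, p2, q2)))


def point_in_polygon(pt, poly):
    x, y = pt
    inside = False
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside


def exact_intersects(road, park):
    """Secondary-filter test: exact road-versus-polygon intersection."""
    edges = list(zip(park, park[1:] + park[:1]))
    for a, b in zip(road, road[1:]):
        if any(segments_intersect(a, b, c, d) for c, d in edges):
            return True
    return point_in_polygon(road[0], park)


def primary_filter(roads, parks):
    """Inexact join: keep pairs whose bounding boxes overlap (no true answer is lost)."""
    return [(r, p) for r in roads for p in parks
            if boxes_overlap(bbox(roads[r]), bbox(parks[p]))]


def secondary_filter(candidates, roads, parks):
    """Exact join, applied only to the candidates that passed the primary filter."""
    return [(r, p) for r, p in candidates if exact_intersects(roads[r], parks[p])]


# Hypothetical coordinates standing in for space 200.
roads = {202: [(0, 2), (5, 2)], 204: [(2, 0), (2, 5)],
         206: [(7, 9), (9, 7)], 208: [(9, 2), (14, 2)]}
parks = {210: [(1, 1), (4, 1), (4, 4), (1, 4)],
         212: [(6, 6), (8, 6), (6, 8)],
         214: [(10, 1), (13, 1), (13, 3), (10, 3)]}

candidates = primary_filter(roads, parks)
answers = secondary_filter(candidates, roads, parks)
print(candidates)
print(answers)
```

With these assumed coordinates, the primary filter reduces the twelve possible road and park combinations to four candidates. One of them, road 206 with park 212, is a false positive whose bounding boxes overlap although the geometries do not intersect, and the secondary filter removes it. The combination of road 202 and park 214 never reaches the exact test.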
Since a primary filter need not be "exact," it is acceptable for a primary filter to use approximations to reduce the computational complexity and maintain computational scalability. The computational costs of a primary filter are typically reduced by storing information in ancillary data structures such as an index. Thus, a primary filter permits the fast selection of a small number of candidate answers to pass along to the exact, and computationally more expensive, methods of the secondary filter. It is also desirable, however, to improve the selectivity and running time of the primary filter, yet conventional primary filters tend to degrade in both index storage space and compute time when higher-resolution approximation techniques are employed.