In a commercial database system, the optimizer component is responsible for picking the best strategy for solving a Structured Query Language (SQL) query. The optimizer typically does this by examining a number of possible search strategies, and then picking the strategy that can be executed in the shortest amount of time. A key ingredient that enables the optimizer to perform this task is the availability of statistics that help the optimizer understand the number of database rows, and underlying database storage blocks, which must be manipulated in order to resolve a given query.
Although the collection of statistics is not supported by any of the SQL standards, most Relational Database Management System (RDBMS) vendors include support for some form of statistical collection on the data stored within their databases. A RDBMS vendor specific Data Definition Language (DDL) is utilized as the mechanism by which a user instructs the database to gather statistics on a particular column of a relational table. The RDBMS logic analyzes the data values stored in the column to determine a frequency of occurrence for each value. The collected data is then transformed into some form, suitable for tracking data of that type. For example, for scalar data this information may be transformed into a histogram, storing the frequency of occurrence for the scope of the scalar data. Other forms of storing the frequency data are possible, depending upon the RDBMS vendor. This approach to collecting and storing statistics has worked well, because the principal data types up to recent history have all been scalar data types.
In addition to statistical collection, a commercial database is also provided with a database table column indexing mechanism that enables rows of the database to be retrieved more efficiently via a search condition executed against the table's indexed column. For scalar data, the index search key is usually the same scalar value as that found in the table's column being indexed. Armed with database indexes and statistics, the optimizer is thus able to select an optimal plan for solving an SQL query.
Recently, spatial data types were introduced into the American National Standards Institute (SQL) standard. Spatial data types are not scalar and represent for the most part multi-dimensional shapes: circles, squares, rectangles, polygons, etc. Unlike scalar SQL data types that mainly are concerned with relational operations such as: less than, greater than, equals, etc.; spatial data types have a large number of ANSI defined SQL operations that can be applied against them, such as: touches, intersects, overlaps, within, disjoint, crosses, distance, etc.; all of which must be executed in an efficient manner.
Spatial columns may also be indexed to improve overall retrieval performance. To support spatial indexing, one would typical need a specialized spatial construct such as a Minimum Bounding Rectangle (MBR) Tree-based index (R-Tree index). This index provides a mechanism by which spatial objects may be stored/retrieved while simultaneously maintaining the inter-spatial object relationships. An R-Tree approach is built around the concept of a MBR's. The MBR is used to decide where data is to be stored within the file system/index containing the spatial data.
The problem is how to collect statistics for spatial data residing within either a table column or an R-Tree index that can then be fed to the optimizer to enable it to select the best strategy for solving an SQL query containing spatial predicates and spatial join terms. The histogram model, though useful for scalar data types, is not applicable to the case of spatial data.
Therefore, a mechanism is needed that would enable the optimizer to predict how many database rows might be returned when a particular spatial operation is executed against a table.