Very large databases pose significant challenges in storing, accessing, and managing data. Often, large databases owe their size to a few very large tables or other database objects rather than to a very large number of smaller tables. For example, most data warehouses and decision support systems have a few very large fact tables and several smaller lookup tables. As another example, spatial databases typically store the shape and location of large numbers of objects such as houses, streets, and zip code areas.
Partitioning is a strategy employed to improve the storage, access, and administrative performance of very large database objects. More specifically, a database object such as a relational database table or index is partitioned by subdividing the database object into several smaller independent subsets of the database object based on a "partition key." For "range partitioning" of non-spatial, relational database tables, the partition key is typically selected from among the date, primary key, or foreign key columns. The partition key is then used to divide the table into one or more ranges of values. When a query is issued to access and retrieve data from a range partitioned table, the database server can ignore partitions outside of certain ranges, thereby improving query performance and response time. For example, a very large on-line transaction processing (OLTP) database can be range partitioned based on the year. Thus, queries on the OLTP database that specify the year 1997 can safely ignore data in partitions for other years.
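The range partitioning and partition skipping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `RangePartitionedTable`, the column name `order_date`, and the year boundaries are hypothetical and do not correspond to any particular database system.

```python
from bisect import bisect_right
from datetime import date


class RangePartitionedTable:
    """Toy range-partitioned table (illustrative, not a real database API)."""

    def __init__(self, key, boundaries):
        # boundaries are sorted, exclusive upper bounds. Partition i holds
        # keys k with boundaries[i-1] <= k < boundaries[i]; one extra
        # partition at the end holds every key >= the last boundary.
        self.key = key
        self.boundaries = list(boundaries)
        self.partitions = [[] for _ in range(len(self.boundaries) + 1)]

    def _partition_index(self, value):
        # Binary search for the partition whose range contains value.
        return bisect_right(self.boundaries, value)

    def insert(self, row):
        self.partitions[self._partition_index(row[self.key])].append(row)

    def query_range(self, low, high):
        """Return rows with low <= key <= high, plus the partitions scanned."""
        first = self._partition_index(low)
        last = self._partition_index(high)
        scanned = list(range(first, last + 1))  # all other partitions are skipped
        rows = [r for i in scanned for r in self.partitions[i]
                if low <= r[self.key] <= high]
        return rows, scanned


# Hypothetical data: partition 0 is before 1997, partition 1 is 1997,
# partition 2 is 1998 and later.
table = RangePartitionedTable("order_date", [date(1997, 1, 1), date(1998, 1, 1)])
for d in (date(1996, 6, 15), date(1997, 3, 2), date(1997, 11, 20), date(1998, 2, 1)):
    table.insert({"order_date": d})

rows, scanned = table.query_range(date(1997, 1, 1), date(1997, 12, 31))
```

In this sketch, a query restricted to 1997 scans only the partition holding 1997 and never touches the rows stored for other years, which is the source of the performance benefit.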
Spatial databases also benefit from partitioning. However, range partitioning on a spatial dimension is often inappropriate for spatial databases. For example, a common spatial query is: given a particular point in space, what objects are within a certain distance of the point? With this type of query, it is desirable to group data that are spatially close to each other, but range partitioning groups data primarily according to a single dimension. For example, range partitioning along the x-axis may group the point (1, 1) in the same partition as the point (1, 100) even though the point (2, 2) is much closer to (1, 1). Range partitioning is also inappropriate for multi-dimensional non-spatiotemporal data such as sales data (e.g., dollars and product codes) when no one dimension dominates the others. Therefore, there is a need for a data partitioning methodology that can handle spatiotemporal data as well as non-spatiotemporal data in a way that does not cause one particular dimension to dominate the partitioning criteria.
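The x-axis example above can be checked numerically. The following minimal sketch assumes a single hypothetical partition boundary at x = 1.5, which is not stated in the text and is chosen only so that the three points fall on different sides of a split; the helper `x_partition` is likewise illustrative.

```python
import math
from bisect import bisect_right

# Hypothetical boundary: points with x < 1.5 go to partition 0,
# all other points go to partition 1.
X_BOUNDARIES = [1.5]


def x_partition(point):
    """Assign a point to a range partition using only its x coordinate."""
    return bisect_right(X_BOUNDARIES, point[0])


a, b, c = (1, 1), (1, 100), (2, 2)

# a and b share a partition although they are 99 units apart.
same_partition = x_partition(a) == x_partition(b)
distance_ab = math.dist(a, b)

# c is about 1.41 units from a, yet lands in a different partition.
split_apart = x_partition(a) != x_partition(c)
distance_ac = math.dist(a, c)
```

Under this assumed boundary, a query for objects near (1, 1) would have to read the partition containing the distant point (1, 100) and a second partition to find the nearby point (2, 2), because the y coordinate plays no role in the grouping.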
Conventionally, data partitioning operates best if the entire set of data is provided at the time of partitioning, because the optimal size of the partitions is a function of the size of the data set and influences query performance. In many real-world situations, however, such as receiving spatial data from a satellite, the entirety of the data is not immediately available and therefore the data set needs to be incrementally partitioned.
Under conventional methodology, the data is originally partitioned based on a guess of the optimal partition size and then repartitioned after all the data has been collected. This process of repartitioning is very expensive and typically involves exporting the data from the database to a flat file outside the database system. The flat file is then analyzed to determine the partitions. While the data is being analyzed in the flat file, it is unavailable for queries. Furthermore, many of the high performance and reliability features that are characteristic of relational databases are unavailable to this extra-database partitioning process. For example, if the system goes down during the off-line repartitioning, the recovery features of the database system cannot be used to recover the repartition operation. Therefore, the repartition process is forced to start all over again. Consequently, the "downtime" of the data can be considerable if a large data set is partitioned. Therefore, there is a need for a data partitioning methodology that efficiently handles incremental partitioning and reduces the downtime during partitioning.
In conventional relational database systems, partitions are stored as tables organized into rows and columns. However, rows may be stored in the table in any order, so that rows that contain information about two points that are spatially close to each other could reside on separate disk blocks in the database, requiring multiple disk input/output operations to retrieve them. Therefore, there is a need for intra-partition physical clustering of spatial data so that the records for closely related spatial data will reside close to each other.