As the amount of information stored in database systems grows, data or complete records are often stored at more than one database storage site. An important capability of database programs is fast and efficient access to the records in each individual database. To handle the distribution and retrieval of the data properly, data processing systems often include database management programs. These programs provide easy access to databases, each of which may consist of multiple records stored at many nodes and sites. Relational database management programs provide this capability.
One common database configuration is made up of various tables, each consisting of rows and columns of information. The information stored across one row of a table makes up one record, and the fields of the record are the columns of the table. In other words, the table contains rows of individual records and columns of record fields. Because one record may contain more than one field of information, each field occupies its own column of the table. Other database configurations are found in the art. Database management programs support multiple users, thereby enabling the users to access the same table concurrently.
An index file is commonly used by database management programs to provide quick and efficient associative access to a table's records. These index files are commonly configured as a B-tree structure, which consists of a root node with many levels of nodes branching from the root node. The information contained in these nodes may include pointers to the nodes at the next level of the tree, or pointers to one or more records stored in the database. These pointers include additional key record information which may reference the records stored in the database. The record keys are stored in an ordered form throughout the nodes at the various branches of the tree. For example, an index tree may exist for an alphabetic listing of employee names. The root node would include reference key data that relates to individual record information that may be indirectly or directly referenced by the next level of nodes in the tree. The reference keys contain information about the index field, e.g., the alphabetic spelling of the employee's name. Therefore, the ordered keys in the root node would point to the next successive level of nodes. For instance, the first node at the next level may indirectly or directly reference all employee names beginning with the letters A-H. A second node at the same level may reference employee records whose last names begin with the letters I-P. The last node on this level would reference records of employees with last names starting with Q-Z. As one searches through the index tree, a bottom node is eventually reached. The contents of the bottom node may include record key information that further points to individual records in storage or may point back to one of the branch nodes in the tree.
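The descent through such an index tree can be illustrated with a minimal sketch in Python. The `Node` class, the `search` function, the sample names and the record numbers are hypothetical and are chosen only to mirror the A-H, I-P and Q-Z example; the sketch does not represent the index format of any particular database management program.

```python
from bisect import bisect_left, bisect_right


class Node:
    """Index node: inner nodes hold separator keys and children, bottom nodes hold record pointers."""

    def __init__(self, keys, children=None, records=None):
        self.keys = keys          # ordered keys
        self.children = children  # child nodes (inner node) or None (bottom node)
        self.records = records    # record pointers (bottom node) or None (inner node)


def search(root, key):
    """Descend from the root to a bottom node and return the record pointer, or None."""
    node = root
    while node.children is not None:
        # The separator keys are ordered, so a binary search selects the branch.
        node = node.children[bisect_right(node.keys, key)]
    pos = bisect_left(node.keys, key)
    if pos < len(node.keys) and node.keys[pos] == key:
        return node.records[pos]
    return None


# Hypothetical employee index with three bottom nodes: A-H, I-P and Q-Z.
leaf_a_h = Node(keys=["Adams", "Baker", "Hill"], records=[101, 102, 103])
leaf_i_p = Node(keys=["Ives", "Moore"], records=[104, 105])
leaf_q_z = Node(keys=["Quinn", "Young"], records=[106, 107])
root = Node(keys=["I", "Q"], children=[leaf_a_h, leaf_i_p, leaf_q_z])

print(search(root, "Moore"))   # record pointer stored for "Moore"
print(search(root, "Zimmer"))  # no such key in the index
```

In this sketch the root holds only the separator keys "I" and "Q", so a single binary search at each level is sufficient to select the branch that may contain the requested name.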
In recent years, a variety of new database applications have been developed which differ substantially from conventional database applications in many respects. For example, new database applications such as data warehousing produce very large relations which require a multidimensional view of the data, and in areas such as multimedia, content-based search is essential and is often implemented using some kind of feature vectors. All the new applications have in common that the underlying database system has to support query processing on large amounts of high-dimensional data. This raises the question of how processing high-dimensional data differs from processing low-dimensional data. A result of recent research activities is that essentially none of the querying and indexing techniques which provide good results on low-dimensional data also perform sufficiently well on high-dimensional data for larger queries. Previously, the only approach taken to solve this problem for larger queries was parallelization. A variety of new index structures, cost models and query processing techniques have been proposed. Most of the index structures are extensions of multidimensional index structures adapted to the requirements of high-dimensional indexing. Thus, all these index structures are restricted with respect to the data space partitioning. Additionally, they suffer from the well-known drawbacks of multidimensional index structures, such as high costs for insert and delete operations and poor support for concurrency control and recovery. Recently, a few high-dimensional index structures have been proposed.
In "The TV-Tree: An Index Structure for High-Dimensional Data" by K. Lin et al., VLDB Journal, Vol. 3, pp. 517-542, 1995, Lin et al. presented the TV-tree, which is an R-tree-like index structure. The central concept of the TV-tree is the telescope vector (TV). Telescope vectors divide attributes into three classes: attributes which are common to all data items in a subtree, attributes which are ignored, and attributes which are used for branching in the directory. The motivation for ignoring attributes is that a sufficiently high selectivity can often be achieved by considering only a subset of the attributes. Therefore, the remaining attributes have no chance to substantially contribute to query processing. Obviously, redundant storage of common attributes does not contribute to query processing either. The major drawback of the TV-tree is that information about the behavior of single attributes, e.g., their selectivity, is required.
Another R-tree-like high-dimensional index structure is the SS-tree, which uses spheres instead of bounding boxes in the directory and was disclosed in "Similarity Indexing With the SS-Tree" by D. A. White and R. Jain, Proc. 12th Int. Conference on Data Engineering, New Orleans, La., 1996. Although the SS-tree clearly outperforms the R*-tree, spheres tend to overlap in high-dimensional spaces. Thus, an improvement of the SS-tree has recently been proposed in "The SR-Tree: An Index Structure for High-Dimensional Nearest Neighbor Queries" by N. Katayama and S. Satoh, Proc. ACM SIGMOD Int. Conference on Management of Data, 1997, pp. 369-380, where the concepts of the R-tree and the SS-tree are integrated into a new index structure, the SR-tree. The directory of the SR-tree consists of spheres (SS-tree) and hyper-rectangles (R-tree) such that the area corresponding to a directory entry is the intersection between the sphere and the hyper-rectangle. Therefore, the SR-tree outperforms both the R*-tree and the SS-tree.
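The effect of using the intersection of a sphere and a hyper-rectangle as the region of a directory entry can be illustrated with a short Python sketch. The function names and the sample region are hypothetical and are not taken from the cited paper. The sketch assumes Euclidean distance and relies only on the geometric fact that a region contained in both shapes is at least as far from a query point as either shape alone.

```python
import math


def mindist_rect(q, lo, hi):
    """Smallest Euclidean distance from point q to the hyper-rectangle [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))


def mindist_sphere(q, center, radius):
    """Smallest Euclidean distance from point q to the sphere (0 if q lies inside)."""
    return max(0.0, math.dist(q, center) - radius)


def in_region(p, lo, hi, center, radius):
    """True if p lies in the intersection of the hyper-rectangle and the sphere."""
    inside_rect = all(l <= x <= h for x, l, h in zip(p, lo, hi))
    return inside_rect and math.dist(p, center) <= radius


def mindist_lower_bound(q, lo, hi, center, radius):
    """Lower bound on the distance from q to the intersection region.

    The intersection is contained in both shapes, so it is at least as far
    from q as the farther of the two.
    """
    return max(mindist_rect(q, lo, hi), mindist_sphere(q, center, radius))


# Hypothetical two-dimensional directory entry.
lo, hi = (0.0, 0.0), (2.0, 2.0)
center, radius = (1.0, 1.0), 1.0

print(in_region((0.0, 0.0), lo, hi, center, radius))  # corner of the rectangle, outside the sphere
print(mindist_lower_bound((3.0, 3.0), lo, hi, center, radius))
```

For the query point (3, 3) the sphere gives the larger of the two distances, so the combined bound is tighter than the bound obtained from the hyper-rectangle alone, which allows more directory entries to be excluded during a nearest neighbor search.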
In "Similarity Indexing: Algorithms and Performance" by R. Jain and D. A. White, Proc. SPIE Storage and Retrieval for Image and Video Databases IV, Vol. 2670, pp. 62-75, San Jose, Calif., 1996, the VAM-Split R-tree and the VAM-Split KD-tree are introduced. Both are static index structures, i.e., all data items must be available at the time the index is created. VAM-Split trees are rather similar to KD-trees; however, in contrast to KD-trees, splits are not performed using the 50%-quantile of the data according to the split dimension, but on the value where the maximum variance can be achieved. VAM-Split trees are built in main memory and then stored on secondary storage. Therefore, the size of a VAM-Split tree is limited by the main memory available during the creation of the index.
In "The X-Tree: An Index Structure for High-Dimensional Data" by S. Berchtold et al., 22nd Conference on Very Large Databases, Bombay, India, pp. 28-39, 1996, the X-tree was proposed, which is an index structure adapting the algorithms of R*-trees to high-dimensional data using two techniques. First, the X-tree introduces an overlap-free split algorithm which is based on the split history of the tree. Second, if the overlap-free split algorithm leads to an unbalanced directory, the X-tree omits the split and the corresponding directory node becomes a so-called supernode. Supernodes are directory nodes which are enlarged by a multiple of the block size. The X-tree outperforms the R*-tree by a factor of up to 400 for point queries.
All these approaches have in common that they must use the 50%-quantile when splitting a data page in order to fulfill storage utilization guarantees. As will be shown below, this is the worst case in high-dimensional indexing, because the resulting pages have an access probability close to 100%.
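The intuition behind this statement can be sketched numerically in Python. The sketch rests on assumptions that are not stated above: data uniformly distributed in the unit hypercube, hypercube-shaped range queries placed uniformly so that they lie entirely inside the data space, and a hypothetical selectivity of 0.01%. The function names are likewise hypothetical.

```python
def query_side_length(selectivity, dims):
    """Side length of a hypercube query covering the given fraction of the unit data space."""
    return selectivity ** (1.0 / dims)


def access_probability(selectivity, dims, split_dims):
    """Probability that a query intersects a page produced by 50%-quantile splits.

    The page is assumed to occupy one half of the data space in each of
    `split_dims` dimensions and the full extent in all other dimensions.
    """
    side = query_side_length(selectivity, dims)
    if side >= 1.0:
        return 1.0
    # In one split dimension the query interval starts uniformly in
    # [0, 1 - side] and meets a given half with probability 0.5 / (1 - side),
    # capped at 1 once the side length exceeds 0.5.
    per_dim = min(1.0, 0.5 / (1.0 - side))
    return per_dim ** split_dims


for dims in (2, 4, 8, 16):
    side = query_side_length(1e-4, dims)
    prob = access_probability(1e-4, dims, split_dims=min(dims, 4))
    print(dims, round(side, 4), round(prob, 4))
```

Under these assumptions, a query with a selectivity of only 0.01% in a 16-dimensional space has a side length of about 0.56. Because this exceeds 0.5, the query necessarily crosses the midpoint of every dimension and therefore intersects every page that was created by splitting at the 50%-quantile.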
To overcome this drawback, S. Berchtold et al. recently proposed another approach in "Improving the Query Performance of High-Dimensional Index Structures Using Bulk-Load Operations," 1998, where unbalanced partitioning of space was applied. The proposed technique is an efficient bulk-loading operation of an X-tree. However, the approach is applicable only if all the data is known a priori, which is not always the case. Additionally, due to restrictions of the X-tree directory, a peel-like partitioning, which is important for indexing high-dimensional data spaces, cannot be achieved, as will be described below.