The present invention relates to an efficient method of partitioning and indexing multi-dimensional data together with a conceptually simple and computationally efficient method of selectively retrieving subsets of the data, where subsets are defined as the data lying within rectangular regions within multi-dimensional space.
A file or database of multi-dimensional data contains representations of logical entities, each of which is described by an ordered set of attributes, or ‘dimensions’ or coordinates. Moreover, these entities need to be organized and stored in some way so that sub-sets can be selectively and ‘efficiently’ retrieved according to values or ranges of values in one or more of any of their attributes.
The most commonly applied solution at present is the ‘relational database’ model, originating from the late 1960s, although this is not specifically a multi-dimensional method. The relational method usually orders data according to values in a set of one or more attributes whose values are placed in a ‘primary index’. This facilitates easy access to subsets of data whose primary key values lie within a given range of values. To facilitate retrieval of a subset of data whose values in one or more non-primary-key attributes lie within a given range, secondary indexes are commonly used on one or more subsets of non-primary-key attributes. This gives rise to a number of problems.
Firstly, there is a practicable limit to the number of secondary indexes that can be supported. Potentially 2^n − 2 indexes can be provided for n-dimensional data, but providing all of these requires an excessive amount of storage capacity and an excessive maintenance overhead when data is updated. In practice, only a limited number of secondary indexes is ever provided.
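The count above can be checked with a short sketch. It assumes that each potential index corresponds to an unordered, non-empty, proper subset of the n attributes; that reading, and the function name `candidate_index_subsets`, are illustrative assumptions rather than anything defined in this specification.

```python
from itertools import combinations


def candidate_index_subsets(attributes):
    """Return every non-empty proper subset of the given attributes.

    Assumption: one potential index per unordered attribute subset,
    excluding the empty set and the full set of attributes.
    """
    n = len(attributes)
    subsets = []
    for size in range(1, n):  # sizes 1 .. n-1
        subsets.extend(combinations(attributes, size))
    return subsets


if __name__ == "__main__":
    attrs = ["a", "b", "c", "d"]
    # For n = 4 the count equals 2**4 - 2.
    print(len(candidate_index_subsets(attrs)), 2 ** len(attrs) - 2)
```

Because the count doubles with each added attribute, even a moderate number of dimensions makes exhaustive secondary indexing impracticable.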
Secondly, if a desired secondary index does not exist, some queries require the costly operation of intersecting two or more indexes or intersecting the results of two or more subsidiary queries.
Thirdly, retrieval of data using any secondary index can never be as efficient as using a primary index, since the underlying data is not actually ordered by the secondary index, thus more pages of data in a database need to be retrieved and searched. In consequence, some forms of query are more efficiently executed than others.
The relational implementations may have hitherto adapted to changing requirements for handling data but this does not ensure that they will be able to provide universal solutions in the future. Data generation and collection continues to grow at an ever accelerating rate along with aspirations for more sophisticated analysis and processing techniques and capabilities. Data that is being generated and collected is also becoming increasingly high-dimensional in its nature.
A considerable volume of research has been carried out in the area of indexing multi-dimensional data over many years. Nevertheless, no paradigm appears to have emerged to compare with the pre-eminence of the B-Tree and its variants in the indexing of one-dimensional data. Indeed, the volume of previous and continuing research provokes the conclusion that the development of an optimum strategy for indexing multi-dimensional data very much remains a problem unsolved.
A design of a ‘true’ multi-dimensional database organization method attempts to solve the conflicting problems of how to store data compactly while enabling it to be selectively retrieved, or ‘queried’, flexibly and efficiently.
Most file organization methods partition data by dividing the space in which it is embedded into rectangles, or their equivalent in higher dimensions, and an index entry is created for each rectangle. When only a few rectangles are defined the index can be accommodated in memory and serially searched as updates and queries are performed. Once the number of such rectangles exceeds some threshold, they must be partitioned, initially, into 2 sub-sets, each of which is most commonly regarded as a node in a tree index structure. A problem arises, often immediately, in that rectangles enclosing a pair of sub-sets of smaller rectangles may overlap. Thus where data insertion is required, for example, it may be necessary to search more than one path in an index tree to locate the page on which to place the data. Avoiding or accommodating this has been the focus of much of the research into organizing multi-dimensional data.
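The overlap problem can be illustrated with a minimal sketch. It assumes rectangles are represented as a pair of tuples (lower corner, upper corner); the helper names `bounding_box`, `overlaps` and `nodes_to_search` are hypothetical and do not correspond to any particular published method.

```python
def bounding_box(rects):
    """Minimum bounding box of rectangles given as (lows, highs) tuples."""
    dims = range(len(rects[0][0]))
    lows = tuple(min(r[0][d] for r in rects) for d in dims)
    highs = tuple(max(r[1][d] for r in rects) for d in dims)
    return lows, highs


def overlaps(a, b):
    """True if boxes a and b share a region of positive extent in every dimension."""
    return all(a[0][d] < b[1][d] and b[0][d] < a[1][d] for d in range(len(a[0])))


def contains(box, point):
    """True if the point lies inside or on the boundary of the box."""
    return all(box[0][d] <= point[d] <= box[1][d] for d in range(len(point)))


def nodes_to_search(point, node_boxes):
    """Indexes of all nodes whose bounding box contains the point."""
    return [i for i, box in enumerate(node_boxes) if contains(box, point)]


if __name__ == "__main__":
    # No two of the small rectangles overlap, yet the two enclosing boxes do.
    group_a = [((0, 0), (2, 2)), ((5, 5), (7, 7))]
    group_b = [((3, 3), (4, 4)), ((8, 0), (9, 1))]
    boxes = [bounding_box(group_a), bounding_box(group_b)]
    print(overlaps(boxes[0], boxes[1]))
    print(nodes_to_search((3.5, 3.5), boxes))
```

In this example a point falling in the shared region belongs to both enclosing boxes, so an insertion or query for it must follow both paths in the tree.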
The Grid File, described by Jürg Nievergelt, Hans Hinterberger and Kenneth C. Sevcik in “The grid file: An adaptable symmetric multikey file structure”, ACM Transactions on Database Systems, 9(1):38–71, 1984, is characterized by an exponential directory growth rate, a need for periodic significant directory reorganization and the requirement for directory lists to be intersected in the execution of queries.
The R-Tree index, described by Antonin Guttman in “R-Trees: A dynamic index structure for spatial searching”, SIGMOD'84: Proceedings of the Annual Meeting, volume 14(2) of Sigmod Record, pages 47–57, ACM, 1984, is balanced and simple in comparison with many other methods and not subject to the same degree of reorganization on insertion and deletion of data. However, these benefits are gained at the expense of tolerating overlapping rectangles which can significantly degrade query performance. The R-Tree was designed for indexing multi-dimensional spatial data rather than point data.
The BANG File, described by M. Freeston in “The BANG File: A new kind of grid file” in Proceedings of the Association for Computing Machinery Special Interest Group on Management of Data 1987 Annual Conference, San Francisco, May 27–29, 1987, pages 260–269, tolerates overlapping rectangles but in a more controlled manner than in the R-Tree. It requires a complex, although balanced, index structure. This design has been the subject of a number of papers although none addresses algorithms for executing queries. I do not believe that this is because they are dealt with trivially.
The SS-Tree, described by David A. White and Ramesh Jain in “Similarity indexing with the SS-Tree” in Proceedings of the Twelfth International Conference on Data Engineering, Feb. 26–Mar. 1, 1996, New Orleans, pages 516–523, IEEE Computer Society, 1996, is a ‘similarity’ indexing method oriented to the storage of multi-dimensional data in a manner which supports ‘similarity queries’. Such queries may be of the forms “find objects similar to a reference” and “find pairs of objects which are similar”, in addition to conventional query forms. Data is transformed into ‘feature vectors’, which take account of the varying significances of values in different dimensions. A significant disadvantage, however, is that the input of a domain expert is required for this purpose. Space containing feature vectors is then partitioned into spheres which theoretically contain the k-nearest neighbours of their centre points. I say ‘theoretically’ since, in practice, there is a problem in that spheres may overlap in the way that the rectangles of the R-Tree do.
The SR-Tree, described by Norio Katayama and Shin'ichi Satoh in “The SR-Tree: An Index Structure for High-Dimensional Nearest Neighbor Queries” in SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, May 13–15, 1997, Tucson, Ariz., USA, ACM Press, 1997, pages 369–380, is similar to the SS-Tree, with its attendant problems, except that feature space is partitioned into regions defined by the intersection of spheres and rectangles (where spheres are not wholly contained within the rectangles). The benefit of this approach is that partitions overlap to a lesser extent than in the SS-Tree.
The X-Tree, described by Stefan Berchtold, Daniel A. Keim and Hans-Peter Kriegel in “The X-Tree: An Index Structure for High-Dimensional Data”, VLDB'96, Proceedings of 22th International Conference on Very Large Data Bases, Sep. 3–6, 1996, Mumbai (Bombay), India, 1996, pages 28–39, addresses the problem of overlapping regions manifest in the R-Tree. This is achieved at the cost of allowing nodes in a tree to be of variable rather than fixed size.
If the rectangles in an overfull node cannot be partitioned into 2 roughly equal sized sub-sets whose minimum bounding boxes overlap within the limits defined in some threshold then an ‘overlap-free’ split is sought. This entails consulting a data structure which records the history of previous splits. At least one overlap-free split can always be found for a node but if it results in one of the new nodes being populated with fewer rectangles than defined in some threshold, then the original node is not split but allowed to become enlarged instead. The X-Tree is thus a hybrid between a linear array index and an R-Tree index.
Whereas spatial objects may overlap, this is clearly not the case with point data. It is debatable whether data organization methods which permit overlapping regions in the partitioning of point data do so because they are specializations of methods primarily designed for spatial data or because there is some inherent advantage in tolerating overlap.
The Pyramid-Technique, described by Stefan Berchtold, Christian Böhm and Hans-Peter Kriegel in “The Pyramid-Tree: Breaking the Curse of Dimensionality”, SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, Jun. 2–4, 1998, Seattle, Wash., USA, ACM Press, 1998, pages 142–153, partitions the data space in a 2-stage process. In the first stage, it is divided into pyramids all of whose apexes lie at the centre of the space. In the second stage, each pyramid is divided into slices, the bases of which are all hyper-planes parallel to the base of the pyramid. Each slice of a pyramid corresponds to a page of the data file.
Multi-dimensional points are transformed into one-dimensional values by a mapping which is not bijective: more than one point may map to the same value, so both the coordinates of points and their one-dimensional values must be stored, which is an overhead. A one-dimensional value designates which pyramid a point lies in and its height above its base. The paper describes a query processing algorithm which the authors acknowledge is a “complex operation”.
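The non-bijective nature of such a mapping can be seen in a minimal sketch. It assumes a unit data space [0, 1]^d with the apex at the centre, numbers the 2d pyramids by dominant dimension and side, and measures the height as the distance from the centre along the dominant dimension. These conventions are my understanding of the cited paper and may differ from it in detail; the function name `pyramid_value` is hypothetical.

```python
def pyramid_value(point):
    """Map a point in [0, 1]^d to a single value: pyramid number + height.

    Assumptions (not taken from this specification): the dominant
    dimension is the one in which the point lies furthest from the
    centre 0.5; pyramids 0..d-1 lie on the low side of that dimension
    and pyramids d..2d-1 on the high side.
    """
    d = len(point)
    j_max = max(range(d), key=lambda j: abs(0.5 - point[j]))
    height = abs(0.5 - point[j_max])
    pyramid = j_max if point[j_max] < 0.5 else j_max + d
    return pyramid + height


if __name__ == "__main__":
    # Two distinct points share the same one-dimensional value.
    print(pyramid_value((0.25, 0.5)))
    print(pyramid_value((0.25, 0.375)))
```

Since distinct points can collide in this way, the one-dimensional value alone cannot identify a point, which is why the coordinates must be stored alongside it.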
The technique can be adapted to skewed data distributions by moving the apex of all of the pyramids into the centre of a data cluster, creating asymmetrical pyramids. In practice data sets may contain more than one cluster and the locations of clusters may be dynamic. A dynamic pyramid apex location does not, however, appear to be practicable.
The manner in which pyramids are divided into slices appears to suggest that partitioning may degrade locally such that all points on a page share similar values in one dimension but potentially very diverse values in all others.
The use of the Hilbert Curve, named after David Hilbert, the German mathematician, in the indexing of multi-dimensional data has been suggested by Christos Faloutsos and Yi Rong in “DOT: A Spatial Access Method Using Fractals”, Proceedings of the Seventh International Conference on Data Engineering, Apr. 8–12, 1991, Kobe, Japan, IEEE Computer Society, 1991, pages 152–159, but no application other than that of the present inventor has been developed and, most importantly, no querying algorithm has hitherto been invented. Without a querying algorithm, enabling data to be selectively retrieved without searching the entire database, the application of the Hilbert Curve in the indexing of multi-dimensional data is of little value or viability.
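For orientation, the mapping from a point to its position along a Hilbert Curve can be sketched in two dimensions using the widely published bit-manipulation technique. This is not the querying algorithm referred to above, nor necessarily the mapping used by the present invention; it assumes a square grid whose side `n` is a power of two, and the names `rotate` and `hilbert_index` are illustrative.

```python
def rotate(n, x, y, rx, ry):
    """Rotate or flip a quadrant so that the curve orientation is preserved."""
    if ry == 0:
        if rx == 1:
            x = n - 1 - x
            y = n - 1 - y
        x, y = y, x
    return x, y


def hilbert_index(n, x, y):
    """Position of cell (x, y) along a 2-D Hilbert Curve on an n x n grid.

    Assumption: n is a power of two and 0 <= x, y < n.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        x, y = rotate(n, x, y, rx, ry)
        s //= 2
    return d


if __name__ == "__main__":
    n = 4
    cells = sorted(((x, y) for x in range(n) for y in range(n)),
                   key=lambda c: hilbert_index(n, c[0], c[1]))
    # Cells listed in curve order; each differs from the next in one coordinate by 1.
    print(cells)
```

The sketch shows only the property that makes the curve attractive for indexing: points that are adjacent in the one-dimensional ordering are also adjacent in space.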