The present invention relates to a database system and a method of organizing a data set existing in an n-dimensional cube with n greater than 1.
The so-called B tree (or also B* tree or prefix B tree) is known as a data structure for organizing extensive one-dimensional volumes of data in mass memories such as magnetic disk memories. Compared with a simple search tree, the data structure of the B tree has the advantage that shorter search times are required for data access. With a simple search tree with n nodes, the search time to locate certain data involves at least log₂(n) steps. With a search tree with 1,000,000 nodes, log₂(1,000,000) ≈ 20 disk access operations must therefore be expected. Assuming a mean access time of 0.1 s, the search for one node will require 2 s. This value is too large in practice. With the data structure of the B tree, the number of disk access operations is reduced by transferring not one single node, but a whole segment of the magnetic disk allocated to a node to the main memory and searching within this segment. If, for example, the B tree is divided into areas of seven nodes each and if such an area is transferred into the main memory with each disk access operation, the number of disk access operations for the search of a node is reduced from a maximum of 6 to a maximum of 2. With 1,000,000 nodes, only log₈(1,000,000) ≈ 7 access operations are therefore required. In practice, the search tree is normally divided into partial areas with a size of 2⁸−1 to 2¹⁰−1 nodes. With an area size of 255 nodes, log₂₅₆(1,000,000) ≈ 2.5, i.e. at most 3, disk access operations are required for the search of a node in a tree with 1,000,000 nodes, so that the search for a given value takes only around 0.3 s. The search time within a partial area with 255 nodes located in the main memory can be neglected in comparison with the disk access operation. The B tree is a vertically balanced tree in which all leaves are located at the same level.
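The access counts quoted above can be checked with a short calculation. The following is only a sketch under a simplified model that is assumed here: the tree is a complete binary tree, and each disk access loads one area that is itself a complete binary subtree, so that an area of k nodes has k + 1 areas below it. The function name disk_accesses is hypothetical.

```python
def disk_accesses(total_nodes, nodes_per_area):
    """Worst-case number of disk accesses to reach any node when every
    access loads one area of `nodes_per_area` nodes into main memory."""
    fanout = nodes_per_area + 1      # an area of k nodes has k + 1 child areas
    levels, capacity = 0, 0
    while capacity < total_nodes:
        levels += 1
        capacity = fanout ** levels - 1   # nodes covered by `levels` layers of areas
    return levels


for area in (1, 7, 255):
    accesses = disk_accesses(1_000_000, area)
    # 0.1 s per access is the mean access time assumed in the text
    print(f"area of {area:3d} nodes: {accesses:2d} accesses, about {accesses * 0.1:.1f} s")
```

Under this model, a tree of 63 nodes needs 6 accesses with single-node transfers and 2 accesses with areas of seven nodes, which matches the reduction described above.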
The so-called dd trees are known from K. Mehlhorn: Multidimensional searching and computational geometry, Springer, Heidelberg 1984, to organize a multidimensional data set. With the dd trees, three types of queries can be performed in principle, namely point queries, area queries and queries where some intervals are given as (−∞, +∞). However, the data structure of a dd tree only allows fast access for point queries, as then only one path in the tree needs to be searched. With the other queries, it is possible that the whole tree has to be searched. Moreover, dd trees are static, i.e. the whole object volume to be organized must already be known before the dd tree can be set up. However, in most applications in practice, the object volume is dynamic, i.e. it must be possible for objects to be inserted into or deleted from the tree in any order and at any time without the whole tree having to be set up again from the start. Furthermore, dd trees are only suitable for main memory applications, but not for peripheral memories which are needed to store very large volumes of data.
In "The Grid File" by Nievergelt et al., ACM TODS, Vol. 9, No. 1, March 1984, so-called grid files are described to organize multidimensional data where queries for points and areas are performed on the basis of an index structure, the so-called grid.
Although this data organization allows a fast search for point and area queries, it is a static procedure so that the total index structure has to be completely reorganized regularly when data objects are inserted or deleted dynamically. This method is thus not suitable for many applications, in particular not for online applications.
From A. Guttman: A dynamic index structure for spatial searching, Proceedings ACM SIGMOD, Intl. Conference on Management of Data, 1984, pages 47-57, so-called R trees are known as a data structure for organizing multidimensional data. These trees, which are used mainly for so-called geo-databases, are vertically balanced like B trees and also allow the dynamic insertion and deletion of objects. However, no fast access times are guaranteed for the response to queries, because under certain circumstances any number of paths in the corresponding tree, in extreme cases even the whole tree, have to be searched to answer a query. As a result, these R trees are not suitable for most online applications.
From Y. Nakamura et al: Data structures for multi-layer n-dimensional data using hierarchical structure, 10th International Conference on pattern recognition, Volume 2, Jun. 16, 1990, IEEE Computer Society Press, New Jersey, USA, pages 97-102, a splitting method is known for a multidimensional rectangular space. In the known method, a given multidimensional rectangular space is split into two sub-spaces as soon as the number of data points in the space exceeds the capacity of one data page. The splitting of the starting space is performed by cutting out a partial rectangle. The spatial structures newly created by this splitting, namely a cut-out rectangle and the rest of the starting space, are structured as layers in a BD tree, with the tree structure being created depending on the sequence in which the individual partial spaces are cut out in the event of multiple cut-out partial spaces. The BD tree structure created in such a way represents a binary tree in which it is determined at each branch node which rectangle will be cut out as the new BD partial space. This successive cutting out has the consequence that the BD tree grows downwards so that in the insertion, deletion and searching of data points, i.e. data objects, in the total space a path has to be passed through from the tree root to a leaf (branch end). Here, it is necessary to check at every intermediate node whether a point being searched for is located in the associated cut-out partial space or in the complementary rest space. The search effort can thus grow proportionally with the size of the data set, which leads to poor efficiency with large and very large data sets.
The method most widespread in practice today for organizing a multidimensional data set is based on the original one-dimensional B trees, with one B tree being used for each dimension of the starting data set so that area queries in an n-dimensional data set are supported by n B trees. In an area query, for each dimension all objects whose values are located in the interval specified in the query for this dimension are thus fetched from the peripheral memory. These data objects form the hit set in the corresponding dimension. To determine the desired answer set, the intersection of the hit sets of all dimensions must be computed, which will normally first require the sorting of these sets. When a data object is inserted or deleted, n B trees must also be searched and modified correspondingly.
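The cost of this conventional approach can be illustrated with a minimal sketch. It assumes that each one-dimensional B tree is stood in for by a sorted Python list of (value, object id) pairs searched with the standard bisect module; the names build_indexes and area_query are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_indexes(points):
    """One sorted (value, object id) list per dimension, standing in for one B tree each."""
    dims = len(points[0])
    return [sorted((p[d], i) for i, p in enumerate(points)) for d in range(dims)]


def area_query(indexes, lows, highs):
    """Fetch the hit set of every dimension, then intersect all hit sets."""
    hit_sets = []
    for index, lo, hi in zip(indexes, lows, highs):
        start = bisect_left(index, (lo, -1))              # first entry with value >= lo
        end = bisect_right(index, (hi, float("inf")))     # past last entry with value <= hi
        hit_sets.append({obj_id for _, obj_id in index[start:end]})
    return set.intersection(*hit_sets)


points = [(1, 5), (2, 2), (3, 9), (8, 3)]
indexes = build_indexes(points)
answer = area_query(indexes, lows=(1, 1), highs=(3, 5))   # ids of objects inside the query box
```

Every dimension contributes its complete hit set even when the final answer set is small, and each insertion or deletion has to touch all n index structures.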
On this basis, the object of the invention is to provide a database system and a method of organizing an n-dimensional data set which, thanks to improved access times, is, in particular, suited for use in online applications and which allows a dynamic insertion and deletion of data objects.
In accordance with the invention, a database system and a method of organizing an n-dimensional data set are proposed to achieve this object. The database system in accordance with the invention comprises a computing apparatus, a main memory and a memory device, which is in particular a peripheral memory device. The basic idea of the invention is to place a multidimensional data set to be organized in a multidimensional cube and, in order to index and store this data set by means of the computing apparatus, to perform a repeated iterative division of the multidimensional cube in all dimensions into sub-cubes. The division is repeated until successive sub-cubes can be combined into regions which each contain a set of data objects that can be stored on one of the memory pages of given storage capacity of the memory device, in particular the peripheral memory device. As the regions are combined from successive sub-cubes, the regions are also successive so that they form a one-dimensional structure. Thus, in accordance with the invention, when data objects are inserted or deleted, only the modification of one single data structure, for example a tree, is necessary.
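One possible way to obtain such a succession of sub-cubes is sketched below. It is an illustration only and rests on assumptions not stated in the text: coordinates are non-negative integers below 2**bits, every division stage halves the cube in each dimension, and the sub-cubes of one stage are numbered by interleaving one bit per coordinate (a Z-order numbering). The function name cube_address is hypothetical.

```python
def cube_address(point, bits):
    """Map an n-dimensional point to a one-dimensional address.

    Each group of n address bits selects one of the 2**n sub-cubes of one
    division stage, coarsest stage first.
    """
    address = 0
    for level in range(bits - 1, -1, -1):      # most significant bit = first division
        for coord in point:
            address = (address << 1) | ((coord >> level) & 1)
    return address


# All points of one first-stage sub-cube of a 4 x 4 cube get consecutive addresses.
lower_corner_cube = sorted(cube_address((x, y), 2) for x in (0, 1) for y in (0, 1))
```

With such a numbering, the points of any sub-cube occupy one contiguous address interval, which is what allows successive sub-cubes to be combined into successive regions.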
In one embodiment of the invention, the storing of the data objects of a region on one memory page of given storage capacity is performed while allocating a pointer to the memory page and an address defining the region borders. Thus, each region to be stored has allocated to it clear addresses defining the region borders and a pointer pointing to the memory page on which the corresponding region is stored. In this way, the locating of the region and the data objects contained in the region is simplified in organizational routines such as the answering of queries and the deletion or insertion of data objects.
In another embodiment of the invention, the storage of the pointer and the address is made in a B tree, B* tree or prefix B tree so that in an address search, a simple search, which can be performed quickly, can be made in a B tree to identify the required region through the pointer allocated to the address and pointing to the memory page of the required region.
In another embodiment of the invention, the storage of the data objects themselves is made in the leaf pages of the B tree, B* tree or prefix B tree.
In an advantageous embodiment of the invention, the address defining the region borders consists of data on the last of the sub-cubes forming the region. A database system has proved to be very advantageous in which the address comprises data on the number of sub-cubes contained in each division stage in the region. A region is thus clearly defined if the last sub-cube fully contained in the region is also clearly defined by the address data. The start of the region is here given by the address data on the last of the sub-cubes forming the previous region.
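The lookup of a region through its limiting address can be sketched as follows. The sketch assumes that addresses are plain integers and that a sorted list searched with bisect stands in for the B tree; the concrete values and the names region_ends, page_ids and find_page are hypothetical.

```python
from bisect import bisect_left

# Hypothetical index: the last address of each region, in ascending order,
# and the pointer (here a page name) allocated to that address.
region_ends = [11, 27, 40, 63]
page_ids = ["page_a", "page_b", "page_c", "page_d"]


def find_page(address):
    """Return the page of the region whose address interval contains `address`.

    A region starts right after the end of the previous region, so the first
    stored end address that is >= `address` identifies the region.
    Assumes 0 <= address <= region_ends[-1].
    """
    slot = bisect_left(region_ends, address)
    return page_ids[slot]
```

For example, find_page(12) yields the second page, because address 12 lies after the end of the first region (11) and not after the end of the second region (27).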
In an embodiment, the method in accordance with the invention is used to index and store a multidimensional data set. For this purpose, said data set is placed in an n-dimensional cube with n greater than 1. This cube forms in its totality a starting region containing all data objects of the data set. If the number of existing data objects is smaller than or equal to the number of data objects corresponding to the given storage capacity of a memory page, the starting region is stored on one memory page. Otherwise, the starting region is split along a splitting address, with the splitting address being chosen so that two new partial regions are generated roughly along the data center. Each of these partial regions is then treated in the same way as the starting region before, i.e. the number of data objects contained in the respective partial region is determined and compared with the number corresponding to the given storage capacity of a memory page. If the number of data objects is not larger than the number corresponding to the given storage capacity, the corresponding region is stored on one memory page; otherwise it is again split along the data center and the process begins afresh.
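The recursive splitting described above can be sketched in a few lines. The sketch assumes that every data object is already represented by a distinct integer address, that the addresses are sorted, and that a memory page is modelled as a Python list; the name split_into_regions is hypothetical.

```python
def split_into_regions(addresses, capacity):
    """Split a sorted list of object addresses into regions of at most
    `capacity` objects, always splitting at the data center."""
    if len(addresses) <= capacity:
        return [addresses]                    # region fits on one memory page
    middle = len(addresses) // 2              # data center of this region
    return (split_into_regions(addresses[:middle], capacity)
            + split_into_regions(addresses[middle:], capacity))


regions = split_into_regions(list(range(10)), capacity=3)
# The last address of each left-hand part can serve as the splitting address.
splitting_addresses = [region[-1] for region in regions[:-1]]
```

Because each split halves the number of objects, the recursion depth grows only logarithmically with the size of the data set.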
Advantageously, the storage of the data objects of a region or partial region is made in parallel with the storage of an address allocated to the corresponding region and of a pointer allocated to the address and pointing to the memory page on which the stored data objects are contained. The address to be stored in parallel can advantageously be the splitting address giving the end of the one and the beginning of the other region.
In an embodiment of the invention, the storage of the address and the pointer is made in a B tree, B* tree or prefix B tree, with in each case regions being defined by successive addresses, the data objects of which regions are each stored on one memory page of given storage capacity.
In an embodiment of the method in accordance with the invention, a method is proposed for the insertion of data objects. Advantageously, the stored n-dimensional data set is a data set indexed and stored in accordance with the method in accordance with the invention described above. In accordance with the invention, a region of the n-dimensional data set containing the data object and the memory page on which this region is stored is determined on the basis of the coordinates of the data object to be inserted. Subsequently, the data objects stored on this memory page are counted. If the number of data objects stored is smaller than the number corresponding to the given storage capacity of the memory page, the data object to be inserted is also stored on this memory page. Otherwise a splitting address is selected for the region stored on this memory page in such a way that by splitting the region along this splitting address, a first and second partial region are generated in which in each case less than around half the number of data objects corresponding to the given memory capacity is contained. Then, the data object to be inserted is inserted in that partial region in which the coordinates of the data object lie whereupon the first and the second partial regions are stored on one memory page each.
In accordance with the invention, the dynamic insertion of data objects in the given data structure is thus possible without the totality of the data structure having to be modified or created anew. If as a result of the insertion of the new data object, the region in which the insertion was made can no longer be stored on one memory page, this region is split into two further regions, whereby only the corresponding region to be split into two further regions or the partial regions newly created by the splitting have to be modified and stored anew.
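A minimal sketch of this insertion procedure follows. It assumes integer addresses, a sorted list of region end addresses standing in for the B tree, pages modelled as sorted lists, and a capacity of at least 2; the function name insert and the sample values are hypothetical.

```python
from bisect import bisect_left, insort


def insert(ends, pages, address, capacity):
    """Insert `address` into the region containing it, splitting a full page.

    `ends[i]` is the last address of region i and `pages[i]` its sorted content.
    Assumes capacity >= 2 and 0 <= address <= ends[-1].
    """
    slot = bisect_left(ends, address)         # locate region and memory page
    page = pages[slot]
    if len(page) < capacity:
        insort(page, address)                 # page still has room
        return
    middle = len(page) // 2                   # split the full region at its data center
    first, second = page[:middle], page[middle:]
    split_address = first[-1]
    insort(first if address <= split_address else second, address)
    # Only the split region is replaced; all other regions stay untouched.
    ends[slot:slot + 1] = [split_address, ends[slot]]
    pages[slot:slot + 1] = [first, second]


ends, pages = [63], [[5, 20, 41, 50]]
insert(ends, pages, 30, capacity=4)           # page is full, so the region is split
insert(ends, pages, 7, capacity=4)            # fits into the first partial region
```

Here the splitting address becomes the limiting address of the first partial region, while the second partial region keeps the limiting address of the region that was split.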
Advantageously, the locating of the memory page is made in the method in accordance with the invention for the insertion of data objects by means of addresses and pointers stored in a B tree, B* tree or prefix B tree and allocated to the memory pages. In this way, the locating of the desired memory page can be particularly simple and fast. Accordingly, it has proved to be advantageous if the storing of the newly created partial regions is made while replacing the prior pointer and the address of the split region by in each case the addresses and pointers allocated to the first and second partial regions. Here, for example, the splitting address can be used as the limiting address for the first partial region and the limiting address of the split region can be used for the second partial region.
It has proven to be particularly advantageous if the storage of the address and the pointer is performed in a B tree, B* tree or prefix B tree, with regions being defined in each case by successive addresses, the data objects of which regions are stored in each case on a memory page of given storage capacity.
In an embodiment of the method in accordance with the invention, a method is proposed for the deletion of data objects. Accordingly, on the basis of the coordinates of the data object to be deleted, that region of the n-dimensional data set containing the data object and the memory page on which this region is stored is determined and the object to be deleted is deleted from this memory page. Subsequently, the number of the data objects stored on this memory page is determined and the region is merged with one of its two neighboring regions if the number of stored data objects is smaller than roughly half the number corresponding to the given storage capacity of the memory page. Then, in turn, the number of the data objects present in the region newly created by the merger is determined. If this number is not greater than the number corresponding to the given storage capacity of a memory page, the region is stored on one memory page, otherwise a splitting address is selected for the region in such a way that by splitting along the splitting address, a first partial region and a second partial region are generated which each contain around half of the data objects contained in the region to be split, whereupon the partial regions created are stored on one memory page each.
Advantageously, in this method, too, the locating of the memory page is made by means of addresses and pointers stored in a B tree, B* tree or prefix B tree and allocated to the memory pages.
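The deletion procedure with merging can be sketched in the same simplified model. The sketch assumes integer addresses, a sorted list of region end addresses standing in for the B tree, pages modelled as sorted lists, and that the object to be deleted exists; the function name delete and the sample values are hypothetical.

```python
from bisect import bisect_left


def delete(ends, pages, address, capacity):
    """Delete `address`; merge an underfilled region with a neighbor and
    split the merged region again if it does not fit on one page."""
    slot = bisect_left(ends, address)
    pages[slot].remove(address)               # raises ValueError if not stored
    if len(pages[slot]) >= capacity // 2 or len(pages) == 1:
        return                                # still at least half full, or no neighbor
    left = slot - 1 if slot > 0 else slot     # merge regions `left` and `left + 1`
    merged = pages[left] + pages[left + 1]
    if len(merged) <= capacity:
        pages[left:left + 2] = [merged]
        ends[left:left + 2] = [ends[left + 1]]
    else:
        middle = len(merged) // 2             # split again at the data center
        pages[left:left + 2] = [merged[:middle], merged[middle:]]
        ends[left] = merged[middle - 1]       # new splitting address


ends, pages = [20, 63], [[5, 20], [30, 41, 50]]
delete(ends, pages, 5, capacity=4)            # first region underfilled: merged into one page
```

As with insertion, only the affected region and one neighbor are rewritten; the rest of the index is left unchanged.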
In an embodiment of the method in accordance with the invention, a method is proposed for the performance of a data query on the basis of a given n-dimensional query area.
Accordingly, the coordinates of the lowest and the highest point of intersection of the given query area with the n-dimensional data set are determined, as is that region in which the lowest point of intersection lies. Then, the memory page is located on which the determined region is stored and all data objects stored on this memory page which intersect the query area are determined. The data objects determined are then output. Then, the sub-cube of the determined region which is the last in the sequence and which intersects the query area is determined, and the data query is ended if the highest point of intersection of the query area is in this sub-cube. Otherwise, the next sub-cube of the same plane and of the same next higher cube is determined which intersects the query area, and the coordinates of the lowest point of intersection of the query area with the newly determined sub-cube are determined, whereupon the process is continued with the determining of that region in which the lowest point of intersection lies if a sub-cube was determined. Otherwise, the next sub-cube of the plane of the next higher cube is determined which intersects the query area and the determination of the next sub-cube of the same plane and of the same next higher cube intersecting the query area is performed with the sub-cubes of the newly determined cube. If no sub-cube of the plane of the next higher cube is determined, the next higher cube assumes the role of the sub-cube and then the next sub-cube of this plane and of the same next higher cube which intersects the query area is determined. Thus, in accordance with the invention, the sub-cubes of all relevant next higher cubes and in turn, their next higher cubes are examined successively with respect to intersection sets of data objects with the query area.
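The overall shape of such a query can be sketched as follows. This is a deliberately simplified variant, not the procedure described above: it visits every region between the addresses of the lowest and the highest corner of the query area and filters the stored objects, without skipping sub-cubes that lie outside the query area. It assumes integer coordinates and a bit-interleaved (Z-order) address, which is an assumption made for illustration; all names and sample data are hypothetical.

```python
from bisect import bisect_left


def cube_address(point, bits):
    """Interleave coordinate bits, coarsest division stage first (assumed numbering)."""
    address = 0
    for level in range(bits - 1, -1, -1):
        for coord in point:
            address = (address << 1) | ((coord >> level) & 1)
    return address


def range_query(ends, pages, lows, highs, bits):
    """Return all stored points inside the box [lows, highs]."""
    first = bisect_left(ends, cube_address(lows, bits))    # region of the lowest corner
    last = bisect_left(ends, cube_address(highs, bits))    # region of the highest corner
    hits = []
    for page in pages[first:last + 1]:
        hits.extend(p for p in page
                    if all(lo <= c <= hi for lo, c, hi in zip(lows, p, highs)))
    return hits


# Sample data: a 4 x 4 grid of points stored four per page in address order.
BITS = 2
POINTS = sorted(((x, y) for x in range(4) for y in range(4)),
                key=lambda p: cube_address(p, BITS))
PAGES = [POINTS[i:i + 4] for i in range(0, len(POINTS), 4)]
ENDS = [cube_address(page[-1], BITS) for page in PAGES]
```

The simplification is safe under the assumed numbering because the address grows whenever any single coordinate grows, so every point of the query area has an address between those of the two corners. The procedure described in the text improves on this by jumping directly to the next sub-cube that intersects the query area.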