The present invention relates to a bulk loading method for use in a high-dimensional index structure; and, more particularly, to a bulk loading method of using some parts of dimensions to thereby accelerate an index construction and improve a search performance, and a computer readable recording medium in which a program implementing the method is recorded.
In general, a bulk loading method is a computational method which constructs an index by inserting a substantial amount of static high-dimensional data at one time instead of inserting one-by-one data, thereby reducing an index construction time and improving a search performance.
Conventional bulk loading algorithms are classified according to whether or not employing a sort and further divided into a bottom-up and a top-down fashion.
First of all, as methods for performing the sort, there are following methods such as NX[Roussopoulos, xe2x80x9cDirect Spatial Search On Pictorial Databases Using Packed R-Treesxe2x80x9d, Proc. ACM SIGMOD, 1985], HS[Kamel, xe2x80x9cOn Packing R-Treesxe2x80x9d, Proc. 2nd International Conference on Information and Knowledge Management, 1993], STR[Scott T. Leutenegger, xe2x80x9cSTR: A Simple and Efficient Algorithm for R-Tree Packingxe2x80x9d, Proc. 13th ICDE, 1997], TGS[Yvan J. Garcia, xe2x80x9cA Greedy Algorithm for Bulk Loading R-Treesxe2x80x9d, University of Denver Computer Science Tech., 1997], BnC[Aggarwal et al., System and Method for Construction of a Data Structure for Indexing Multidimensional Objects, U.S. Pat. No. 5,781,906], and the like.
Among them, algorithms performing the index construction using the bottom-up fashion will be first explained.
It is easy to implement an NX algorithm that constructs an index after executing a sort only for an X-axis and this algorithm has a good search performance for a point query. However, since Y-axis values are neglected and thus perimeters of parent nodes become larger, there occurs a defect to deteriorate a performance for a region query.
Therefore, HS has introduced an algorithm using a storage sequence determined by a Hilbert curve. Since the Hilbert algorithm reduces a perimeter of an entire region as managing a size of the entire region as small, it can substantially advance a performance for a region query compared with the NX algorithm.
Meanwhile, it is easily implemented a STR(Sort-Tile-Recursive) algorithm clustering the rectangles in order to minimize the number of visiting nodes during the execution of query processing and this algorithm can obtain a substantially enhanced performance compared with conventional algorithms.
However, sometimes the Hilbert algorithm can accomplish a better performance for VLSI data compared with the STR algorithm. Therefore, it can be said that the performances of the Hilbert algorithm and the STR algorithm depend on the type of given data. The above NX, HS and STR algorithms construct the index in the bottom-up fashion by using a fixed pre-processing.
On the other hand, TGS has proposed a Top-down Greedy Split (TGS) algorithm constructing a tree in the top-down fashion. The TGS algorithm uses a pre-processing for rearranging the data contained in a database and can construct a tree requiring fewer disk accesses than trees made by conventional methods. However, since a multiple rearrangement should be considered, there is a drawback in which a longer time is required to construct a tree.
In order to prevent a search performance from being deteriorated by an increase of an MBR (Minimum Bounding Region) and an overlapped region of an algorithm such as the HS algorithm using a linear ordering method, a newly proposed BnC method constructs an index tree by using two stages, i.e., binarization and compression. However, this method consumes a longer construction time because of using a large amount of sort for split dimensions in the binarization stage.
Then, as bulk loading algorithms not performing the sort, there are BFT [Van Den Bercken, xe2x80x9cA General Approach to Bulk Loading Multidimensional Index Structuresxe2x80x9d, 23rd Conf. VLDB, 1997] and UBBT[Christian Bxc3x6hm, xe2x80x9cEfficient Bulk Loading of Large High-Dimensional Indexesxe2x80x9d, Int. Conf. on Data Warehousing and Knowledge Discovery Dawak, 1999]. The bulk loading algorithm of BFT constructs an effective index structure by using a split or merging scheme instead of a data sort scheme. Specially, this algorithm can reduce an index construction time by using a buffer tree including a buffer on an index node.
The UBBT is a method composed of the bottom-up fashion without performing the sort. This method has characteristics of using a binarization method when dividing a data set instead of the sort and applying an unbalanced split method for the binarization. The conventional algorithms except for the UBBT have a disadvantage of improving a performance only for one of an index construction time and a search time.
That is, although the methods employing the sort improve the search performance, it increases an index construction time by computations required for performing the sort. On the other hand, the methods not performing the sort can reduce the index construction time since it does not perform the sort, whereas it deteriorates the search performance.
It is, therefore, an object of the present invention to provide a bulk loading method, for use in a high-dimensional index structure using some parts of dimensions based on an unbalanced binarization scheme, for reducing the index construction time and improving the search performance, and a computer readable recording medium in which a program implementing the method is recorded.
In accordance with one aspect of the present invention, there is provided a bulk loading method employed in a high-dimensional index structure for constructing an index, which comprises the steps of: (a) calculating the topology of the index by recognizing the information for the index to be constructed with a given data set; (b) splitting the given data set into sub-sets of data by repeatedly performing an establishment of a split strategy and a binarization based on the calculated topology of the index; (c) if a leaf node is derived from the sub-sets of data divided through a top-down recursive split process, reflecting a MBR of the leaf node on a higher node; and (d) if a non-leaf node is generated, repeatedly performing the steps (a) to (c) for another sub-set of data to thereby produce a final root node.
In accordance with another aspect of the present invention, there is provided a computer program product for use in an indexing unit including a process, comprising: a computer readable medium; a first program instruction code for calculating the topology of the index by recognizing the information for the index to be constructed with a given data set; a second program instruction code for splitting the given data set into sub-sets of data by repeatedly performing an establishment of a split strategy and a binarization based on the calculated topology of the index; a third program instruction code for, if a leaf node is derived from the sub-sets of data divided through a top-down recursive split process, reflecting a MBR of the leaf node on a higher node; and a fourth program instruction code for, if a non-leaf node is generated, repeatedly executing the above processes performed by the first to third program instruction codes for another sub-set of data to thereby produce a final root node.