The invention relates to computer data and file storage systems, and more particularly to a method and system for deleting search keys from a structure implementing a compact representation of a 0-complete tree.
Data and file storage systems such as a database, in particular those implemented in computer systems, provide for the storage and retrieval of specific items of information stored in the database. The information stored in the database is generally indexed such that any specific item of information in the database may be located using search keys. Searches are generally accomplished by using search keys to search through an index to find pointers to the most likely locations of the information in the database, whether that location is within the memory or the storage medium of the computer.
An index to database records within a computer is sometimes structured as a tree comprised of one or more nodes, connected by branches, which is stored within a storage means of the computer. Each node generally includes one or more branch fields containing information for directing a search, and each such branch field usually contains a pointer, or branch, to another node, and an associated branch key indicating ranges or types of information that may be located along that branch from the node. The tree, and any search of the tree, begins at a single node referred to as the root node and progresses downwards through the various branch nodes until the nodes containing either the items of information or, more usually, pointers to items of information are reached. The information related nodes are often referred to as leaf nodes or, since this is the level at which the search either succeeds or fails, failure nodes. Within a tree storage structure of a computer, any node within a tree is a parent node with respect to all nodes dependent from that node, and sub-structures within a tree which are dependent from that parent node are often referred to as subtrees with respect to that node.
The decision as to which direction, or branch, to take through a tree storage structure in a search is determined by comparing the search key and the branch keys stored in each node encountered in the search. The results of the comparisons to the branches depending from a given node are to be followed in the next step of the search. In this regard, search keys are most generally comprised of strings of characters or numbers which relate to the item or items of information to be searched for within the computer system.
The prior art contains a variety of search tree data storage structures for computer database systems, among which is the apparent ancestor from which all later tree structures have been developed and the most general form of search tree well known in the art, the xe2x80x9cB-tree.xe2x80x9d (See, for example, Knuth, The Art of Computer Programming, Vol. 3, pp. 473-479, the disclosure of which is incorporated herein by reference). A B-tree provides both primary access and then secondary access to a data set. Therefore, these trees have often used in data storage structures utilized by database and file systems. Nevertheless, there are problems that exist with the utilization of B-tree storage structures within database systems. Every indexed attribute value must be replicated in the index itself. The cumulative effect of replicating many secondary index values is to create indices which often exceed the size of the database itself. This overhead can force database designers to reject potentially useful access paths. Moreover, inclusion of search key values within blocks of the B-tree significantly decreases the block fan out and increases tree depth and retrieval time.
Another tree structure which can be implemented in computer database systems, compact 0-complete trees (i.e., C0-trees), eliminates search values from indices by replacing them with small surrogates whose typical 8-bit length will be adequate for most practical key lengths (i.e., less than 32 bytes). Thus, actual values can be stored anywhere in arbitrary order, leaving the indices to the tree structure to be just hierarchical collections of (surrogate, pointer) pairs stored in an index block. This organization can reduce the size of the indexes by about 50% to 80% and increases the branching factor of the trees, which provides a reduction in the number of disk accesses in the system per exact match query within computer database systems. (See Orlandic and Pfaltz, Compact 0-Complete Trees, Proceedings the 14th VLDB Conference, pp. 372-381, the disclosure of which is incorporated herein by reference.)
While the known method of creating C0-trees increases storage utilization 50% to 80% over B-trees, there is a waste of storage space due to the presence of dummy entries (surrogate, pointer==NIL) wherein the number of index entries at the lowest level of the tree exceeds the actual number of records stored. Therefore, the expected storage utilization of index entries of C0-trees at the lowest tree level is 0.567 versus 0.693 as in the case of B-trees.
Moreover, although B-trees and C0-tree storage structures represent efficient methods of searching for values, both methods require initial generation and subsequent maintenance of the tree data storage structure itself. Neither of these computer storage structures inherently stores information in sorted order.
A tree can be built more efficiently if the key records are initially sorted in the order of their key field, than if records are in random order. Therefore, an efficient computer database system should sort sets of keys first, and then build a tree based on keys extracted at intervals from the sorted keys. Searches of the tree data storage structure will also be performed more efficiently if the tree does not contain an excess number of keys, namely keys that are associated with data no longer in the database or keys that are no longer necessary to maintain the structure of the tree.
One method of deleting excess keys is to search the tree structure for a key, and then perform steps appropriate for deletion of the key from the tree structure. This process is then repeated for every key desired to be deleted. In effect, for every key deleted from the tree structure, a search, beginning at the root of the tree and ending when the key is located, must be performed. Such a process may impose a large burden on the computer system storing the tree structure. This is particularly the case if a large number of keys are to be deleted at the same time or during the same process, which may often be the case as a determination as to which keys should be deleted may occur only periodically or deletion operations may be planned to occur during periods of low utilization of the data storage system. Accordingly, methods for deleting significant numbers of search keys from the tree structure without performing individual searches throughout the tree structure for those search keys are desirable.
In an embodiment of the invention, a data storage structure for minimizing the amount of information required to retrieve stored data within a computer system is comprised of entries for indexing search keys. Each entry comprises a depth value and a data present indicator having two conditions, and a tree structure stored in the computer interconnecting the entries and forming the data storage structure. Search keys may be binary representations of the data records indexed by the data storage structure or may be any other attribute or set of attributes used to look up the data records. The data storage structure further comprises a means for storing a count of each of the entries associated with a search key interval range.
The described embodiment of the present invention includes novel methods for storing, accessing and retrieving data indexed by the tree data storage structure. These methods comprise a method for sequentially processing a number of search keys within the tree structure to perform a predefined function on each search key, a method for locating a search key within the tree structure, a method for storing and indexing information for each search key within the tree structure, and a method for splitting an index block of the present invention.
An embodiment of the present invention also provides additional efficiency with regards to storage utilization beyond the already stated 50% to 80% savings of C0-trees over B-trees. To alleviate the problem of waste created by C0-trees at the lowest levels, a preferred embodiment of the invention replaces storage of the (surrogate, pointer) entries as physically adjacent pairs of values with two separate physical structures within the storage means of the computer, namely: 1) an index block depths structure to a list of the surrogate values including a depth value and a data present structure indicator having two conditions and 2) a pointers structure pointing to a list of every non-NIL pointer value, these being in lexicographic order. In the preferred embodiment of the invention, a NIL pointer of the prior C0-tree data storage structure (i.e., a dummy entry) is represented by a data not present indicator bit in the value of the surrogate itself. The meta-data of each subtree also reflects the count of non-NIL entries for each subtree to accumulate an incremental lexical position for each indexed key within the pointers structure. Since the pointers structure does not contain any NIL pointers, only a bit of storage is necessary to indicate a NIL pointer and minimal meta-data is recorded. Therefore, storage utilization tends to revert back to that which is expected with a B-tree.
Moreover, to eliminate the inefficiency of traversing a multilevel tree structure, the keys to be added to the data storage structure of the preferred embodiment of the present invention are processed as a collection or more than one item in sorted order. In this way, greater locality of reference and reduction in traversal and maintenance of the nodes of the tree (including index block splitting) from the root to the leaf for each key can be realized. By determining if the next key is included in the key interval range of the current index block, processing of a predefined function can continue in the current index block or resume in its parent block. With this new method, splitting of an index block is deferred until all processing of the current block is completed or the block size is at an extreme maximum far greater than the normal threshold, thus, allowing for context retention of the subtree until all relevant keys have been added to that subtree.
Still other embodiments of the present invention will become readily apparent to those skilled in the art from the following detailed description, wherein is shown and described only embodiments of the invention by way of illustration for carrying out the invention. As will be realized, the invention is capable of other and different embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit of the invention. Accordingly, the drawings and the detailed description are to be regarded as illustrative in nature and not restrictive.