1. Field of the Invention
This invention relates to a method and apparatus for indexing data in a computer system, and more particularly to a method and apparatus for reducing the time required for searching data in a computer data storage system by using a key index tree structure.
2. Description of Related Art
In the computer arts, data is stored in a storage system having storage devices, such as magnetic disks. For very large databases, it is extremely inefficient and time consuming to search all data records in the storage system in order to find a particular record. A more efficient method is to create a search key for each data record that uniquely identifies the record. Each search key is associated with a data pointer that indicates the location in the computer storage system of the data record associated with the search key. A common type of pointer is a relative record number. Through the use of such pointers, the data records themselves need not be kept in sequential order, but may be stored in random locations in the computer storage system. A search for a particular data record is accelerated by sequentially searching a compiled index of such search keys, rather than the data records themselves. However, this method is still rather inefficient and time consuming.
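The sequential key index described above can be illustrated with a minimal sketch. The record contents, the comma-delimited layout, and the name `find_sequential` are hypothetical and chosen only for illustration; they do not come from any particular system.

```python
# Hypothetical data records, stored in arbitrary (non-sequential) order.
# The position of each record in the list plays the role of its
# relative record number.
records = ["Smith,J", "Adams,K", "Young,P", "Baker,T"]

# Compiled index: (search key, relative record number) pairs in key order.
index = sorted((rec.split(",")[0], rrn) for rrn, rec in enumerate(records))

def find_sequential(index, key):
    """Scan the index entry by entry.

    Returns (relative record number, number of comparisons made),
    or (None, number of comparisons made) if the key is absent.
    """
    for comparisons, (k, rrn) in enumerate(index, start=1):
        if k == key:
            return rrn, comparisons
    return None, len(index)

rrn, comparisons = find_sequential(index, "Young")
print(records[rrn], comparisons)
```

In this sketch the number of comparisons grows in proportion to the number of index entries, which is the inefficiency that the tree structures discussed next are meant to address.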
A much more efficient search method for such an index is to create a tree structure, rather than a sequential file, for the search keys. One such tree structure is a "B-tree". The use of B-trees to structure indexes for data files in computer data storage systems is well known in the prior art. (See, for example, Knuth, The Art of Computer Programming, Vol. 3, pp. 473-479.) A B-tree consists of nodes, each of which is either a leaf node or a branch node. A branch node contains one or more search keys and associated pointers (such as relative record numbers) to other branch nodes and/or leaf nodes. A leaf node contains search keys and pointers to data records. One node in the tree is the root node, which can be either a leaf node (only for a tree with a single node) or, more generally, a branch node. In both branch and leaf nodes, the number of pointers is always one greater than the number of search keys. The "height" of a tree is the number of branches on the longest path from the root node to a leaf node.
In the simplest B-tree, each node contains one search key and two associated pointers. Such a tree structure, sometimes referred to as a binary tree, theoretically provides a very efficient search method. If the number of nodes in this type of tree is equal to or less than 2^n, then only "n" searches are required to locate a data record pointer in any leaf node.
Typically, most databases are stored on relatively slow storage devices, such as magnetic disks. The time required to access any item of data (such as a tree node) on such a storage device is dominated by the "seek" time required for the storage unit to physically locate the desired storage address. Following each seek, the contents of a node may be read into the high-speed memory of the computer system. In a simple binary tree, for each access of a node, only a two-way decision (to the left or right branch from that node) can be made since the node contains only one search key.
If, instead of containing only one search key per node, a node contains several search keys, then for each storage device seek operation, several keys will be read into the high-speed memory of the computer system. With one search key per node, a comparison can determine that the item sought is in one half of the remainder of the tree. With "n-1" search keys per node, the search can be narrowed to one "nth" (1/n) of the remainder of the tree. This type of structure is known in the prior art as a "multi-way" tree.
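The effect of this narrowing on the number of node accesses can be estimated with a short sketch. It assumes an idealized tree in which every node is full and each access narrows the search to exactly 1/n of the remainder; the function name `levels_needed` and the entry count of one million are hypothetical.

```python
def levels_needed(num_entries, keys_per_node):
    """Estimate how many node accesses are needed to narrow a search
    from num_entries candidates down to one, assuming every node holds
    keys_per_node search keys and therefore offers keys_per_node + 1
    branches (an idealized, completely full tree)."""
    fanout = keys_per_node + 1
    levels, reach = 0, 1
    # Each additional level multiplies the number of reachable entries
    # by the fanout.
    while reach < num_entries:
        reach *= fanout
        levels += 1
    return levels

for keys in (1, 7, 60):
    print(keys, "keys per node:", levels_needed(1_000_000, keys), "accesses")
```

Under these assumptions, one key per node requires about 20 accesses for one million entries, seven keys per node requires 7, and sixty keys per node requires 4. Since each access may involve a storage device seek, the difference is significant on slow devices.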
For the purpose of limiting the number of accesses to the storage system, it is generally advantageous to have as many search keys as possible per node. Thus, for each seek that reads a node, several search keys can be examined sequentially and a more efficient determination can be made as to the location of the next node, or data record in the case of a leaf node. The height of the tree, and hence the search time in storage systems with relatively slow access speeds, is dramatically decreased if the number of search keys per node is increased. However, especially in high speed data storage devices presently available, there is a point at which the time required to search sequentially through a node becomes substantial with respect to the time required to access an item of data.
An example of a system in which a number of keys are stored in each branch and leaf node is a database system that operates on the IBM System/34 computer. In that system, each node is 256 bytes long, corresponding to that computer system's magnetic disk sector size. In this example computer system, the key length is up to 29 bytes. Using 3-byte relative record numbers for pointers, the maximum number of search keys that can be inserted into each node of that system is eight. For that computer system, it is advantageous to have a search tree structure that contains more than eight search keys per node.
A system which provides such a tree structure using a variation of the B-tree called a "Bit-tree" is described in U.S. Pat. No. 4,677,550, assigned to the assignee of the present invention. A Bit-tree is similar to a B-tree in that it consists of leaf nodes and branch nodes, with one of the nodes in the tree being the root node. Branch nodes are essentially identical to branch nodes in a standard B-tree. Typically, the root node is not larger than any other branch node. In one such system, each node is 256 bytes long, and the system uses 3-byte relative record numbers for pointers. Thirteen bytes per node are used for system information purposes. The remaining 243 bytes of each node can be used for search keys and their associated relative record numbers. Under this system, if "k" is the length of a search key, and each search key is associated with a 3-byte relative record number, then the maximum number of search keys per node is (256-13)/(k+3), rounded down to a whole number. As implemented in the IBM System/34 computer, the maximum number of maximum length search keys per branch node is therefore seven (k=29 bytes).
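The branch node capacity formula above can be checked with a few lines of arithmetic. The constant and function names below are hypothetical; the byte counts are those stated in the text.

```python
NODE_SIZE = 256      # bytes per node (one disk sector)
SYSTEM_BYTES = 13    # bytes per node reserved for system information
POINTER_SIZE = 3     # bytes per relative record number

def max_keys_per_branch_node(key_length):
    """Whole number of (search key + pointer) entries that fit in the
    243 usable bytes of a node, i.e. (256-13)/(k+3) rounded down."""
    return (NODE_SIZE - SYSTEM_BYTES) // (key_length + POINTER_SIZE)

print(max_keys_per_branch_node(29))  # maximum-length 29-byte keys
print(max_keys_per_branch_node(10))  # shorter keys allow more entries
```

For 29-byte keys this gives 243/32, which rounds down to seven entries; for 10-byte keys it gives 243/13, which rounds down to eighteen.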
The principal difference between a Bit-tree and a standard B-tree is the use of "distinction-bits" in place of search keys in all leaf nodes. A distinction-bit is determined by comparing two adjacent search keys, and calculating the ordinal number of the highest order bit that is different between the two keys. (In the '550 invention, the binary number "1000" is added to each distinction-bit in order to simplify the search method). If the maximum search key length permitted is 29 bytes, and there are eight bits per byte, the maximum length of a search key is 232 bits. Thus, the ordinal number representing any one of those 232 positions need only be eight bits, or one byte, in length (even taking into account the 8-count displacement added into each distinction-bit).
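The calculation of a distinction-bit can be sketched as follows. This is an illustration under stated assumptions, not the method of the '550 patent: bit positions are assumed to be counted from 0 at the highest order bit of the first byte, keys of unequal length are assumed to be padded with zero bytes, and the function name `distinction_bit` is hypothetical. The displacement of 8 corresponds to the binary number "1000" mentioned above.

```python
def distinction_bit(key_a: bytes, key_b: bytes, displacement: int = 8):
    """Return the ordinal number of the highest order bit at which two
    keys differ, plus a displacement, or None if the keys are identical.

    Assumptions (not taken from the source): position 0 is the most
    significant bit of the first byte, and the shorter key is padded
    on the right with zero bytes.
    """
    length = max(len(key_a), len(key_b))
    a = key_a.ljust(length, b"\x00")
    b = key_b.ljust(length, b"\x00")
    for byte_index, (x, y) in enumerate(zip(a, b)):
        diff = x ^ y  # bits set where the two bytes differ
        if diff:
            # bit_length() is 8 if the top bit of the byte differs and
            # 1 if only the lowest bit differs.
            bit_in_byte = 8 - diff.bit_length()
            return byte_index * 8 + bit_in_byte + displacement
    return None

print(distinction_bit(b"CAT", b"CAR"))
# Largest possible value for 29-byte keys: they differ only in the last bit.
print(distinction_bit(b"\x00" * 29, b"\x00" * 28 + b"\x01"))
```

Under these assumptions, the largest value for 29-byte keys is 231 + 8 = 239, which fits in a single byte, consistent with the one-byte size stated above.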
Distinction-bits along with their associated relative record numbers are inserted in each leaf node instead of search keys. The maximum number of one-byte distinction-bit entries plus relative record numbers in such a leaf node is therefore 243/(1+3), rounded down, or 60, for keys with length of 29 bytes. This use of distinction-bits is the principal advantage of Bit-trees. Because each branch refers to dozens of descendant nodes, almost all nodes in a tree structure are leaf nodes. Since Bit-tree leaf nodes contain more entries than nodes containing standard search keys, there are fewer nodes in the tree to seek and read. Further, less storage space is required for the tree itself, since more information is packed into fewer leaf nodes. Thus, a computer system using a Bit-tree structure is significantly more efficient than one using prior art B-tree search tree structures.
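The reduction in the number of leaf nodes can be estimated with a rough sketch. It assumes that every leaf node is completely full, that both kinds of leaf node have 243 usable bytes, and that the index covers one million records; these figures and the function name `leaf_nodes_needed` are hypothetical and serve only to illustrate the comparison.

```python
USABLE_BYTES = 243   # 256-byte node minus 13 system bytes
POINTER_SIZE = 3     # bytes per relative record number

def leaf_nodes_needed(num_records, entries_per_leaf):
    """Number of completely full leaf nodes needed to hold one entry
    per record (integer division rounded up)."""
    return (num_records + entries_per_leaf - 1) // entries_per_leaf

# Entries per leaf with one-byte distinction-bits versus 29-byte keys.
bit_tree_entries = USABLE_BYTES // (1 + POINTER_SIZE)
key_entries = USABLE_BYTES // (29 + POINTER_SIZE)

print(bit_tree_entries, leaf_nodes_needed(1_000_000, bit_tree_entries))
print(key_entries, leaf_nodes_needed(1_000_000, key_entries))
```

Under these assumptions, one million records need 16,667 leaf nodes at 60 entries per leaf, compared with 142,858 leaf nodes at 7 entries per leaf.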
However, as the access rate of data storage units on which a database can be stored has increased, the time required to sequentially search a leaf node has become more and more significant. Therefore, there is a need for a method and apparatus for sorting and searching for data which accelerates the search through leaf nodes of a key index tree structure in a database.
The following invention provides such a method and apparatus.