1. Field of the Invention
This invention relates to a method for searching data in a computer data storage system, and more particularly to an improved method for implementing a data index tree structure and an improved method for searching such a structure.
2. Description of the Prior Art
In the computer arts, data is stored in some form of storage system, such as magnetic disks. For very large data bases, it is extremely inefficient and time consuming to search all data records in the storage system in order to find a particular record. A more efficient, but still cumbersome and time consuming, method is to create a search key for each data record that uniquely identifies the record. Each search key is associated with a data pointer that indicates the location in the computer storage system of the data record associated with the search key. A common type of pointer is a relative record number. Through the use of such pointers, the data records themselves need not be kept in sequential order, but may be stored in random locations in the computer storage system. A search for a particular data record is speeded up by sequentially searching a compiled index of such search keys, rather than the data records themselves.
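The sequential index described above can be illustrated with a minimal sketch. All names here (`records`, `index`, `find_record`) are hypothetical, and an in-memory list stands in for the storage system; the list position plays the role of the relative record number.

```python
# Data records stored in arbitrary (non-sequential) order.
# The list position stands in for a relative record number (assumption).
records = ["record for SMITH", "record for ADAMS", "record for JONES"]

# Compiled index: (search key, relative record number) pairs.
index = [("ADAMS", 1), ("JONES", 2), ("SMITH", 0)]

def find_record(key):
    """Scan the index sequentially; return the matching record or None."""
    for entry_key, relative_record_number in index:
        if entry_key == key:
            return records[relative_record_number]
    return None
```

Only the small index entries are scanned; the data records themselves are touched once, through the pointer.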
A much more efficient search method for such an index is to create a tree structure, rather than a sequential file, for the search keys. One such tree structure is a "B-tree". The use of B-trees to structure indexes for data files in computer data storage systems is well known in the prior art. (See, for example, Knuth, The Art of Computer Programming, Vol. 3, pp. 473-479). A B-tree consists of nodes which can be either leaf nodes or branch nodes. A branch node contains search keys and associated pointers (such as relative record numbers) to other nodes. A leaf node contains pointers to data records. One node in the tree is the root node, which can be either a leaf node (only for a tree with a single node) or a branch node. In both branch and leaf nodes, the number of pointers is always one greater than the number of search keys. The "height" of a tree is the number of branches on the longest path from the root node to a leaf node.
In the simplest B-tree, each node contains one search key and two associated pointers. Such a tree structure, sometimes referred to as a binary tree, theoretically provides a very efficient search method. If the number of nodes in this type of tree is equal to or less than 2^n, then only "n" searches are required to locate a data record pointer in any leaf node.
In practice, a simple binary tree is inefficient. Most data bases are stored on relatively slow storage devices, such as magnetic disks. The time required to access any item of data (such as a tree node) on such a storage device is dominated by the "seek" time required for the storage unit to physically locate the desired storage address. Following each seek, the contents of a node may be read into the high-speed memory of the computer system. In a simple binary tree, for each access of a node, only a two-way decision (to the left or right branch from that node) can be made since the node contains only one search key.
If instead of containing only one search key per node, a node contains several search keys, then for each seek operation, several keys will be read into the high speed memory of the computer system. With one search key per node, a comparison and determination can be made that the item sought for is in one half of the remainder of the tree. With "n-1" search keys per node, the search can be narrowed to "1/n" of the remainder of the tree. This type of structure is known in the prior art as a "multi-way" tree.
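As a rough sketch of the multi-way decision, the function below picks one of "n" child pointers from a node holding "n-1" sorted search keys. The name `choose_child` and the boundary convention (a target equal to a key goes to the child on that key's left) are assumptions made for illustration; the text does not specify them.

```python
from bisect import bisect_left

def choose_child(keys, children, target):
    """Select the child pointer to follow for `target`.

    keys     -- the node's n-1 search keys, in ascending order
    children -- the node's n pointers (one more than the keys)
    Assumed convention: children[i] covers targets <= keys[i];
    the last child covers everything greater than the last key.
    """
    assert len(children) == len(keys) + 1
    return children[bisect_left(keys, target)]
```

With three keys, a single node read narrows the search to one of four subtrees, where a one-key node would narrow it to one of two.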
It is advantageous to have as many search keys as possible per node. Thus, for each seek of a node, several search keys can be examined and a more efficient determination can be made as to the location of the next node or, in the case of a leaf node, of a data record. The height of the tree, and hence the search time, is dramatically decreased if the number of search keys per node is increased.
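The effect on tree height can be estimated with a simplified calculation. The sketch below assumes an idealized tree in which every node is completely full and has the same number of ways; `levels_needed` is a hypothetical helper, not part of the described system.

```python
def levels_needed(entries, ways_per_node):
    """Smallest number of levels h such that ways_per_node**h >= entries.

    Idealized: assumes every node is full and has the same number of ways.
    """
    levels, reach = 0, 1
    while reach < entries:
        reach *= ways_per_node
        levels += 1
    return levels

binary_levels = levels_needed(1_000_000, 2)     # two-way decision per node
multiway_levels = levels_needed(1_000_000, 8)   # eight-way decision per node
```

Under these assumptions, one million entries need 20 levels with two-way nodes but only 7 with eight-way nodes, and each level saved is one fewer seek on the storage device.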
In many prior art systems, a number of complete search keys, along with their associated pointers, are stored in each node of a search tree. For example, in the IBM System/34 computer, each node is 256 bytes long, corresponding to that computer system's magnetic disk sector size. In this example computer system, the maximum key length permitted is 29 bytes. Using 3-byte relative record numbers for pointers, the maximum number of search keys that can be inserted into each node of that system is eight. Thus, for that computer system, it would be very advantageous to devise a search tree structure that contained more than eight search keys per node.
The present system provides just such an improved tree structure, using a variation of the B-tree called a "Bit-tree". A Bit-tree is similar to a B-tree in that it consists of leaf nodes and branch nodes, with one of the nodes in the tree being the root node. In the present invention, branch nodes are essentially identical to branch nodes in a standard B-tree. (In the preferred embodiment of the invention, the root node cannot be any larger than any other branch node.) For the sake of example only, the inventive Bit-tree system is described in terms of its implementation on an IBM System/34 computer. Thus, each node is 256 bytes long, and the inventive system uses 3-byte relative record numbers for pointers. Thirteen bytes per node are used for system information purposes. The remaining 243 bytes of each node can be used for search keys and their associated relative record numbers. If "k" is the length of a search key in bytes, then the maximum number of search keys per node is 243/(k+3), rounded down. The maximum number of maximum length search keys per branch node is therefore seven (k=29 bytes).
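The capacity arithmetic for the example system can be checked with a short sketch. The constant and function names are hypothetical; the byte counts are the ones given above.

```python
NODE_BYTES = 256      # node size (one disk sector in the example system)
SYSTEM_BYTES = 13     # reserved per node for system information
POINTER_BYTES = 3     # one relative record number

def max_keys_per_node(key_bytes):
    """Whole (key + pointer) entries that fit in the usable part of a node."""
    usable = NODE_BYTES - SYSTEM_BYTES          # 243 bytes
    return usable // (key_bytes + POINTER_BYTES)
```

For the maximum key length of 29 bytes this gives 243 // 32 = 7 entries; shorter keys fit more, for example 18 entries for 10-byte keys.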
The principal difference between the inventive Bit-tree and standard B-trees is the use of "distinction bits" in place of search keys in all leaf nodes. A distinction bit is determined by comparing two search keys, and calculating the ordinal number of the first bit that is different between the two keys. (In the preferred embodiment, the binary number "1000" is added to each distinction bit in order to simplify the search method). In the example under consideration, the maximum search key length permitted is 29 bytes, and since there are eight bits per byte, the maximum length of a search key is 232 bits. Thus, the ordinal number representing any one of those 232 positions need only be eight bits, or one byte, in length (even taking into account the 8-count displacement added into each distinction bit).
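A minimal sketch of the distinction bit calculation follows. It assumes that both keys have the same length, that bit positions are counted from zero starting at the most significant bit of the first byte, and that the displacement of 8 (binary "1000") is simply added to that position; the function names are hypothetical and the numbering convention is not confirmed by the text.

```python
DISPLACEMENT = 0b1000  # the 8-count displacement added in the preferred embodiment

def first_differing_bit(key_a: bytes, key_b: bytes):
    """Zero-based position of the first bit that differs, or None if equal.

    Assumes equal-length keys; bit 0 is the most significant bit of byte 0.
    """
    for byte_index, (a, b) in enumerate(zip(key_a, key_b)):
        diff = a ^ b
        if diff:
            # bit_length() locates the highest set bit of the XOR
            return byte_index * 8 + (8 - diff.bit_length())
    return None

def distinction_bit(key_a: bytes, key_b: bytes):
    """First differing bit position plus the displacement (assumed form)."""
    position = first_differing_bit(key_a, key_b)
    return None if position is None else position + DISPLACEMENT
```

For the keys `b"CAT"` and `b"CAR"`, the first two bytes match and the third bytes (0x54 and 0x52) first differ at bit 5, so the position is 21 and the stored value is 29. For 29-byte keys the largest possible value under this convention is 231 + 8 = 239, which fits in a single byte.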
In each leaf node, instead of search keys, distinction bits along with their associated relative record numbers are inserted. In the example computer system, the maximum number of distinction bit entries plus relative record numbers in a leaf node is therefore 243/(1+3), or 60 after rounding down, regardless of the length of the actual key itself. This use of distinction bits is the principal advantage of Bit-trees. Since almost all nodes in a tree structure are leaf nodes, and since Bit-tree leaf nodes contain more entries than nodes containing standard search keys, there are fewer nodes in the tree to seek and read. Further, less storage space is required for the tree itself, since more information is packed into fewer leaf nodes. Thus, a computer system using the present invention for a search tree structure is significantly more efficient than prior art B-tree search tree structures.
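To give a feel for the saving, the sketch below counts the leaf nodes needed to hold a given number of entries under two assumptions: every leaf is completely full, and a conventional leaf storing full 29-byte keys has the same 243 usable bytes as a Bit-tree leaf. The figure of 10,000 records and the helper name are illustrative only.

```python
USABLE_BYTES = 243    # usable bytes per node in the example system
POINTER_BYTES = 3     # one relative record number

def leaf_nodes_needed(record_count, entry_key_bytes):
    """Leaf nodes required if every leaf is packed full (idealized)."""
    per_leaf = USABLE_BYTES // (entry_key_bytes + POINTER_BYTES)
    return -(-record_count // per_leaf)   # ceiling division

full_key_leaves = leaf_nodes_needed(10_000, 29)   # 29-byte keys: 7 per leaf
bit_tree_leaves = leaf_nodes_needed(10_000, 1)    # 1-byte distinction bits: 60 per leaf
```

Under these assumptions, 10,000 records need 1,429 leaves with full keys but only 167 with one-byte distinction bits.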