1. Field of the Invention
This invention relates to a file storage and retrieval technique for processing alphanumeric information that has particular advantages in implementing database queries.
2. Brief Description of the Prior Art
In the computer arts, data is typically stored in some form of non-volatile storage system, such as magnetic disks, in the form of data files. These files are subdivided into data records, which are subsets of the file itself. Processing is done within each file by accessing the data records. Users conduct transactions against a file, inserting, deleting, retrieving and updating data records.
For very large data bases it is extremely inefficient and time-consuming to sequentially search all data records in a file in order to find a particular record to access, modify or delete, or to locate the appropriate place to add a new record.
A more efficient, but still cumbersome and time-consuming, search method requires creating a search key for each data record which uniquely identifies the record. Each search key is associated with a record pointer that indicates the location in the computer storage system of the data record associated with the search key. A common type of pointer is a relative record number. Through the use of such record pointers, the data records themselves need not be kept in sequential order, but may be stored in random locations in the computer storage system. A search for a particular data record is enhanced by sequentially searching a compiled index of such search key records (comprising search keys and record pointers), rather than the data records themselves. However, such sequential searching is still relatively slow.
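The key index described above can be illustrated with a minimal Python sketch. The sample records, the `key_index` list and the `sequential_lookup` function are hypothetical names chosen for illustration only; the relative record number is modeled as a list position.

```python
# Data records stored in arbitrary (non-sequential) order.
records = ["record for cherry", "record for apple", "record for banana"]

# Compiled index of search key records: (search key, relative record number).
key_index = [("apple", 1), ("banana", 2), ("cherry", 0)]

def sequential_lookup(index, key):
    """Scan the index in order; return (record number or None, comparisons made)."""
    comparisons = 0
    for search_key, record_number in index:
        comparisons += 1
        if search_key == key:
            return record_number, comparisons
    return None, comparisons

record_number, comparisons = sequential_lookup(key_index, "cherry")
```

In this sketch the last key in the index costs as many comparisons as there are index entries, which is why a sequential index remains relatively slow as the file grows.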
A much more efficient method for such a key index is to create a “tree” structure, rather than a sequential file, for the key records. One such tree structure is a B-Tree, an example of which is shown in FIG. 1. The use of B-Trees to structure indexes for data files in computer storage systems is well known in the prior art. (See, for example, Knuth, The Art of Computer Programming, Volume 3, pages 473-479.)
A B-Tree consists of nodes, each of which can be a root node, a branch node or a leaf node. A branch node contains at least one search key and related pointers (such as relative addresses or node numbers) to other nodes. A leaf node contains at least one search key and a pointer to a data record. One node in the tree is the root node or starting point, which can be either a leaf node (only for a tree with a single node) or a branch node. The “height” of a tree is equivalent to the number of nodes traversed from the root node to a leaf node. Searching for a data record is accomplished by comparing a key to the contents of the root node, branching to branch nodes based on such comparisons, comparing the key to the contents of such branch nodes, and continuing “down” the height of the tree until a leaf node is reached. The key is compared to the contents of the leaf node, and one of the pointers in the leaf node is used to locate the desired data record (if one exists).
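The search procedure described above can be sketched in Python as follows. This is a simplified illustration, not a complete B-Tree implementation: the `Node` class, the `search` function, the routing convention (a key equal to a branch key goes to the right-hand child) and the sample record pointers 100 through 103 are all assumptions made for the example.

```python
from bisect import bisect_left, bisect_right

class Node:
    """Branch node: keys + children. Leaf node: keys + record pointers."""
    def __init__(self, keys, children=None, pointers=None):
        self.keys = keys            # sorted search keys
        self.children = children    # child nodes (branch nodes only)
        self.pointers = pointers    # data record pointers (leaf nodes only)

    @property
    def is_leaf(self):
        return self.children is None

def search(root, key):
    """Descend from the root to a leaf; return (record pointer or None, nodes visited)."""
    node = root
    visited = 1
    while not node.is_leaf:
        # Compare the key to the branch keys and follow the matching pointer.
        node = node.children[bisect_right(node.keys, key)]
        visited += 1
    # Compare the key to the contents of the leaf node.
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.pointers[i], visited
    return None, visited

# A tree of height 2: one root (branch) node and two leaf nodes.
left = Node(keys=[5, 10], pointers=[100, 101])
right = Node(keys=[20, 30], pointers=[102, 103])
root = Node(keys=[20], children=[left, right])
```

Here `search(root, 20)` visits two nodes (the height of the tree) and yields record pointer 102, while a key that is absent, such as 7, still reaches a leaf before the search reports that no record exists.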
In the simplest B-Tree (see FIG. 2), each node contains one search key and two associated pointers. Such a tree structure, sometimes referred to as a binary tree, theoretically provides a very efficient search method. If the number of nodes in this type of tree is equal to or less than 2^n, then only “n” comparisons are required to locate a data record pointer in any leaf node.
In practice, a simple binary tree is inefficient. Most data bases are stored on relatively slow storage systems, such as magnetic disks. The time required to access any item of data (such as a tree node) on such a storage device is dominated by the “seek” time required for the storage unit to physically locate the desired storage address. Following each seek, the contents of a node may be read into the high-speed memory of the computer system. In a simple binary tree, for each access of a node, only a two-way decision (to the left or right branch from that node) can be made, since the node contains only one search key. If, instead of containing only one search key, a node contains several search keys, then for each seek operation several keys will be read into the high-speed memory of the computer system. With one search key per node, a comparison and determination can be made that the item sought is in one half of the remainder of the tree. With “n−1” search keys per node, the search can be narrowed to “1/n” of the remainder of the tree (for example, with 9 search keys per node, a search can be narrowed to “1/10” of the remainder of the tree). This type of structure is known in the prior art as a “multi-way” tree. See FIG. 3.
It is advantageous to have as many search keys as possible per node. Thus, for each seek of a node, several search keys can be examined and a more efficient determination can be made as to the location of the next node or, in the case of a leaf node, of a data record. The height of the tree, and hence the search time, is dramatically decreased if the number of search keys per node is increased.
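The effect of the number of search keys per node on tree height can be estimated with a short Python sketch. It assumes an idealized, completely full and balanced tree in which a node with k search keys has k+1 pointers; the function name `levels_needed` and the figure of one million keys are hypothetical and chosen only for illustration.

```python
def levels_needed(total_keys, keys_per_node):
    """Smallest number of levels such that fanout**levels >= total_keys,
    where fanout = keys_per_node + 1 (idealized full, balanced tree)."""
    fanout = keys_per_node + 1
    levels = 0
    reachable = 1
    while reachable < total_keys:
        reachable *= fanout   # each extra level narrows the search by 1/fanout
        levels += 1
    return levels

total = 1_000_000
binary_levels = levels_needed(total, 1)     # one key per node (binary tree)
ten_way_levels = levels_needed(total, 9)    # 9 keys per node, 10-way branching
wide_levels = levels_needed(total, 99)      # 99 keys per node, 100-way branching
```

Under these assumptions, one million keys need about 20 levels with one key per node, 6 levels with 9 keys per node and 3 levels with 99 keys per node. Since each level costs one seek, the wider node reduces the number of seeks accordingly.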
An even later development in tree structures is the B+-Tree, which uses query values in place of actual search key values in the branch nodes and places all key values in leaf nodes, as shown in FIG. 4. (See, for example, D. Comer, “The Ubiquitous B-Tree,” ACM Computing Surveys, Vol. 11, No. 2, June 1979, pgs. 130-131.) This structure has all of the advantages of B-Trees as well as a much smaller index for accessing keys. B+-Trees use multiple query values in branch nodes, as shown in FIGS. 5 and 7, as well as multiple keys per leaf node, as shown in FIGS. 6 and 7.
The state of the art appears to be some variant of the B-Tree or the B+-Tree, as shown here as prior art.
The process of actually building a tree structure using any of the current methods results in branches becoming unbalanced, depending on the order in which keys are added to and/or deleted from the data base. When the branches are out of balance, more than 2n+1 comparisons may be required when searching for one particular key. These structures must be balanced regularly and never remain in a totally balanced condition. B+-Trees additionally may require new query values to be established while being balanced.
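The dependence of tree shape on insertion order can be illustrated with a short Python sketch. Note the substitution: this is a plain binary search tree with no rebalancing step, not a B-Tree or B+-Tree implementation, and it is used here only because it shows the order effect in its simplest form. The helper names `insert`, `build` and `height` are hypothetical.

```python
def insert(root, key):
    """Insert a key into a binary search tree without any rebalancing."""
    new = {"key": key, "left": None, "right": None}
    if root is None:
        return new
    node = root
    while True:
        side = "left" if key < node["key"] else "right"
        if node[side] is None:
            node[side] = new
            return root
        node = node[side]

def build(keys):
    """Build a tree by inserting the keys in the given order."""
    root = None
    for key in keys:
        root = insert(root, key)
    return root

def height(root):
    """Number of nodes on the longest path from the root to a leaf."""
    levels = 0
    frontier = [root] if root is not None else []
    while frontier:
        levels += 1
        frontier = [child for node in frontier
                    for child in (node["left"], node["right"])
                    if child is not None]
    return levels

sorted_order = build([1, 2, 3, 4, 5, 6, 7])   # keys added in ascending order
mixed_order = build([4, 2, 6, 1, 3, 5, 7])    # same keys, different order
```

The same seven keys produce a tree of height 7 when added in ascending order and a tree of height 3 when added in the second order, so the worst-case number of comparisons depends on the history of insertions unless the structure is rebalanced.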
The present invention solves the problem of accessing data in a database. Since all words and/or selected phrases in the database are stored as concatenated strings of their ASCII characters, any word or phrase can be found with only one access. This requires systems to have very large indexes within a database, which computer systems of the very near future are expected to support. This process may also be scaled back for use on today's systems.
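One possible reading of the single-access lookup can be sketched in Python. This is an assumption-laden illustration, not the method of the invention: the encoding chosen here (three decimal digits per ASCII code), the `ascii_key` function, the use of a Python dictionary as the index and the sample record numbers are all hypothetical.

```python
def ascii_key(text):
    """Concatenate the ASCII codes of the characters as 3-digit decimal fields.
    Raises UnicodeEncodeError if the text contains non-ASCII characters."""
    return "".join(f"{code:03d}" for code in text.encode("ascii"))

# Hypothetical index: concatenated ASCII key -> record numbers containing the entry.
index = {
    ascii_key("cat"): [3, 17],
    ascii_key("data base"): [8],
}

def lookup(word_or_phrase):
    """Find a word or phrase with a single index access (no tree traversal)."""
    return index.get(ascii_key(word_or_phrase), [])

cat_key = ascii_key("cat")
cat_records = lookup("cat")
```

In this sketch the word "cat" maps to the key "099097116", and the lookup is a single dictionary access regardless of how many entries the index holds. The cost is the size of the index itself, which grows with every distinct word or phrase stored.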