1. Field of the Invention
The present invention relates to data base management, and in particular to computer systems and methods for indexing and retrieving individual records stored in a data base.
2. The Prior Art
Over the last two decades, advancements in the art of computerized information storage and retrieval have significantly expanded man's capability for efficiently accessing information. Improvements in integrated circuit manufacturing technology have made it possible to greatly reduce the size of computer components while at the same time expanding storage capacity. These advancements in technology have led to what some have called an "information explosion," since now more than ever before vast resources of information are increasingly available through access to large computer data bases. However, in reality this "information explosion" is more accurately a "data explosion" since it is the amount of data (i.e., computer-stored information), more than the information or knowledge itself, which is rapidly increasing. Thus, with the rapidly increasing improvements in integrated circuit technology and computer design, one of the significant challenges facing the computer industry has been the development of data base management systems which permit efficient utilization of a computer-stored data base.
The efficiency of computerized information storage and retrieval systems is directly related to how efficiently the data base can be searched and how quickly records from the data base can be retrieved. Especially where a data base is massive, comprehensive indexing of the data base is normally so difficult and time consuming that it is simply not done. Large amounts of computer time may be taken up in having to search an index to find data corresponding to a given request. This is especially so where the data base and data base index require storage on a secondary storage device such as a hard disc drive, which typically works a thousand times slower than a central processing unit. The result is that in computer systems where a massive data base is stored, typically the computer system is "I/O bound." That is to say, the speed of the system is limited by the speed of the disc drives. The CPU is not used to its full capacity because it must wait for long periods of time for the information to be found and retrieved before it can be processed.
One approach which has been used in the prior art to retrieve information from a computer-stored data base is the use of lists. Such lists are generated by searching through the data base for classes of information corresponding to a given attribute. Using this technique, long lists of information are generated, each list corresponding to a particular attribute. Since some of the information in the data base may correspond to more than one attribute, data is often redundantly stored in several lists. In order to then retrieve information which corresponds to a given combination of attributes, the computer must compare each item in one list with each of the items in the other lists. This approach is not particularly efficient because it requires substantial storage capacity just for the lists and results in slow retrieval times.
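The list comparison described above can be illustrated with a minimal sketch. The attribute names, record numbers, and the function `match_by_comparison` are hypothetical and chosen only for illustration; the sketch simply counts how many comparisons are needed when every item in one list is compared with every item in another.

```python
# Hypothetical attribute lists: each holds the record numbers having that attribute.
# Records 8 and 42 appear in both lists, i.e. they are stored redundantly.
surname_smith = [3, 8, 15, 42, 77]
born_1850 = [8, 19, 42, 60]

def match_by_comparison(list_a, list_b):
    """Compare each item in list_a with each item in list_b.

    Returns the matching record numbers and the number of comparisons made,
    which is always len(list_a) * len(list_b).
    """
    comparisons = 0
    matches = []
    for a in list_a:
        for b in list_b:
            comparisons += 1
            if a == b:
                matches.append(a)
    return matches, comparisons

matches, comparisons = match_by_comparison(surname_smith, born_1850)
print(matches, comparisons)
```

Because the number of comparisons is the product of the list lengths, two lists of one million entries each would call for on the order of a trillion comparisons under this scheme, which is consistent with the slow retrieval times noted above.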
Other prior art approaches which are used for information storage and retrieval have attempted to render the use of lists more efficient by combining pointers with each list. In this approach, each item in a generated list is provided with an encoded instruction which points to the location of the next item, whether it is in the same or an adjacent list. This may help somewhat to eliminate redundant storage of information, and may therefore help to reduce to some extent the size of the index, but searching the index with this type of approach is still difficult, and retrieval and accessing of information from the data base are still greatly slowed.
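One possible reading of the list and pointer approach is sketched below. It is an assumption made for illustration, not a description of any particular prior art system: each record is stored once and carries, for each attribute it has, the position of the next record sharing that attribute. The names `records`, `heads`, and `follow` are hypothetical.

```python
# Each record is stored once. "next" maps an attribute to the position of the
# next record having that attribute; a missing entry ends the chain.
records = [
    {"id": 3,  "next": {"smith": 1}},
    {"id": 8,  "next": {"smith": 3, "born_1850": 2}},
    {"id": 19, "next": {"born_1850": 3}},
    {"id": 42, "next": {"born_1850": 4}},
    {"id": 60, "next": {}},
]

# Position of the first record in each attribute chain.
heads = {"smith": 0, "born_1850": 1}

def follow(attribute):
    """Walk the pointer chain for one attribute and collect record numbers."""
    ids = []
    pos = heads[attribute]
    while pos is not None:
        ids.append(records[pos]["id"])
        pos = records[pos]["next"].get(attribute)
    return ids

print(follow("smith"))
print(follow("born_1850"))
```

In this sketch redundancy is reduced because no record is copied into more than one list, but every chain must be followed one pointer at a time. If the records reside on a disc drive, each step may require a separate access, which is in keeping with the slow retrieval described above.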
For example, in a data base having about 80 million records with each record requiring approximately one kilobyte of storage, the entire data base will require approximately 80 gigabytes of storage capacity. Using present state-of-the-art technology, storage would typically be provided on approximately one hundred eighty hard disc drives of one-half gigabyte each. Significantly, using the prior art list or list and pointer type indexing systems, the index to the data base would itself be so large that it would require 80 to 90 gigabytes of storage capacity. Thus, the size of the index makes its use prohibitive, and typically information would be accessed only by a single, very specific key. For example, if the data base consisted of genealogical records, to identify an individual one would typically have to know the full name, place and birth date of the individual. As can be appreciated, it would be much more advantageous to be able to do research using less specific keys. However, that becomes extremely difficult due to the large size of the data base and the long retrieval times associated with searching and accessing information in the data base.
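The storage figures in the foregoing example can be checked with a short calculation. The sketch assumes decimal units (one kilobyte taken as 1,000 bytes and one-half gigabyte as 500 million bytes); these unit choices and the variable names are assumptions made only for illustration.

```python
RECORDS = 80_000_000
BYTES_PER_RECORD = 1_000        # "approximately one kilobyte" (decimal, assumed)
DRIVE_BYTES = 500_000_000       # one-half gigabyte per drive (decimal, assumed)

database_bytes = RECORDS * BYTES_PER_RECORD
database_gb = database_bytes / 1e9

# Number of drives needed for the raw record data alone.
raw_drive_count = database_bytes / DRIVE_BYTES

# Size of an 80 to 90 gigabyte index relative to the data it indexes.
index_ratio_low = 80e9 / database_bytes
index_ratio_high = 90e9 / database_bytes

print(database_gb, raw_drive_count, index_ratio_low, index_ratio_high)
```

Under these assumptions the raw data occupies 80 gigabytes, or 160 drives; the figure of approximately one hundred eighty drives given above presumably allows for some capacity beyond the raw data, although the example does not say so. The calculation also shows that an index of 80 to 90 gigabytes would be as large as, or larger than, the data base itself.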
More recently, attempts to solve the problem of efficient data base management have led to the development and use of hierarchal trees. See, for example, U.S. Pat. No. 4,318,814 issued Mar. 2, 1982, to Millett et al. In this type of system a hierarchal tree is developed consisting of various levels, each level having a plurality of nodes. The nodes at each level are used to classify the attributes which are to be used for searching the records of a data base. At the terminal nodes of the tree, i.e., those points in the tree where the branches terminate, typically a group of data base records is identified. Such records correspond to the classification for the attributes represented at each preceding node back through the hierarchal structure of the tree. By traversing the tree one may choose different attributes at each level and may assign a "1" or a "0" bit at each node of the tree, depending upon whether the particular attribute for that node is selected. In this manner, each time the tree is searched for a particular combination of attributes a specific bit string may be generated representing one or more paths of the tree. The bit string corresponding to the searched attributes may then be stored for later reference and may be combined with the bit string for other attributes when attempting to identify records corresponding to particular combinations of attributes.
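The combining of bit strings may be illustrated by the following simplified sketch. It assumes, purely for illustration, a tree with eight terminal nodes and a bit string having one bit per terminal node; the actual encoding used in the cited patent is not described here, and the names `terminal_groups`, `bit_string`, `combine_and`, and `records_for` are hypothetical.

```python
# Hypothetical tree with 8 terminal nodes, each identifying a group of records.
terminal_groups = {
    0: [101, 102], 1: [103], 2: [104, 105], 3: [106],
    4: [107], 5: [108, 109], 6: [110], 7: [111, 112],
}

def bit_string(selected_nodes, width=8):
    """Return a string with "1" for each selected terminal node (leftmost = node 0)."""
    return "".join("1" if i in selected_nodes else "0" for i in range(width))

def combine_and(a, b):
    """Combine two bit strings so that only nodes selected in both remain set."""
    return "".join("1" if x == "1" and y == "1" else "0" for x, y in zip(a, b))

def records_for(bits):
    """Collect the record groups at every terminal node whose bit is "1"."""
    found = []
    for node, bit in enumerate(bits):
        if bit == "1":
            found.extend(terminal_groups[node])
    return found

surname_bits = bit_string({1, 2, 5})   # result of an earlier search, stored for reuse
place_bits = bit_string({2, 5, 7})     # result of a second search
both = combine_and(surname_bits, place_bits)
print(both, records_for(both))
```

In this sketch the combination step operates only on the stored bit strings, without revisiting the records. Each set bit, however, still identifies a whole group of records at a terminal node rather than a single record.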
This particular method has proven to be much more efficient than the standard list or list and pointer approaches. However, the use of such a hierarchal tree containing logically classified attributes is not without its limitations. One of the problems which arises when using this approach is that its efficiency is reduced as the size of the data base begins to grow. Since each terminal node of the hierarchal tree typically represents a group of records, as additional records are added at the terminal nodes one may find that eventually the group of records identified at the terminal nodes of the tree becomes so large that it becomes difficult to search that group of records, even though the group itself can be located fairly quickly using the structure of the hierarchal tree to search the logically classified attributes that are used to identify that group of records.
Another problem is that as the size of the data base grows, it may be necessary to include additional attributes, which means that the classification scheme represented by the tree must be restructured to accommodate the additional attributes. This process is difficult, and fairly large amounts of storage capacity are required to actually store the hierarchal tree containing the logically classified attributes. Accordingly, while the type of index which relies upon a hierarchal tree having logically classified attributes associated at each level of the tree may work well for a data base which is not too large and which does not require rapid growth or frequent modification of the classification scheme, it nevertheless has definite limitations when used with a massive data base or one in which the number of records of the data base is growing at a rapid rate.
Increasingly, data bases are becoming large and are growing at ever faster rates. For example, in some cases it is not uncommon for a data base to increase on the order of several million records per year. Accordingly, what is needed in the art is a computer system and method which can be used to more efficiently index a massive and rapidly growing data base so that records can be randomly added to the data base without having to restructure the index and where the index can be structured so as to require relatively little storage capacity and yet still permit highly flexible and rapid searching and retrieval of individual records.