1. Field of the Invention
The present invention relates to storing information and, more particularly, to a tree configuration by which an indexing data base is stored in streaming memory and accessed.
2. The Prior Art
In many computer applications, large amounts of information must be stored and accessed. Generally, during the process of deciding how this information is to be stored, a tradeoff must be made between time and memory. The time variable includes the amount of time necessary to store information, to locate a particular piece of information, and to recreate the information once located. The memory variable includes the amount of memory necessary to store the information and to store and execute the software necessary to store, locate, and recreate the information.
There are actually two time/memory issues related to storing information, the first issue being how the information itself is stored in an information data base and the second issue being how a particular item of information is found within the information data base. The simplest way to store information is linearly, that is, information is stored in data memory as it is received and is not modified or compressed in any way. In such a system, a given amount of information occupies a proportional amount of data memory. The main advantage of such a system is that the amount of time needed to store the information is minimized. The main disadvantage is that data memory requirements and the time needed to retrieve the information grow in direct proportion to the amount of information stored.
The simplest way to find a particular item of information is to linearly search the entire information data base for the item until it is found. This method is advantageous in that it is simple to implement, but the amount of time needed to find particular information is unpredictable in the extreme and the average time to find a particular piece of information can be unduly great.
An alternate method for finding information is to use a keyword data base, also called an index. The index is stored in memory separate from the information data base. Each keyword of the index includes a set of pointers that points to one or more locations in the information data base that correspond to that keyword. Thus, rather than searching a large information data base for particular items of data, an index is searched for keywords, typically greatly reducing the search time.
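The keyword index described above can be illustrated with a minimal sketch. The function name `build_index` and the sample records are hypothetical and serve only as an illustration; positions in a Python list stand in for locations in the information data base.

```python
from collections import defaultdict

def build_index(records):
    """Map each keyword to the positions of the records that contain it."""
    index = defaultdict(list)
    for position, text in enumerate(records):
        # set() ensures each record is listed at most once per keyword
        for word in set(text.lower().split()):
            index[word].append(position)
    return index

# Hypothetical information data base of three records.
records = ["red apple", "green apple", "red brick"]
index = build_index(records)

# Looking up a keyword yields pointers without scanning every record.
red_pointers = index["red"]
```

The lookup touches only the index entry for the keyword, whereas a linear search would examine every record.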
The simplest index structure is an alphabetic list of the data items, with each item followed by the appropriate pointers. The disadvantage of such a structure is that, in order to find any particular data item, the list must be searched from the beginning, leading to a potentially long search time. There are ways of decreasing the search time, such as by fixing the size of each index entry or by creating another list of pointers to each item in the index. Each increases the amount of memory needed, requiring a time/memory tradeoff.
An alternate structure for decreasing search time of an index data base is a tree structure, which consists of a group of related nodes, each node containing a subset of the stored data items, where the relationship between the nodes defines the data items. Each unique data item is stored as a set of linked nodes. In one tree structure, such as that described in U.S. Pat. Nos. 5,488,717 and 5,737,732, common parts of data items are combined into single nodes, followed by nodes containing the unique parts of the data items. The node containing the first part of the data item is called the root node and is generally common to more than one data item. The node containing the last part of the item is called the leaf node and is unique for each data item. The data base is searched from the root node for a known data item. When the search reaches the leaf node for that data item, a pointer or other identifier in the leaf node is used to locate the data item in the information data base.
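A tree index of this general kind can be sketched as follows. This is a simplified illustration that stores one data unit per node in nested dictionaries, not the structure of the cited patents; the names `trie_insert` and `trie_find`, and the use of the `None` key to hold the pointers at the end of a data item, are assumptions made for the sketch.

```python
def trie_insert(root, item, pointer):
    """Add item to the tree, sharing nodes with items that begin the same way."""
    node = root
    for unit in item:
        node = node.setdefault(unit, {})
    # The None key marks the end of an item and holds its pointers.
    node.setdefault(None, []).append(pointer)

def trie_find(root, item):
    """Return the pointers stored for item, or None if item is absent."""
    node = root
    for unit in item:
        if unit not in node:
            return None
        node = node[unit]
    return node.get(None)

root = {}
trie_insert(root, "abbie", 10)    # 10 and 20 are hypothetical locations
trie_insert(root, "adamant", 20)  # in the information data base
```

Both items share the single node for the first unit, so `root` holds only one entry.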
The memory in which the index data base is stored has two forms, primary storage and secondary storage. Primary storage is typically the local random-access memory, or RAM. Secondary storage is typically a disk drive or other mass storage device. The significant differences between the two are that primary storage is much smaller and much faster than secondary storage. For example, current personal computers typically have 64 Mbytes of RAM and 10 Gbytes of disk storage space, a difference of a factor of more than 150. Further, access to RAM is, on average, more than 100,000 times faster than access to disk.
The typical index data base that is so large that linear storage is out of the question is also far too large to fit completely into primary storage. Consequently, most of the data base resides on disk, which means that the vast majority of search time is taken up by reading data from disk, not by the processing time needed to find the object of the search.
Additionally, most tree structures require that new data items be added individually, which means that the vast majority of the time spent on inversion, the process of adding data to a tree, is taken up by writing the updated tree to disk following each data item inversion. Typically, a trade-off must be made between inversion time and search time. When selecting a tree structure of the prior art, one must decide whether inversion time or search time is to be minimized, because none of the tree structures of the prior art provides both fast inversion time and fast search time.
Thus, there continues to be a need for a data structure for indexes that provides for heavy concentration of data, rapid and predictable information location times, rapid inversion times, and that is easily adapted to the physical structure of secondary storage media.
An object of the present invention is to provide a data base structure that reduces secondary storage accesses during both the process of adding data items and searching for data items.
Another object is to provide a data base structure that minimizes secondary storage accesses while maximizing storage usage.
The essence of the streaming metatree (SMTree) of the present invention is its data storage structure. A typical prior art data structure uses randomly located nodes that point to the next succeeding node of the data item. Thus, nodes are followed by pointers to find a particular data item, jumping randomly between primary and secondary storage. On the other hand, the SMTree structure of the present invention stores nodes in a logically linear fashion, hence the term “streaming”. Pointers are used in the present invention, although not in the same way as in the data structures of the prior art. Physical computer memory is composed of fixed-size blocks, throughout which the SMTree is distributed. Since the blocks can be randomly located in a physical sense, pointers are required. However, the data structure is such that a memory block will only be traversed at most one time during a search. Information stored within the node indicates its relationship to the other nodes and within the tree hierarchy. The SMTree structure is particularly suited for indexing-type structures.
There are two basic embodiments of the SMTree, a “horizontal” embodiment and a “vertical” embodiment, and a hybrid embodiment that uses components of both. The horizontal embodiment is most preferred because it is more efficient for the vast majority of applications.
Logically, the SMTree is composed of lists of alternative nodes. The root alternate list is a list of all data units that begin data items in the SMTree. When searching for a data item in the SMTree, the appropriate data unit from the root alternate list is found and followed to the next lower level alternate list. This continues until the leaf node for the data item being searched for is reached. Following the leaf node is at least one identifier that references external objects.
Note that every data unit is considered to be part of an alternate list, and that many data units are in alternate lists that have only one member. In such a case, groups of single member alternate lists are combined into single nodes. For example, if the SMTree contains the two data items, “abbie” and “adamant”, the lower level alternate list from the root alternate list member ‘a’ will have as members ‘b’ and ‘d’. Logically, there will be three single member alternate lists following ‘b’, the first containing the member ‘b’, the second ‘i’, and the third ‘e’. In order to save memory and processing time, the single member lists are combined into a node with the last member of an alternate list of more than one member. In this example, the nodes “bbie” and “damant” are the two members of the alternate list following ‘a’.
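The combining of single member alternate lists into nodes can be sketched as follows. The function name `build_alternate_list` and the representation of an alternate list as a Python list of (segment, lower-level alternate list) pairs are assumptions made for illustration. The sketch further assumes that no data item is a prefix of another data item.

```python
def build_alternate_list(items):
    """Return an alternate list as [(segment, lower_level_alternate_list), ...]."""
    groups = {}
    for item in items:
        if item:
            # Group the remainders of the items by their first data unit.
            groups.setdefault(item[0], []).append(item[1:])
    alt_list = []
    for unit in sorted(groups):
        segment, rests = unit, groups[unit]
        # While the next alternate list would have a single member,
        # absorb that member into the current node.
        while all(rests) and len({rest[0] for rest in rests}) == 1:
            segment += rests[0][0]
            rests = [rest[1:] for rest in rests]
        alt_list.append((segment, build_alternate_list(rests)))
    return alt_list

tree = build_alternate_list(["abbie", "adamant"])
```

For the two data items of the example, the result has a root alternate list with the single member 'a', followed by an alternate list with the two combined nodes "bbie" and "damant".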
In the horizontal embodiment, the nodes and identifiers are stored linearly and sequentially in memory, from the first member of the root alternate list to the leaf node of the last data item, where the first physical sequence of nodes defines the first data item of the SMTree. In order to determine where the nodes fit in the SMTree, each node has a header. The header includes (1) the size of the node so that it is known how many memory locations to skip if the node is not needed, (2) a flag indicating whether or not the node is a leaf node so it is known that the search is at an end, (3) a flag indicating whether or not the node is the last node of an alternate list, and (4) a flag indicating the presence of a memory block pointer. If the memory block pointer flag is set, the node is followed by a memory block pointer.
The present invention includes a mechanism for dividing the SMTree into subtrees to fit into multiple memory blocks. This mechanism includes the memory block pointer flag, the memory block pointer, and a subtree header that includes a previous data unit value. When a memory block is jumped to, it must be determined which subtree in the block is the next to be traversed during the search. The previous data unit value contains the first data unit of the alternate list member node from which the jump took place. After the jump, these previous units are compared to determine the appropriate subtree to continue with. Implicit within this mechanism is that all subtrees in a block must be pointed to from members of the same alternate list, because this is the only way that all of the previous units in a memory block can be guaranteed to be unique. On the positive side, it also guarantees that a block will only be traversed once during a search.
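The four items of node header information can be sketched as a single packed byte. The bit positions, the five-bit size field, and the names `pack_header` and `unpack_header` are assumptions made for illustration only; no particular header layout is implied.

```python
# Hypothetical layout: three flag bits above a five-bit size field.
LEAF, LAST_ALT, BLOCK_PTR = 0x80, 0x40, 0x20
SIZE_MASK = 0x1F

def pack_header(size, leaf=False, last_alt=False, block_ptr=False):
    """Pack the node size and the three flags into one header byte."""
    if not 1 <= size <= SIZE_MASK:
        raise ValueError("size does not fit the size field of this sketch")
    return (size
            | (LEAF if leaf else 0)
            | (LAST_ALT if last_alt else 0)
            | (BLOCK_PTR if block_ptr else 0))

def unpack_header(byte):
    """Return (size, is_leaf, is_last_alternate, has_block_pointer)."""
    return (byte & SIZE_MASK,
            bool(byte & LEAF),
            bool(byte & LAST_ALT),
            bool(byte & BLOCK_PTR))

# Header for a four-unit leaf node that is not last in its alternate list.
header = pack_header(4, leaf=True)
```

A search reads the size to know how many memory locations to skip and reads the flags to know whether the search is at an end, whether more alternates follow, and whether a memory block pointer follows the node.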
The present invention also provides two mechanisms for traversing backwards through the SMTree. In the first, each block has a header that includes a pointer to the higher-level block that contains the alternate list with pointers to the subtrees in the lower-level block. The subtree previous unit is used to determine which alternate list member points to the lower-level block. In the second method, each forwardly traversed block is pushed onto a stack and is popped from the stack when traversing backwards.
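The second, stack-based method of backward traversal can be sketched as follows. The class name `BlockCursor` and the use of strings as block identifiers are hypothetical; a real implementation would hold memory block addresses.

```python
class BlockCursor:
    """Track the current memory block and the blocks traversed to reach it."""

    def __init__(self, root_block):
        self.current = root_block
        self._visited = []          # stack of forwardly traversed blocks

    def jump(self, block):
        """Follow a memory block pointer to a lower-level block."""
        self._visited.append(self.current)
        self.current = block

    def back(self):
        """Return to the block from which the current block was reached."""
        self.current = self._visited.pop()

cursor = BlockCursor("block-0")
cursor.jump("block-7")
cursor.jump("block-3")
cursor.back()                       # current block is "block-7" again
```

The stack trades a small amount of primary storage for not having to store a higher-level block pointer in every block.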
The SMTree is traversed using the node header information. The search begins with finding the first unit of the data item being searched for in the root alternate list. Nodes are skipped, using the node size value, to reach each member of the root alternate list until the matching member is found. The node in the stream following the matched node is the first node of the next lower-level alternate list. The same procedure is followed with this alternate list as with the root alternate list. If a matching node is a leaf node, but the searched-for data item is not yet found, then the searched-for data item is not in the data base. After the searched-for data item is found, the identifier(s) that follow the leaf node are used as necessary.
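The traversal just described can be sketched over a small in-memory stream. Everything about the encoding here is an assumption made for illustration: a one-byte header with hypothetical flag bits, a single one-byte identifier after each leaf node, no memory block pointers, and no data item that is a prefix of another. To reach the next member of an alternate list, the sketch steps over a node together with all nodes below it, using the size values and flags.

```python
LEAF, LAST_ALT, SIZE_MASK = 0x80, 0x40, 0x1F   # hypothetical header bits

def node(segment, leaf=False, last=False, ident=0):
    """Encode one node: header byte, data units, then an identifier if leaf."""
    header = len(segment) | (LEAF if leaf else 0) | (LAST_ALT if last else 0)
    body = bytes([header]) + segment.encode("ascii")
    return body + (bytes([ident]) if leaf else b"")

def skip_subtree(stream, pos):
    """Return the position just past the node at pos and all nodes below it."""
    header = stream[pos]
    pos += 1 + (header & SIZE_MASK)
    if header & LEAF:
        return pos + 1                       # step over the identifier byte
    while True:                              # step over the lower-level list
        child_header = stream[pos]
        pos = skip_subtree(stream, pos)
        if child_header & LAST_ALT:
            return pos

def search(stream, item):
    """Return the identifier stored for item, or None if it is absent."""
    data, matched, pos = item.encode("ascii"), 0, 0
    while True:
        header = stream[pos]
        size = header & SIZE_MASK
        if stream[pos + 1:pos + 1 + size] == data[matched:matched + size]:
            matched += size
            pos += 1 + size                  # next node starts the lower list
            if header & LEAF:
                return stream[pos] if matched == len(data) else None
            if matched == len(data):
                return None                  # item ends before any leaf node
        elif header & LAST_ALT:
            return None                      # no member of the list matched
        else:
            pos = skip_subtree(stream, pos)  # move to the next alternate

# Stream holding "abbie" (identifier 1), "adamant" (2), and "bob" (3).
stream = (node("a")
          + node("bbie", leaf=True, ident=1)
          + node("damant", leaf=True, last=True, ident=2)
          + node("bob", leaf=True, last=True, ident=3))
found = search(stream, "adamant")
```

The search only ever moves forward through the stream, which is the property that allows a memory block to be read at most once per search.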
Two mechanisms are contemplated for inversion, the process of adding data items to an SMTree. In the first, each new data item is added directly to the main SMTree. In the second, new data items are added to a temporary SMTree until the temporary SMTree becomes too large, and then the temporary SMTree is merged with the main SMTree. Merging two SMTrees is a matter of merging the root alternate lists of the two SMTrees.
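The second inversion mechanism can be sketched as follows. For brevity the sketch uses nested dictionaries with one data unit per node rather than the streamed node layout, and the names `add`, `merge`, and `BufferedIndex`, as well as the item-count limit, are assumptions made for illustration.

```python
def add(tree, item, ident):
    """Add one data item to a tree held as nested dictionaries."""
    node = tree
    for unit in item:
        node = node.setdefault(unit, {})
    node.setdefault(None, []).append(ident)   # None key holds identifiers

def merge(main, temp):
    """Merge alternate list temp into main, recursing on shared members."""
    for unit, sub in temp.items():
        if unit is None:
            main.setdefault(None, []).extend(sub)
        elif unit in main:
            merge(main[unit], sub)
        else:
            main[unit] = sub

class BufferedIndex:
    """Collect new items in a temporary tree and merge them in batches."""

    def __init__(self, limit):
        self.main, self.temp = {}, {}
        self.limit, self.pending, self.merges = limit, 0, 0

    def add(self, item, ident):
        add(self.temp, item, ident)
        self.pending += 1
        if self.pending >= self.limit:         # temporary tree is too large
            self.flush()

    def flush(self):
        merge(self.main, self.temp)
        self.temp, self.pending = {}, 0
        self.merges += 1

index = BufferedIndex(limit=2)
for ident, item in enumerate(["abbie", "adamant", "abbot"], start=1):
    index.add(item, ident)
```

After the three additions, the main tree has been updated once rather than three times, and the third item is still waiting in the temporary tree.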
In the trees of the prior art, data items are added individually directly to the main tree, necessitating secondary storage accesses for each new data item. Since the SMTree of the present invention stores new data items in a temporary SMTree in fast primary storage before merging with the main SMTree in slower secondary storage, substantial savings in inversion time are realized, and these savings are achieved without a corresponding increase in search time. The structure of the SMTree provides additional savings in inversion time by being particularly suited to being operated on by multiple processors. The alternate lists of the two SMTrees to be merged are divided between the processors at a point in the alternate list where the secondary storage between the two parts of the alternate list does not overlap. Since these different parts of an alternate list are completely independent of each other, the processors can operate completely independently.
The basic mechanism for adding a data item consists of traversing the SMTree searching for the new data item until a data unit of the existing data item differs from the new data item. Then a new node containing the different unit and all subsequent units is added to the alternate list following the last common unit. The mechanism is actually much more complex due to the fact that the SMTree is spread among numerous memory blocks, so that the size of the subtrees must be taken into account. If a subtree becomes too large for the memory block, it must be split into two subtrees or moved to another memory block.
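The basic mechanism can be sketched on an in-memory tree, leaving aside memory blocks and subtree sizes entirely. The dictionary keys `segment`, `children`, and `ident` and the function name `insert` are hypothetical, and the sketch assumes non-empty data items, none of which is a prefix of another.

```python
def insert(alt_list, item, ident):
    """Add item to the tree rooted at alt_list, splitting a node if needed."""
    for node in alt_list:
        seg = node["segment"]
        if seg[0] != item[0]:
            continue                         # try the next alternate
        n = 1                                # count the common data units
        while n < min(len(seg), len(item)) and seg[n] == item[n]:
            n += 1
        if n == len(item) or (n == len(seg) and not node["children"]):
            raise ValueError("sketch assumes no item is a prefix of another")
        if n < len(seg):
            # The item differs inside this node: keep the common units here
            # and move the remaining units into a new lower-level list.
            tail = dict(node, segment=seg[n:])
            node.update(segment=seg[:n], children=[tail], ident=None)
        insert(node["children"], item[n:], ident)
        return
    # No alternate begins with the differing unit: add a new leaf node
    # holding that unit and all subsequent units.
    alt_list.append({"segment": item, "children": [], "ident": ident})

root = []
insert(root, "abbie", 1)
insert(root, "adamant", 2)
```

After the second insertion the original node "abbie" has been split into 'a' followed by an alternate list whose members are "bbie" and "damant".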
A data item is removed by searching for the leaf node of the data item to be removed, and removing nodes in reverse order up to and including the node that is a member of a multiple-member alternate list. Optionally, any resulting single member alternate list can be compacted.
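Removal with the optional compaction can be sketched on the same kind of in-memory tree. The dictionary keys `segment`, `children`, and `ident` and the function name `remove` are hypothetical, and memory blocks are again left aside.

```python
def remove(alt_list, item):
    """Remove item from the tree rooted at alt_list; return True if found."""
    for i, node in enumerate(alt_list):
        seg = node["segment"]
        if not item.startswith(seg):
            continue
        rest = item[len(seg):]
        kids = node["children"]
        if not kids:                         # leaf node
            if rest:
                return False                 # item is longer than stored item
            del alt_list[i]
            return True
        if not rest or not remove(kids, rest):
            return False
        if not kids:
            del alt_list[i]                  # nothing left below this node
        elif len(kids) == 1:
            # Compact the resulting single member alternate list.
            only = kids[0]
            node.update(segment=seg + only["segment"],
                        children=only["children"], ident=only["ident"])
        return True
    return False

tree = [{"segment": "a", "ident": None, "children": [
    {"segment": "bbie", "children": [], "ident": 1},
    {"segment": "damant", "children": [], "ident": 2}]}]
removed = remove(tree, "abbie")
```

After "abbie" is removed, the alternate list following 'a' has a single member, so the sketch folds it back into one node holding "adamant".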
In summary, the present invention is a data structure adapted for use with a computer for storing data items in a data base, where each of the data items is composed of at least one data segment and each data segment is composed of at least one data unit including a first data unit. The data structure of the present invention comprises (a) a linear stream of nodes, each of the nodes containing a data segment, the nodes including predecessor nodes and successor nodes and being related by location in the stream; (b) progressions of nodes in the stream from a root alternate list node to a leaf node corresponding to progressions of data segments of the data items; (c) the progressions being traversed from node information included in each of the nodes, the node information including a size value indicating the number of data units in the node, a leaf node flag indicating if the node is a leaf node, and a terminal alternate flag indicating if the node is an alternate list terminal node; (d) each of the progressions being associated with at least one identifier that references at least one object external to the stream; (e) selected data segments of different data items in selected progressions of successor nodes being different for different data items; and (f) selected data segments of different data items in selected predecessor nodes being common to said different data items.