The need to merge data sets arises in a variety of applications. For example, to update the content of an encyclopedia database installed on client computers, the client computers may periodically connect online (e.g., over the Internet) with a designated server to obtain additional data reflecting changes and additions to the original data in the encyclopedia. After the client computers download the updates, it would be desirable to provide a merged database that includes both the original information in the encyclopedia database and the new information downloaded as updates. By providing a merged data set, a user of the encyclopedia program can efficiently search a single set of data and can browse the data in order, e.g., alphabetically. If the data sets are not merged, the user must browse the updates separately from the original encyclopedia data.
One approach to solving this problem directly integrates the data sets (i.e., the original data set and the one or more new data sets) to form a merged data set, and is sometimes referred to as producing a physical merge of the data sets. In this approach, the data within each set are compared with the data in the one or more other sets to determine the relative ordered position of each datum from all sets involved in the merge. The data from all sets are then interleaved and stored together in a final merged set in which all of the data entries are correctly ordered. This approach requires sufficient memory resources to store each original data set and the final merged set, and a relatively fast processor to process the data in the sets. For very large ordered sets, the required memory may exceed the available memory. Furthermore, it may be impractical to merge the data sets on a computer that has the required memory resources and then transfer the resulting merged data set to a computer on which the merged data set will be used, but which does not have the required memory or processing resources. Communication of a very large merged data set to a remote site, such as from a server to a client computer over the Internet or other network, often requires a substantial amount of time, even with a relatively high bandwidth connection between the server and client. The required time will typically not be acceptable to a user on a client computer, particularly if the entire merged data set must be transmitted after a relatively small data set is merged with a substantially larger original data set that is already stored on the client computer.
Of course, if the client computer has the original data set and has sufficient memory resources to load the entire original set and the new data sets into memory, it may be necessary to transmit only the smaller new data set to the client computer from the server. The client computer can then perform a physical merge of the data sets. However, even such a local merge often requires an undesirably long time, because each datum in the new data set must be compared with data in the original data set to determine the correct position of each new datum in the merged data set.
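The physical merge described above can be illustrated with a minimal sketch. It assumes each data set is already sorted and small enough to hold in memory; the function name `physical_merge` and the sample entries are hypothetical and chosen only for illustration.

```python
import heapq


def physical_merge(original, *updates):
    """Interleave already-sorted sequences into one new sorted list.

    Hypothetical helper for illustration only.
    """
    # heapq.merge repeatedly compares the next datum of each sorted input
    # to decide which one comes next in the merged order.
    # list() stores the complete merged copy, so the inputs and the
    # merged result all occupy memory at the same time.
    return list(heapq.merge(original, *updates))


original = ["apple", "cherry", "mango"]   # data already on the client
update = ["banana", "kiwi"]               # smaller downloaded data set
merged = physical_merge(original, update)
```

The merged list holds as many entries as all of the inputs combined, which is the storage cost that becomes a problem for very large ordered sets.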
A second approach typically used to address this problem creates a metadata mapping for the data of each set, and is sometimes referred to as creating a virtual merge, or a virtual database. The data within each smaller new set must still be compared with the data in the original larger data set to determine the relative position of each datum. However, the data in the original data set and each new data set are not stored together in a final merged set. Instead, a schema, or other mapping, is used to associate each datum with its relative position in the virtual merged data set. In this case, each data set is typically maintained in its original form, and the schema maps the original data to the virtually merged database. However, for very large ordered sets, the additional metadata required for mapping a virtual database may strain computing, memory, and communication resources as much as a physical merge. In the case of very large homogeneous data sets, which share the same data structure, it is desirable to update or otherwise merge the data sets without a physical merge and without adding a complex schema of metadata to map all of the original data into a virtual database. A new approach is thus required that produces a virtual database in which the data sets are merged, while avoiding the problems noted above.
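The conventional virtual merge discussed above can be sketched as follows. This is only an illustration under stated assumptions: each data set is a sorted in-memory list, and the mapping is a simple list of (set index, offset) pairs. The names `build_virtual_map` and `virtual_get` are hypothetical and do not come from any particular system.

```python
import heapq


def _tag(data_set, set_index):
    # Yield each datum together with the set it came from and its offset there.
    for offset, value in enumerate(data_set):
        yield value, set_index, offset


def build_virtual_map(*data_sets):
    """Return one (set_index, offset) pair per datum, in merged order."""
    streams = [_tag(ds, i) for i, ds in enumerate(data_sets)]
    # The data are compared to find their relative order, but only the
    # locations are recorded; the data themselves are never copied.
    return [(set_index, offset)
            for _value, set_index, offset in heapq.merge(*streams)]


def virtual_get(data_sets, mapping, position):
    """Look up the datum at a position of the virtual merged set."""
    set_index, offset = mapping[position]
    return data_sets[set_index][offset]


original = ["apple", "cherry", "mango"]
update = ["banana", "kiwi"]
mapping = build_virtual_map(original, update)
browsed = [virtual_get((original, update), mapping, i)
           for i in range(len(mapping))]
```

Each data set stays in its original form, but the mapping carries one entry for every datum in every set, including the unchanged original data. That per-datum metadata is the overhead that can grow to rival a physical merge when the sets are very large.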