Field of the Invention
The present invention relates generally to the field of computing. Embodiments of the present invention relate to a method for efficient merging, storage and retrieval of incremental data.
Discussion of the Background
As datasets continue to grow in size, mechanisms are needed to archive and retrieve data efficiently. One such mechanism, called snapshots, consists of (a) taking a base, or initial, archive of the data, (b) creating a list of modifications made to the data within a period of time (called an epoch), (c) storing the snapshot, which consists of an index (that includes the list) and the corresponding modified data, (d) resetting the list to make it empty, and (e) incrementing the epoch and starting the next epoch at step (b). This sequence of actions is depicted in steps 102-110 of the flowchart 100 of FIG. 1.
Data is typically stored in multiples of some fixed granularity. For instance, data is organized as fixed-size sectors on a magnetic disk medium, or as a multiple of the sector size in a file system. For simplicity, this fixed granularity is referred to herein as a page. A snapshot thus consists of two components: (a) an index, which enumerates the pages that have been modified in the epoch, and (b) the modified pages themselves.
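By way of illustration only, the snapshot cycle of steps (a) through (e) and the index-plus-pages layout described above may be sketched as follows. The names `Snapshot`, `SnapshotTracker`, `write_page`, and `take_snapshot`, and the 4096-byte page size, are assumptions introduced for this sketch and do not appear in the description.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size in bytes; the description fixes no value


@dataclass
class Snapshot:
    epoch: int
    index: list[int]         # page numbers modified during the epoch
    pages: dict[int, bytes]  # contents of each modified page, keyed by page number


class SnapshotTracker:
    """Hypothetical helper that records page writes and emits one snapshot per epoch."""

    def __init__(self, data: bytearray):
        self.data = data
        self.base = bytes(data)          # step (a): base archive
        self.epoch = 0
        self.modified: set[int] = set()  # step (b): list of modifications

    def write_page(self, page_no: int, content: bytes) -> None:
        if len(content) != PAGE_SIZE:
            raise ValueError("content must be exactly one page long")
        start = page_no * PAGE_SIZE
        self.data[start:start + PAGE_SIZE] = content
        self.modified.add(page_no)       # record the modification

    def take_snapshot(self) -> Snapshot:
        index = sorted(self.modified)
        pages = {
            n: bytes(self.data[n * PAGE_SIZE:(n + 1) * PAGE_SIZE]) for n in index
        }
        snap = Snapshot(self.epoch, index, pages)  # step (c): index plus pages
        self.modified.clear()                      # step (d): reset the list
        self.epoch += 1                            # step (e): next epoch
        return snap
```

In this sketch a page written several times within one epoch appears once in the index, and the snapshot holds its contents as of the moment the snapshot is taken.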
A base archive and a set of snapshots can be used to restore the data to a desired point in time. This is achieved by (a) making a copy of the base archive, (b) locating all epochs that fall within the desired restoration period, and (c) starting with the lowest epoch, sequentially applying the snapshots that correspond to each epoch to the copy of the base archive. Since a snapshot consists of an index, which lists the modified pages, and the modified pages themselves, applying a snapshot involves overwriting each listed page in the copy of the base archive with the corresponding page in the snapshot. At the end of this series of operations, the copy of the base archive reflects the contents of the data at the corresponding point in time. This sequence of actions is depicted in steps 202-206 of the flowchart 200 of FIG. 2.
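The restoration steps (a) through (c) above may be sketched, under stated assumptions, as follows. The function name `restore`, the dictionary layout of a snapshot, and the 4-byte toy page size are hypothetical. The sketch also assumes that the restoration period runs from the base archive up to and including a target epoch.

```python
PAGE_SIZE = 4  # toy page size, chosen only to keep the example readable


def restore(base: bytes, snapshots: list[dict], target_epoch: int) -> bytes:
    """Rebuild the data as of target_epoch from a base archive and snapshots.

    Each snapshot is assumed to be a dict with keys "epoch", "index"
    (list of modified page numbers) and "pages" (page number -> bytes).
    """
    image = bytearray(base)  # step (a): copy of the base archive

    # Step (b): locate the epochs that fall within the restoration period.
    selected = [s for s in snapshots if s["epoch"] <= target_epoch]

    # Step (c): apply the snapshots sequentially, lowest epoch first.
    for snap in sorted(selected, key=lambda s: s["epoch"]):
        for page_no in snap["index"]:
            start = page_no * PAGE_SIZE
            image[start:start + PAGE_SIZE] = snap["pages"][page_no]

    return bytes(image)


base = b"AAAABBBBCCCC"  # three pages
snapshots = [
    {"epoch": 0, "index": [1], "pages": {1: b"bbbb"}},
    {"epoch": 1, "index": [1, 2], "pages": {1: b"XXXX", 2: b"cccc"}},
]

at_epoch_0 = restore(base, snapshots, 0)  # b"AAAAbbbbCCCC"
at_epoch_1 = restore(base, snapshots, 1)  # b"AAAAXXXXcccc"
```

The base archive itself is never modified; every restoration works on a fresh copy, and its cost grows with the number of snapshots in the period.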
While the above approach to restoring data from a series of snapshots works, it is expensive. Correctness requires that all snapshots within the restoration period be applied sequentially, from the lowest epoch to the highest. This ordering ensures that a page modified in more than one snapshot ends up with the data from the most recent of those snapshots. Hence, a consistent view of the data can be obtained only after all snapshots have been applied, thereby creating a new base image. This mode is sometimes referred to as offline access, since it requires the creation of a new base image to obtain a consistent view of the data.
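The ordering requirement may be illustrated with a small example in which one page is modified in two different epochs. The page contents and the helper name `apply_snapshots` are assumptions made for this illustration, and pages are modeled as dictionary entries rather than byte ranges.

```python
def apply_snapshots(base_pages: dict, snapshots: list[dict]) -> dict:
    """Apply snapshots in the order given; later entries overwrite earlier ones."""
    image = dict(base_pages)  # work on a copy of the base archive
    for snap in snapshots:
        image.update(snap)    # overwrite each modified page
    return image


base_pages = {0: "a0", 1: "b0"}
epoch_1 = {1: "b1"}           # page 1 modified in epoch 1
epoch_2 = {0: "a2", 1: "b2"}  # pages 0 and 1 modified in epoch 2

in_order = apply_snapshots(base_pages, [epoch_1, epoch_2])
out_of_order = apply_snapshots(base_pages, [epoch_2, epoch_1])

# in_order     -> {0: "a2", 1: "b2"}  page 1 holds its latest contents
# out_of_order -> {0: "a2", 1: "b1"}  page 1 holds stale data from epoch 1
```

Until the final snapshot has been applied, the image may contain pages from different points in time, which is why no consistent view is available partway through.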