The invention relates to a method of and apparatus for merging a sequence of delta files. In the sequence, the delta files define a series of changes between a base file and an updated file.
Co-pending British patent application no. 9817922.9, the teachings of which are incorporated herein by reference, describes a method of producing a checkpoint which describes a base file and a method of generating a difference file defining differences between an updated file and a base file. A checkpoint which describes a base file is produced by dividing the base file into a series of segments; generating for each segment a segment description; and creating from the generated segment descriptions a segment description structure as the checkpoint. The segment descriptions represent segments of the base file at a minimum level of resolution sufficient to represent distinctly the segment. A difference or delta file which defines differences between an updated file and the base file is produced by generating at different levels of resolution segment descriptions for segments in the updated file and comparing the generated segment descriptions with segment descriptions in the checkpoint to identify matching and non-matching segments. Data identifying segments in the updated file that match segments in the base file and data representing portions of the updated file at a minimum level of resolution sufficient to represent distinctly the portion are stored as the delta file.
Merging delta files by applying the first delta file to the base file, then applying the next delta file to the created file and so forth requires a significant amount of a backup repository's CPU time. This is especially so when the number of delta files reaches several thousand and a reconstruction of the latest version is required on a daily basis.
One existing method of merging delta files is the iterative build method, which successively merges a base file with one or two delta files, thus writing and reading a base file once for every delta file with which the base file is merged. When merging N delta files with a base file, the number of I/O operations is equal to
(2N+1)·[number of bytes in the base file]
An approach to overcome the poor performance of iterative build methods is disclosed in U.S. Pat. No. 5,745,906 to Squibb. This specification describes a method which is said to use 2·[number of bytes in the base file] I/O operations to merge a sequence of delta files. However, Squibb does not include the number of I/O operations required by the search requests used to merge delta files. Squibb's method processes the delta files starting with the latest one and initiating search requests in previous versions. This approach cannot be used in an environment where delta files can only be provided in the order in which they were created, e.g. backup repositories on magnetic tapes. Furthermore, even 2·[number of bytes in the base file] I/O operations may be impractical in situations where big files (e.g. database files) have been backed up several times, but have changed only very little between each backup.
Squibb uses a one Gigabyte database with a typical 0.5% change as an example to point out the advantages of his algorithm. Such a database would typically lead to 5.1 Megabyte delta files with the method described in our aforementioned co-pending patent application. Thus, merging 50 such delta files would require at least 2 billion I/O operations using Squibb's method. Using the methods to be described hereinbelow would only require about one eighth of Squibb's 2 billion, i.e. 255 million I/O operations.
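The figures quoted above can be checked with a short calculation. The following Python sketch is illustrative only: the function names are hypothetical, the base file is assumed to be exactly 10^9 bytes and each delta file exactly 5.1 x 10^6 bytes, and the figure for Squibb's method excludes the search-request I/O that the cited specification does not count.

```python
def iterative_build_io(n_deltas: int, base_bytes: int) -> int:
    """I/O count for the iterative build method: (2N+1) * base size."""
    return (2 * n_deltas + 1) * base_bytes


def squibb_io(base_bytes: int) -> int:
    """I/O count claimed for Squibb's method: 2 * base size (search I/O excluded)."""
    return 2 * base_bytes


def sequential_merge_io(delta_sizes: list) -> int:
    """I/O count when each delta file is read once: sum of the delta file sizes."""
    return sum(delta_sizes)


BASE_BYTES = 1_000_000_000   # assumed: one Gigabyte taken as 10^9 bytes
DELTA_BYTES = 5_100_000      # assumed: 5.1 Megabyte taken as 5.1 * 10^6 bytes
N_DELTAS = 50

iterative = iterative_build_io(N_DELTAS, BASE_BYTES)
squibb = squibb_io(BASE_BYTES)
merged = sequential_merge_io([DELTA_BYTES] * N_DELTAS)

print(f"iterative build : {iterative:,}")
print(f"Squibb (claimed): {squibb:,}")
print(f"sequential merge: {merged:,}")
print(f"ratio merge/Squibb: {merged / squibb:.4f}")
```

Under these assumptions the ratio is 0.1275, which is consistent with the "about one eighth" stated above.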
The invention aims to provide a method and apparatus which reduces the number of operations carried out by the CPU at a backup repository.
As daily or even more frequent backups of files soon produce large numbers of delta files, each corresponding to one version in the history of the file, most of the versions normally become obsolete. However, the obsolete delta files cannot simply be deleted, as the information contained in them might be needed to reconstruct later versions. Therefore, to save time when reconstructing a version and space in the backup repository, a method of combining a sequence of delta files to create one delta file is needed. The invention also aims to provide an efficient apparatus and method for deleting a sequence of delta files and replacing them with one combined delta file without losing the capability to restore later versions.
Another aim of the invention is to provide a method and apparatus to merge delta files in a way that only needs the presence of one delta file at any time and processes the delta files in the order they were created.
The invention further aims to provide a method of merging delta files in such a way that the number of I/O operations needed is equal to the sum of the bytes in the delta files being merged.
The invention aims to provide a method of maintaining tokens in delta files in such a way that a restore operation only requires those delta files that hold unique bytes actually used in the version to be restored. The same method can be used for deleting intermediate versions. This enables the creation of a data structure filled with information from the first of a sequence of delta files. The data structure is then updated by consecutively reading and analysing the remaining delta files. The data structure may be held in a disk file or, for additional performance, in memory. Merging the sequence of delta files may be used to create one or more bi-directional delta files which allow reconstruction of updated files with reduced I/O operations, or even bi-directional delta files without any references to previous delta files. Bi-directional delta files allow reconstruction of a base file given the updated file and the special bi-directional file.
The merging of a sequence of delta files is broken down into merging two data structures. All references to a previous delta file are replaced either with a reference to the actual location where the bytes in question are held or, where the method and apparatus are used in an environment in which multiple accesses of delta files are impractical or impossible (e.g. backup repositories held on magnetic tapes), with the bytes themselves.
The invention aims to provide a method where the tokens associated with a delta file hold information about the particular version the unique bytes involved are held in. This enables restores and deletes to access only those delta files that are needed for the operation without having to scan through other delta files.
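By way of illustration only, the following Python sketch shows how tokens that record the version holding their unique bytes might allow a restore to open only the delta files it needs. The token layout (`Located`), the numbering of versions (0 for the base file, k for the k-th delta file) and the `stores` mapping are assumptions made for this example and are not taken from the specification.

```python
from typing import Dict, List, NamedTuple, Set


class Located(NamedTuple):
    """Hypothetical token: where a run of bytes is actually held."""
    version: int  # assumed numbering: 0 = base file, k = k-th delta file
    offset: int   # start of the run within that file's stored bytes
    length: int   # number of bytes in the run


def versions_needed(tokens: List[Located]) -> Set[int]:
    """Return the set of files a restore must access for these tokens."""
    return {tok.version for tok in tokens}


def restore(tokens: List[Located], stores: Dict[int, bytes]) -> bytes:
    """Rebuild a version by copying each run from the file that holds it.

    `stores` maps a version number to the bytes held in that file; only
    the versions reported by versions_needed() have to be present.
    """
    return b"".join(
        stores[tok.version][tok.offset:tok.offset + tok.length]
        for tok in tokens
    )


# Example: the restored version uses bytes from the base file and from
# delta files 1 and 3; delta file 2 is never opened.
tokens = [Located(3, 0, 1), Located(0, 2, 2), Located(1, 0, 2), Located(0, 6, 1)]
stores = {0: b"ABCDEFGH", 1: b"xy", 3: b"Z"}
print(sorted(versions_needed(tokens)))
print(restore(tokens, stores))
```

In this example the store for delta file 2 is deliberately absent, showing that the restore does not have to scan it.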
According to one aspect of the invention there is provided a method of merging a sequence of delta files that together define a series of changes between a base file and an updated file, each delta file defining one or more changes in terms of one or more unique tokens each identifying original data or of one or more reuse tokens identifying data reused from the immediately preceding delta file or the base file, the method comprising: creating an initial merge structure from the base file and the first delta file in the sequence; creating a further merge structure from the initial merge structure and the next delta file in the sequence by comparing tokens in the initial and further merge structures and replacing reuse tokens in the further merge structure with tokens in the initial merge structure; replacing the initial merge structure with the further merge structure so that the further merge structure becomes the initial merge structure; and repeating the operations of creating a further merge structure and replacing the initial merge structure with the further merge structure, for all delta files in sequence order, whereby the thus created merge structure represents all changes between the base file and the updated file.
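The merging loop of this aspect can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the embodiment itself: the token classes `Unique`, `Reuse` and `BaseRef`, the representation of a merge structure as an ordered list of tokens, and all function names are hypothetical. A reuse token is assumed to name a byte range in the immediately preceding version, and the sketch replaces it with the tokens of the current merge structure that cover that range.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Unique:
    data: bytes   # original bytes carried by a delta file


@dataclass(frozen=True)
class Reuse:
    offset: int   # start of the reused range in the preceding version
    length: int


@dataclass(frozen=True)
class BaseRef:
    offset: int   # start of the reused range in the base file
    length: int


Merged = Union[Unique, BaseRef]


def token_len(tok: Merged) -> int:
    return len(tok.data) if isinstance(tok, Unique) else tok.length


def slice_structure(structure: List[Merged], offset: int, length: int) -> List[Merged]:
    """Return the tokens of `structure` covering bytes [offset, offset+length)."""
    out: List[Merged] = []
    pos, end = 0, offset + length
    for tok in structure:
        n = token_len(tok)
        lo, hi = max(offset, pos), min(end, pos + n)
        if lo < hi:  # this token overlaps the requested range
            a, b = lo - pos, hi - pos
            if isinstance(tok, Unique):
                out.append(Unique(tok.data[a:b]))
            else:
                out.append(BaseRef(tok.offset + a, b - a))
        pos += n
        if pos >= end:
            break
    return out


def merge_deltas(deltas: List[List[Union[Unique, Reuse]]]) -> List[Merged]:
    """Fold the delta files, in creation order, into one merge structure."""
    it = iter(deltas)
    # Initial merge structure: reuse tokens of the first delta refer to the base file.
    current: List[Merged] = [
        BaseRef(t.offset, t.length) if isinstance(t, Reuse) else t for t in next(it)
    ]
    for delta in it:
        further: List[Merged] = []
        for tok in delta:
            if isinstance(tok, Reuse):
                # Replace the reuse token with tokens from the current structure.
                further.extend(slice_structure(current, tok.offset, tok.length))
            else:
                further.append(tok)
        current = further  # the further structure becomes the initial one
    return current
```

Only one delta file and one merge structure are needed at any time in this sketch, and the delta files are consumed in the order in which they were created. Adjacent tokens are not coalesced here; that refinement is omitted for brevity.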
According to another aspect of the invention there is provided a method of creating a current file from an initial file and a set of difference files that defines a sequence of changes between the initial file and the current file, the method comprising: merging the difference files to remove redundant information therefrom and thus create a changes file representing all changes to be applied to the initial file in order to arrive at the current file; and modifying the initial file using the information in the changes file.
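The second step of this aspect, modifying the initial file using the changes file, might look like the following Python sketch. The tuple encoding of the changes file (`("unique", bytes)` for original data and `("base", offset, length)` for a range copied from the initial file) and the function name `apply_changes` are assumptions made purely for illustration.

```python
from typing import List, Tuple


def apply_changes(initial: bytes, changes: List[Tuple]) -> bytes:
    """Build the current file from the initial file and a merged changes list."""
    out = bytearray()
    for tok in changes:
        kind = tok[0]
        if kind == "unique":
            # Original data carried in the changes file itself.
            out += tok[1]
        elif kind == "base":
            # A byte range copied unchanged from the initial file.
            _, offset, length = tok
            out += initial[offset:offset + length]
        else:
            raise ValueError(f"unknown token kind: {kind!r}")
    return bytes(out)


# Example: a small initial file and a changes list of four tokens.
initial_file = b"ABCDEFGH"
changes_file = [("unique", b"Z"), ("base", 2, 2), ("unique", b"xy"), ("base", 6, 1)]
current_file = apply_changes(initial_file, changes_file)
print(current_file)
```

Because the changes list refers only to the initial file and to its own unique bytes, the initial file is read once and no intermediate version has to be materialised.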
According to a further aspect of the invention there is provided an apparatus for merging a sequence of delta files that together define a series of changes between a base file and an updated file, each delta file defining one or more changes in terms of one or more unique tokens each identifying original data or of one or more reuse tokens identifying data reused from the immediately preceding delta file or the base file, the apparatus comprising: means for creating an initial merge structure from the base file and the first delta file in the sequence; means for creating a further merge structure from the initial merge structure and the next delta file in the sequence by comparing tokens in the initial and further merge structures and replacing reuse tokens in the further merge structure with tokens in the initial merge structure; means for replacing the initial merge structure with the further merge structure so that the further merge structure becomes the initial merge structure; and means for repeating the operations of creating a further merge structure and replacing the initial merge structure with the further merge structure, for all delta files in sequence order, whereby the thus created merge structure represents all changes between the base file and the updated file.
According to another aspect of the invention there is provided an apparatus for creating a current file from an initial file and a set of difference files that defines a sequence of changes between the initial file and the current file, the apparatus comprising: means for merging the difference files to remove redundant information therefrom and thus create a changes file representing all changes to be applied to the initial file in order to arrive at the current file; and means for modifying the initial file using the information in the changes file.
The above and further features of the invention are set forth with particularity in the appended claims and together with advantages thereof will become clearer from consideration of the following detailed description of an exemplary embodiment of the invention given with reference to the accompanying drawings.