Data processing applications allow their users to create, modify, and delete files over time. A file version represents a particular iteration of a file at a point in time. Such iterations can be the same as or different from the originally created file and/or its other versions. Some files may have no versions (i.e., only a single original file), only a few versions, or a plurality of versions. An efficient way to store versions of files or segments of files over time is to delta compress the versions against each other and store them in a version chain. Version chains are typically linear data structures that hold the contents of versions of the same or similar files or segments of files over time. For example, a segment that is originally created and then modified four times can have a version chain consisting of five versions, which would represent the file or file segment at five different points in time.
To reduce storage space, file versions are typically stored in a compressed format, such as a delta-compressed format. Delta compression or delta encoding is a way of storing or transmitting data in the form of differences between versions of a file or file segment rather than complete files. The differences are recorded in discrete files, which are called “deltas.”
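As a minimal sketch of delta encoding, the following Python example records a new version as a list of "copy" and "insert" operations against a base version. The function names `make_delta` and `apply_delta` and the operation layout are illustrative assumptions, not a format described in this document; the matching is delegated to the standard library's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes) -> list:
    """Describe `target` as copy/insert operations against `base` (illustrative format)."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Bytes shared with the base are referenced, not stored again.
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # Only bytes that are new in the target are stored literally.
            ops.append(("insert", target[j1:j2]))
        # "delete": bytes present only in the base are simply never copied.
    return ops


def apply_delta(base: bytes, delta: list) -> bytes:
    """Rebuild the target version from the base version and its delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += base[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


v1 = b"the quick brown fox"
v2 = b"the quick red fox"
delta = make_delta(v1, v2)
restored = apply_delta(v1, delta)
```

Here `restored` equals `v2`, while `delta` holds literal bytes only for the changed region; production delta compressors use more compact encodings than this sketch.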
In some cases, the version chains can be represented as linear reverse delta version chains, where the most recent version is stored in its whole form and each earlier version is stored as a difference/delta file relative to the next more recent version in the linear chain. Conversely, a forward delta version chain maintains the first version of a file in its whole form, and creates delta files forward from that first version.
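A linear reverse delta version chain can be sketched as follows. This is an illustrative model only: the class name `ReverseDeltaChain`, its methods, and the copy/insert delta codec are assumptions made for the example, not structures defined in this document.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes) -> list:
    """Illustrative codec: describe `target` as copy/insert operations against `base`."""
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, base, target, autojunk=False).get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))
    return ops


def apply_delta(base: bytes, delta: list) -> bytes:
    """Rebuild a version from `base` and a delta produced by make_delta."""
    out = bytearray()
    for op in delta:
        out += base[op[1]:op[2]] if op[0] == "copy" else op[1]
    return bytes(out)


class ReverseDeltaChain:
    """Newest version stored whole; each older version stored as a delta."""

    def __init__(self, first_version: bytes):
        self.head = first_version  # most recent version, stored in whole form
        self.deltas = []           # deltas[i] rebuilds version i from version i + 1

    def add_version(self, new_version: bytes) -> None:
        # The previous head is replaced by a delta computed against the new head.
        self.deltas.append(make_delta(new_version, self.head))
        self.head = new_version

    def restore(self, index: int) -> bytes:
        """Rebuild version `index` (0 = oldest) by walking back from the head."""
        data = self.head
        for delta in reversed(self.deltas[index:]):
            data = apply_delta(data, delta)
        return data


versions = [b"draft", b"draft one", b"draft one, revised", b"final one, revised"]
chain = ReverseDeltaChain(versions[0])
for version in versions[1:]:
    chain.add_version(version)
```

A forward delta version chain would invert this arrangement: the first version stays whole and each `add_version` call would append a delta from the previous version to the new one.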
While a linear arrangement of delta versions can be one of the simplest data structures for version chains, there are operations on version chains that make the linear arrangement of deltas inefficient, more prone to data loss, and/or cumbersome, as indicated below.
One of these operations is accessing an earlier version of a file, which is a linear process whose processing time is directly proportional to the distance of that version from the most recent version in a reverse delta version chain. The shorter the distance from the most recent version to the desired version, the less time is needed to recreate that earlier version. As that distance increases, the operation consumes more time and processing power.
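The linear cost can be written out under simple assumptions. The notation below is introduced only for illustration: a reverse delta chain holds versions $v_1, \dots, v_n$ with $v_n$ stored whole, $t_i$ is the time to apply the delta that rebuilds $v_i$ from $v_{i+1}$, and $\bar{t}$ is an assumed average of those times.

```latex
% Restoring v_k requires applying every delta between v_n and v_k.
T(k) \;=\; \sum_{i=k}^{n-1} t_i \;\approx\; (n - k)\,\bar{t},
\qquad 1 \le k \le n .
% Most recent version: T(n) = 0 (no delta is applied).
% Oldest version:      T(1) \approx (n - 1)\,\bar{t}.
```

For example, in a chain of $n = 100$ versions, restoring $v_{99}$ applies one delta, whereas restoring $v_1$ applies 99.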
Another operation is deleting a single delta version from anywhere in the version chain other than its ends. This can require decompressing all versions more recent than the version to be deleted in order to remove that version and reconnect its two adjacent versions to each other. This can again be a time- and processing-intensive operation.
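The work involved in deleting an interior version can be sketched as follows, assuming a reverse delta chain represented by a whole `head` and a list `deltas` in which `deltas[i]` rebuilds version `i` from version `i + 1`. The function `delete_version`, this representation, and the copy/insert codec are hypothetical and serve only to illustrate the steps.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes) -> list:
    """Illustrative codec: describe `target` as copy/insert operations against `base`."""
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, base, target, autojunk=False).get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))
    return ops


def apply_delta(base: bytes, delta: list) -> bytes:
    """Rebuild a version from `base` and a delta produced by make_delta."""
    out = bytearray()
    for op in delta:
        out += base[op[1]:op[2]] if op[0] == "copy" else op[1]
    return bytes(out)


def restore(head: bytes, deltas: list, index: int) -> bytes:
    """Rebuild version `index` (0 = oldest) by walking back from the head."""
    data = head
    for delta in reversed(deltas[index:]):
        data = apply_delta(data, delta)
    return data


def delete_version(head: bytes, deltas: list, index: int) -> list:
    """Remove interior version `index` and return the re-linked delta list."""
    if not 0 < index < len(deltas):
        raise ValueError("only interior versions need re-linking")
    # Every delta newer than the deleted version must be applied first
    # to rebuild its more recent neighbour (version index + 1).
    newer = restore(head, deltas, index + 1)
    # Two further applications rebuild the deleted version and its older neighbour.
    doomed = apply_delta(newer, deltas[index])
    older = apply_delta(doomed, deltas[index - 1])
    # One bridging delta replaces the two deltas that touched the deleted version.
    bridge = make_delta(newer, older)
    return deltas[:index - 1] + [bridge] + deltas[index + 1:]


versions = [b"alpha", b"alpha beta", b"alpha beta gamma", b"beta gamma", b"beta gamma delta"]
head = versions[-1]
deltas = [make_delta(versions[i + 1], versions[i]) for i in range(len(versions) - 1)]
new_deltas = delete_version(head, deltas, 2)  # drop b"alpha beta gamma"
```

After the call, `new_deltas` describes a four-version chain; the number of delta applications needed to get there grows with the distance between the deleted version and the head.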
If a delta version within a version chain is determined to have become corrupted, all earlier versions are rendered unavailable, since their regeneration requires all of the more recent versions to be error free. Hence, there is a need to reduce the probability of data loss by significantly reducing the number of deltas that must be error free in order to successfully restore an earlier version of a segment or file.
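The dependence of data loss on chain length can be illustrated with a simple model. The model is an assumption made for this example only: the whole most recent version is intact, and each delta is error free independently with the same probability $p$.

```latex
% Restoring a version that lies m deltas behind the most recent version
% succeeds only if all m deltas are error free:
P_{\text{restore}}(m) \;=\; p^{\,m} .
% Illustrative numbers with p = 0.999:
%   m = 100 : 0.999^{100} \approx 0.905
%   m = 10  : 0.999^{10}  \approx 0.990
```

Under this model, cutting the number of deltas on the restore path from 100 to 10 lowers the chance of an unrecoverable version from roughly 9.5% to roughly 1%.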
Version chains are typically arranged in a linear format. Version chains can also be implemented in a binary tree data structure to reduce the overall time needed to access earlier versions. However, if a primary goal of a version chain is to minimize the data storage capacity consumed, and it is assumed that two versions adjacent in time can produce a smaller delta file than two versions separated by a larger period of time, then a binary tree version chain can produce suboptimal data storage compression.
Thus, there is a need for a system and method for storing data that involves an improved delta version chain data structure, where the structure can be configured to mitigate various issues with the linear and binary-tree structures discussed above.