Data processing applications allow their users to create, modify, and delete files over time. A file version represents a particular iteration of a file at a point in time. Such iterations can be the same as or different from the originally created file and/or its other versions. Some files may have no versions (i.e., only a single original file), only a few versions, or a plurality of versions. An efficient way to store versions of files or segments of files over time is to delta compress the versions against each other and store them in a version chain. Version chains are typically linear data structures that hold the contents of versions of the same or similar files or segments of files over time. For example, a segment that is originally created and then modified four times can have a version chain consisting of a head and four versions, which together represent the file or file segment at five different points in time.
To reduce storage space, file versions are typically stored in a compressed format, such as a delta-compressed format. Delta-compression or delta-encoding is a way of storing or transmitting data in the form of differences between versions of a file or file segment rather than as complete files. The differences are recorded in discrete files, which are called “patch files” or “deltas.” Many conventional applications employ delta-compression or delta-encoding. Examples include source control systems, backup data deduplication systems, content management systems, software distribution systems, and other systems.
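As an illustration of delta-encoding in general, the following minimal sketch records one version as copy and insert operations against another and then rebuilds it. The function names `make_delta` and `apply_delta` and the tuple-based delta format are assumptions made for this example only; they are not taken from any particular system described here. Python's standard `difflib.SequenceMatcher` is used merely as a convenient way to find matching regions.

```python
from difflib import SequenceMatcher

def make_delta(base: bytes, target: bytes) -> list:
    """Describe `target` as copy/insert operations against `base` (hypothetical format)."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))           # reuse bytes already present in base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))  # literal bytes that are new in target
        # "delete" emits nothing: those base bytes are simply never copied
    return ops

def apply_delta(base: bytes, delta: list) -> bytes:
    """Rebuild the target version from `base` and a delta produced by make_delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += base[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)

v1 = b"The quick brown fox jumps over the lazy dog."
v2 = b"The quick red fox jumps over the lazy dog!"
delta = make_delta(v1, v2)
restored = apply_delta(v1, delta)
```

Only `v1` and `delta` need to be stored in order to recover `v2`; the saving grows as the two versions share more content.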
In some cases, version chains can be represented as linear reverse delta version chains, where the most recent version is stored in its whole form and all earlier versions are stored as difference/delta files from each other in the linear chain. Conversely, a forward delta version chain maintains the original version of a file in its whole form and creates delta files forward from that first version.
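A linear reverse delta version chain can be sketched as follows. The class `ReverseDeltaChain`, its methods, and the toy prefix/suffix delta used here are all hypothetical names and simplifications chosen for illustration; a real system would use a far stronger delta compressor.

```python
def make_delta(base: bytes, target: bytes) -> tuple:
    """Toy delta: shared prefix/suffix lengths plus the differing middle of target."""
    limit = min(len(base), len(target))
    p = 0
    while p < limit and base[p] == target[p]:
        p += 1
    s = 0
    while s < limit - p and base[len(base) - 1 - s] == target[len(target) - 1 - s]:
        s += 1
    return (p, s, target[p:len(target) - s])

def apply_delta(base: bytes, delta: tuple) -> bytes:
    p, s, middle = delta
    return base[:p] + middle + base[len(base) - s:]

class ReverseDeltaChain:
    """Most recent version stored whole; each earlier version stored as a delta."""

    def __init__(self, first: bytes):
        self.head = first   # whole form of the most recent version
        self.deltas = []    # deltas[i] turns the version i steps back into the one i+1 steps back

    def add_version(self, new: bytes) -> None:
        # The previous head is demoted to a delta against the new head.
        self.deltas.insert(0, make_delta(new, self.head))
        self.head = new

    def restore(self, steps_back: int) -> bytes:
        # Walk from the head toward older versions, one delta at a time.
        data = self.head
        for delta in self.deltas[:steps_back]:
            data = apply_delta(data, delta)
        return data

chain = ReverseDeltaChain(b"draft one")
chain.add_version(b"draft one, revised")
chain.add_version(b"draft two, revised")
oldest = chain.restore(2)
```

A forward delta chain would mirror this arrangement: the first version is kept whole and each `add_version` appends a delta from the previous version to the new one.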
While a linear arrangement of delta versions can be one of the simplest data structures for version chains, there are operations on version chains for which the linear arrangement of deltas is inefficient, more prone to data loss, and/or cumbersome, as indicated below. One of these operations is accessing an earlier version of a file, which is a linear process whose processing time is directly proportional to the distance of that version from the whole-form version in the chain. In a reverse delta version chain, the shorter the distance from the most recent version to the desired version, the less time is needed to recreate that earlier version. As that distance increases, however, the operation consumes more time and processing power.
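The linear cost of accessing a version can be made concrete with a small cost model. The function below is a hypothetical helper, not part of any described system; it assumes that rebuilding a version costs one delta application per link between that version and the whole-form version, and it numbers versions from 0 (oldest) to `chain_length - 1` (most recent).

```python
def deltas_to_apply(chain_length: int, version_index: int, layout: str = "reverse") -> int:
    """Count the delta applications needed to rebuild one version of a linear chain."""
    if not 0 <= version_index < chain_length:
        raise ValueError("version_index out of range")
    if layout == "reverse":
        # Whole form is the newest version; walk backward to the requested one.
        return chain_length - 1 - version_index
    if layout == "forward":
        # Whole form is the oldest version; walk forward to the requested one.
        return version_index
    raise ValueError("layout must be 'reverse' or 'forward'")

# Five versions in a reverse delta chain, oldest first:
costs = [deltas_to_apply(5, i) for i in range(5)]  # [4, 3, 2, 1, 0]
```

Under this model the oldest version of a reverse delta chain is always the most expensive to recreate, and that cost grows by one delta application for every version added.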
Another operation is deleting a single delta version from anywhere in the version chain other than its ends. This can require decompressing every version that is more recent than the version to be deleted in order to remove that version and reconnect its two adjacent versions to each other. This can again be a time- and processing-intensive operation.
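The deletion of an interior version from a reverse delta chain can be sketched as below. The chain is modeled as a whole-form `head` plus a list `deltas`, where `deltas[i]` turns the version `i` steps behind the head into the version `i + 1` steps behind it. The names `delete_version` and `restore_all` and the toy prefix/suffix delta are assumptions made purely for illustration.

```python
def make_delta(base: bytes, target: bytes) -> tuple:
    """Toy delta: shared prefix/suffix lengths plus the differing middle of target."""
    limit = min(len(base), len(target))
    p = 0
    while p < limit and base[p] == target[p]:
        p += 1
    s = 0
    while s < limit - p and base[len(base) - 1 - s] == target[len(target) - 1 - s]:
        s += 1
    return (p, s, target[p:len(target) - s])

def apply_delta(base: bytes, delta: tuple) -> bytes:
    p, s, middle = delta
    return base[:p] + middle + base[len(base) - s:]

def delete_version(head: bytes, deltas: list, steps_back: int) -> int:
    """Remove the interior version `steps_back` behind head; return delta applications used."""
    if not 0 < steps_back < len(deltas):
        raise ValueError("only interior versions are handled in this sketch")
    data, applied = head, 0
    # Every more recent version must be rebuilt to reach the newer neighbour.
    for delta in deltas[:steps_back - 1]:
        data = apply_delta(data, delta)
        applied += 1
    newer = data
    doomed = apply_delta(newer, deltas[steps_back - 1])  # the version being deleted
    older = apply_delta(doomed, deltas[steps_back])      # its older neighbour
    applied += 2
    # Replace the two deltas around the deleted version with one bridging delta.
    deltas[steps_back - 1:steps_back + 1] = [make_delta(newer, older)]
    return applied

def restore_all(head: bytes, deltas: list) -> list:
    """Rebuild every version, newest first."""
    out = [head]
    for delta in deltas:
        out.append(apply_delta(out[-1], delta))
    return out

versions = [b"alpha", b"alpha beta", b"alpha beta gamma",
            b"alpha BETA gamma", b"alpha BETA gamma delta"]
head = versions[-1]
deltas = [make_delta(versions[i], versions[i - 1]) for i in range(len(versions) - 1, 0, -1)]
applied = delete_version(head, deltas, 2)  # removes b"alpha beta gamma"
remaining = restore_all(head, deltas)
```

The number of delta applications returned grows with the distance of the deleted version from the head, which is the cost the paragraph above describes.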
If a delta version within a version chain is determined to have become corrupted, all earlier versions are rendered unavailable, since their regeneration depends on all of the more recent versions being error free. Hence, there is a need to reduce the probability of data loss by significantly reducing the number of deltas that must be error free in order to successfully restore an earlier version of a segment or file.
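The relationship between the number of deltas that must be error free and the likelihood of a successful restore can be illustrated with a simple model. The model is an assumption introduced here for illustration only: each stored delta is taken to be corrupted independently with the same probability `p_corrupt`, and the whole-form version is taken to be intact. The function name is hypothetical.

```python
def restore_success_probability(deltas_needed: int, p_corrupt: float) -> float:
    """Probability that all deltas needed for a restore are error free.

    Assumes independent corruption with equal probability per delta.
    """
    if deltas_needed < 0 or not 0.0 <= p_corrupt <= 1.0:
        raise ValueError("invalid arguments")
    return (1.0 - p_corrupt) ** deltas_needed

# Compare a long linear dependency with a much shorter one
# (the per-delta corruption rate of 0.001 is an arbitrary example value).
long_path = restore_success_probability(1000, 0.001)
short_path = restore_success_probability(10, 0.001)
```

Under these assumptions the success probability decays geometrically with the number of deltas on the restore path, so shortening that path directly lowers the chance of losing an earlier version.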