The current art includes backup systems that are designed to handle the backup of multiple clients. Each backup client can define one or more backup sets, where a backup set is a predefined collection of files and folders to be backed up during a backup session.
A data set is the basic unit of data for which an incremental change is recognized by the backup system. Backup systems of the current art can recognize an incremental change at the file level, at the file fraction level (block level, for example predefined 4K blocks), or at the level of the basic physical storage unit (allocation unit). These systems copy every data set within the backup set during the first backup session, and during subsequent backup sessions copy only the data sets that have changed since the last backup session. This method reduces the amount of required storage space and communication bandwidth.
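The file-level case of this incremental scheme can be illustrated with a minimal sketch. It assumes that a change is detected by comparing a content signature against the one recorded in the previous session; the function names and the choice of SHA-256 are illustrative assumptions, not details taken from any particular backup system.

```python
import hashlib
from typing import Dict


def signature(content: bytes) -> str:
    # Hypothetical signature function; SHA-256 is an assumption of this sketch.
    return hashlib.sha256(content).hexdigest()


def changed_data_sets(previous: Dict[str, str],
                      current: Dict[str, bytes]) -> Dict[str, bytes]:
    """Return the data sets whose content differs from the previous session.

    previous maps address -> signature recorded in the last session.
    current maps address -> content found on the client now.
    """
    changed = {}
    for address, content in current.items():
        if previous.get(address) != signature(content):
            changed[address] = content
    return changed


# First session: nothing was recorded before, so every data set is copied.
first = {"docs/a.txt": b"alpha", "docs/b.txt": b"beta"}
to_copy_1 = changed_data_sets({}, first)
recorded = {addr: signature(c) for addr, c in first.items()}

# Second session: only the data set that changed is copied.
second = {"docs/a.txt": b"alpha", "docs/b.txt": b"beta v2"}
to_copy_2 = changed_data_sets(recorded, second)
```

A block-level variant would apply the same comparison to each fixed-size block of a file rather than to the file as a whole.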
Another technique that is used to further reduce communication bandwidth and storage space requirements is to store on the backup destination a single copy of each unique data set content. Each such copy serves as the backup copy for every data set that has identical content. The data sets with identical content can belong to the same backup set, or to different backup sets that are located either on the same computer or on different computers. In the terminology of this document, each piece of data that is stored on the backup system to serve as the backup copy of a data set is referred to hereafter as a stored data element.
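As a rough illustration of this single-copy technique, the sketch below keys each stored data element by a signature of its content, so identical content arriving from different backup sets or different computers occupies storage only once. The class name and the use of SHA-256 are assumptions made for the example.

```python
import hashlib


class StoredDataElements:
    """Hypothetical store that keeps one copy per unique data set content."""

    def __init__(self):
        self.elements = {}  # signature -> content

    def store(self, content: bytes) -> str:
        sig = hashlib.sha256(content).hexdigest()
        # Only the first arrival of a given content is actually stored.
        self.elements.setdefault(sig, content)
        return sig


store = StoredDataElements()
sig_a = store.store(b"quarterly report")  # data set on client A
sig_b = store.store(b"quarterly report")  # identical data set on client B
sig_c = store.store(b"meeting notes")     # different content
```

Both clients receive the same signature for the shared content, and the store holds two stored data elements rather than three.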
In an ordinary backup system, every data set that belongs to the backup set is copied to the backup storage in each backup session. In this kind of backup system the backup set is easy to reconstruct, since every data set is backed up in every backup session and the data sets preserve their original relative position within the backup set (directory structure). However, some incremental backup systems of the current art store a data element only for a data set that has changed since the previous backup session, and some, as described earlier, also share a stored data element among data sets with identical content. Therefore, the structure of the backup set cannot be recovered from the copied data sets alone. Hence, for each backup session a full inventory of the backup set is produced and sent to the backup system as the metadata of the backup session.
These backup set inventories include several parameters that define the position and content of each data set during the backup session. For each data set, these parameters include the data set address within the backup set and a unique signature that represents the content of the data set with a smaller amount of data. When the incremental backup is done at the file level, the address includes the path to the file. When it is done at the block level, the address includes the path to the file and the block position within the file. When it is done at the level of the basic physical storage unit, the address includes the path to the file and information such as the plate, track and sector where the data set is located.
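The inventory entries described above can be sketched as a small data structure. The example covers the file-level and block-level cases only; the field names, the 4096-byte block size, and the SHA-256 signature are assumptions chosen for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional

BLOCK_SIZE = 4096  # assumed block size for the block-level case


@dataclass(frozen=True)
class InventoryEntry:
    path: str                   # path to the file within the backup set
    block_index: Optional[int]  # None for file-level entries
    signature: str              # compact representation of the content


def file_level_entry(path: str, content: bytes) -> InventoryEntry:
    return InventoryEntry(path, None, hashlib.sha256(content).hexdigest())


def block_level_entries(path: str, content: bytes) -> List[InventoryEntry]:
    entries = []
    for offset in range(0, len(content), BLOCK_SIZE):
        block = content[offset:offset + BLOCK_SIZE]
        entries.append(InventoryEntry(path, offset // BLOCK_SIZE,
                                      hashlib.sha256(block).hexdigest()))
    return entries


content = b"x" * 4096 + b"y" * 100
whole = file_level_entry("docs/report.bin", content)
blocks = block_level_entries("docs/report.bin", content)
```

Here the file yields one file-level entry, or two block-level entries whose addresses differ only in the block position.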
Recently the backup market has shown a strong demand for very frequent backup sessions, so that if a failure occurs the amount of lost data will be minimal. The market also requires that several backup snapshots be held on the backup system (second tier storage) for every backup set before they are deleted or moved to a longer-term archive (third tier storage). Each backup snapshot is referred to in this invention as a backup revision. Holding several backup revisions enables a fast restore from a choice of revisions. Each backup session produces a backup revision that is stored on the backup system. The collection of backup revisions that were taken for a specific backup set and are saved on the second tier storage is considered a ‘backup group’.
Life cycle management of the stored data is required in order to keep the second tier storage space from growing endlessly. Therefore a backup revision retention strategy should be employed. Under this strategy, backup revisions expire from the second tier storage. An expired backup revision has to be deleted from the second tier storage, and in some cases has to be copied to a third tier storage as well. In most retention strategies, once several backup sessions have been taken for a certain backup group, some older backup revision has to be expired after each new backup session is taken.
If, for example, a backup session is taken for a certain backup set at 30-minute intervals, and the backup revision retention strategy is set to hold the last 20 backup revisions, then after 10 hours the backup system will have to expire the oldest backup revision whenever a new backup session is taken. During such a backup revision expiration process, the stored data elements that are no longer needed by any of the non-expired backup revisions stored on the second tier storage have to be located. This means that in each backup session the backup system will typically be engaged both in accepting the new backup revision and in expiring an older backup revision from the second tier backup destination.
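The retention example above can be traced with a minimal sketch of a "keep the last N revisions" strategy. The function name and the representation of revisions as sequence numbers are assumptions of the example; real retention strategies may be more elaborate.

```python
from collections import deque

INTERVAL_MINUTES = 30
KEEP_LAST = 20


def take_sessions(n_sessions: int, keep_last: int):
    """Simulate n_sessions backup sessions under a keep-last-N strategy.

    Returns (retained revisions, expired revisions), oldest first.
    """
    retained = deque()
    expired = []
    for revision in range(1, n_sessions + 1):
        retained.append(revision)
        if len(retained) > keep_last:
            # The oldest revision expires once the limit is exceeded.
            expired.append(retained.popleft())
    return list(retained), expired


# Time needed to fill the retention window: 20 x 30 minutes = 10 hours.
hours_to_fill = KEEP_LAST * INTERVAL_MINUTES / 60

retained_20, expired_20 = take_sessions(20, KEEP_LAST)
retained_21, expired_21 = take_sessions(21, KEEP_LAST)
```

Up to the twentieth session nothing expires; from the twenty-first session onward every new backup revision forces the expiration of the oldest one.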
In an ordinary backup system that backs up the entire data of a backup set in each backup session, it is easy to identify the files that can be deleted when a certain backup revision is expired. This is because each backup revision has its own storage place on the backup destination, and no backup revision depends on data backed up during another backup session. However, in the incremental backup systems of the current art, not every data set content that exists on the backup set is copied to the backup destination during each backup session, and stored data elements that were backed up during a certain backup session may be needed for restoring other backup revisions. As a result, it is not simple to locate the stored data elements that are no longer needed to sustain the non-expired backup revisions and can therefore be deleted.
When the backup system has to expire a certain backup revision that is located on the second tier storage, either because of a predetermined retention schedule or because of an explicit user request, the stored data elements that are needed exclusively by the expired backup revision should be identified as redundant stored data elements. The redundant stored data elements can then be deleted from the second tier storage to free storage space, or deleted from it and archived in another storage (third tier storage).
To solve this problem, the backup system should check, for every data set that is referenced in the expired backup revision's backup set inventory, whether its content exists in any of the full backup set inventories that belong to the other non-expired backup revisions. Only data sets whose content appears in none of those inventories can have their stored data element deleted from the second tier storage, as that element is needed exclusively by the backup revision that is being expired. This is a very heavy operation that soon becomes a serious bottleneck, limiting the backup frequency and the number of data sets that can be backed up by the backup system.
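The straightforward check described above can be sketched as follows. This is a deliberately naive illustration of the costly approach, not a proposed method; inventories are modeled as dictionaries mapping a data set address to its signature, and the names used are hypothetical.

```python
from typing import Dict, List

Inventory = Dict[str, str]  # data set address -> content signature


def redundant_signatures(expired: Inventory,
                         remaining: List[Inventory]) -> List[str]:
    """Return signatures referenced only by the expired backup revision."""
    redundant = []
    # dict.fromkeys removes duplicate signatures while keeping their order.
    for sig in dict.fromkeys(expired.values()):
        # Scan every non-expired inventory for the same content.
        needed = any(sig in inv.values() for inv in remaining)
        if not needed:
            redundant.append(sig)
    return redundant


expired_rev = {"a.txt": "s1", "b.txt": "s2", "copy_of_b.txt": "s2"}
other_revs = [
    {"a.txt": "s1", "c.txt": "s3"},
    {"c.txt": "s3"},
]
to_delete = redundant_signatures(expired_rev, other_revs)
```

Each lookup of `sig in inv.values()` is a linear scan, so the cost grows with the number of data sets in the expired revision, the number of remaining revisions, and the size of each inventory.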
To exemplify the enormity of this task, consider a medium size backup server that stores 100 backup groups, each holding 10 backup revisions, where each backup revision backs up 10,000 data sets on average. The server therefore holds 10×100=1,000 backup revisions. When a certain backup revision is to be expired, and the stored data elements that are no longer needed by any of the remaining backup revisions are to be deleted, the backup system has to check whether the content of each of the 10,000 data sets that belong to the expired backup revision exists in any of the remaining 999 backup revisions, by comparing its signature to each of the 10,000 data set signatures of each backup revision. This gives 10,000×999×10,000=99,900,000,000 operations. If the backup set inventories are sorted, a binary search reduces the number of operations to about 10,000×999×log2(10,000)≈10,000×999×13.3=132,867,000 operations, which is still an enormous load. Backup systems of the current art do not detail the method by which they discard backup revisions, and they usually suggest running a ‘clean’ cycle during non-busy hours.
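The operation counts in this example can be reproduced with a short calculation. The sketch assumes that the logarithm is taken in base 2, which is consistent with the rounded factor of 13.3 used above; the variable names are illustrative.

```python
import math

groups = 100
revisions_per_group = 10
data_sets_per_revision = 10_000

total_revisions = groups * revisions_per_group   # 1,000
remaining_revisions = total_revisions - 1        # 999

# Unsorted inventories: every signature is compared against every signature.
linear_ops = (data_sets_per_revision * remaining_revisions
              * data_sets_per_revision)

# Sorted inventories: a binary search per signature per remaining revision.
log_factor = math.log2(data_sets_per_revision)   # about 13.29
binary_ops = data_sets_per_revision * remaining_revisions * log_factor

# The figure quoted in the text uses the factor rounded to 13.3.
rounded_ops = round(data_sets_per_revision * remaining_revisions * 13.3)
```

With the unrounded logarithm the sorted case comes to roughly 132.7 million operations, marginally below the figure obtained with the rounded factor.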
References to existing patents that can further clarify the current art relevant to our invention include U.S. Publication No. US2003/0182301 A1, Sep. 25, 2003, Patterson et al., and U.S. Pat. No. 5,778,395, Jul. 7, 1998, Whiting et al.