Computer systems employ applications that update data from time to time, typically in part. That data is then typically stored, perhaps first to a repository, such as memory or disk, and subsequently to data storage media, such as removable media, examples of which comprise magnetic tape, optical disk, magnetic disk cartridges, memory cartridges, etc. The storage to a repository and to data storage media may be called backup of the data and is conducted by a backup/restore application, as is known in the art. For example, a user or group of users may wish to periodically (e.g., daily or weekly) back up the data of a particular application, or all of the data stored on their computers, to a repository as a precaution against possible crashes, corruption or accidental deletion of important data.
The partial updates to data streams may result from use of only a part of the data stream by each of various updating applications. In one example, one application or user will use and update one or more data sets or virtual volumes which comprise a portion of the data stream, while another application or user will use and update one or more data sets or virtual volumes which comprise another portion of the data stream. In such cases, often only a small part of the data sets or virtual volumes in the data stream being backed up has been updated, and therefore much of the current data can already be found in the repository with only minor changes.
A process to reduce the amount of identical data stored in the repository is called data deduplication, and various techniques are known to those of skill in the art. The net result of data deduplication is that, for portions of the data that are identical, one copy of that portion of the data is stored as a first copy, and other copies are replaced by pointers to the first copy.
Herein, each of the first copy and each of any pointers to the first copy for a given portion of data is called a “reference” for deduplication for that data. The portion of the data that is unique, without identical copies, is also called a “reference” for deduplication for that data.
A deduplicated virtual volume may thus comprise a combination of data blocks that are unique and data blocks that are either first copies or are pointers to first copies. Depending on the technique employed to create the deduplication, the data blocks may be of uniform or variable size.
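The arrangement described above, in which a first copy is stored once and later identical blocks become pointers to it, may be illustrated by the following minimal sketch. It assumes fixed-size blocks and SHA-256 fingerprints to detect identical blocks; neither choice is specified herein, and the names deduplicate, restore, store and references are hypothetical and used only for illustration.

```python
import hashlib

def deduplicate(stream: bytes, block_size: int = 4):
    """Split a stream into fixed-size blocks and keep one copy of each distinct block.

    Assumption: identical blocks are detected by comparing SHA-256 digests.
    """
    store = {}       # digest -> stored first copy (or unique block)
    references = []  # one digest per block of the stream, in stream order
    for offset in range(0, len(stream), block_size):
        block = stream[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:
            store[digest] = block   # first time this content is seen
        references.append(digest)   # acts as the pointer to the stored copy
    return store, references

def restore(store, references) -> bytes:
    """Rebuild the original stream by following each reference in order."""
    return b"".join(store[digest] for digest in references)

stream = b"AAAABBBBAAAACCCCAAAA"
store, references = deduplicate(stream)
print(len(references), "blocks in the stream,", len(store), "blocks stored")
```

In this sketch the block "AAAA" is stored once and referenced three times, while "BBBB" and "CCCC" are unique and are each stored and referenced once. A variable-size blocking technique would change only how the stream is split.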
Deduplicated data is typically stated as being “backed up” in deduplicated form to the repository, and is typically stored on hard disk drive systems, such as RAID, as is known to those of skill in the art. A RAID system employs parity systems to ensure that the data is not lost even though a substantial portion of the data may become corrupted, etc. The data may be formatted to emulate magnetic tapes or other forms of removable media, but is arranged on the hard disk drive system in such a manner that the original data may be restored quickly. The data stream comprising the virtual volumes may exist as a complete original data stream and be deduplicated as it is backed up to the repository; or may be deduplicated and stored, for example, in temporary storage in deduplicated form, and then backed up to the repository.
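The parity protection mentioned above may be illustrated, under stated assumptions, by a single XOR parity strip of the kind commonly used in parity-based RAID levels. The sketch below is not a description of any particular RAID implementation; the names xor_parity and rebuild_missing and the sample strips are hypothetical.

```python
from functools import reduce

def xor_parity(strips):
    """Return the byte-wise XOR of equally sized strips."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))

def rebuild_missing(surviving_strips, parity):
    """Recover one lost strip by XOR-ing the surviving strips with the parity strip."""
    return xor_parity(list(surviving_strips) + [parity])

strips = [b"dedu", b"plic", b"ated"]   # data strips, one per hypothetical drive
parity = xor_parity(strips)            # parity strip held on a further drive

# Simulate the loss of the second drive and rebuild its strip.
recovered = rebuild_missing([strips[0], strips[2]], parity)
print(recovered == strips[1])
```

With a single parity strip, only one lost strip per stripe can be rebuilt; tolerating more losses requires additional redundancy.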
The repository itself must be backed up from time to time in order to avoid excessive costs, and the backup is typically to actual removable media, for example, a magnetic tape library. Access to the data is typically required for restoration of the original data, and a library maintains the removable media for quick access, although less quick than that of a disk drive system. When the data is transferred to physical tape, it can be reconstructed for the transfer, in which case it expands and consumes a great deal of tape; for example, the expansion may be by a factor of 10 or 20.
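The effect of that expansion on tape consumption can be shown with a short calculation. The factors of 10 and 20 are those given above; the 2 TB deduplicated footprint and the name reconstructed_size are hypothetical and chosen only for illustration.

```python
def reconstructed_size(deduplicated_bytes: int, expansion_factor: int) -> int:
    """Size of the data once every pointer is replaced by a full copy of its block."""
    return deduplicated_bytes * expansion_factor

TB = 10**12
deduplicated = 2 * TB  # hypothetical footprint of the deduplicated repository

for factor in (10, 20):
    tape_bytes = reconstructed_size(deduplicated, factor)
    print(f"expansion factor {factor}: {tape_bytes // TB} TB of tape")
```

Under these assumptions, a repository that occupies 2 TB in deduplicated form would consume 20 TB or 40 TB of tape when reconstructed for the transfer.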