The invention disclosed herein relates generally to data storage in a computer network and more particularly to selectively copying data in a modular data and storage management system.
In the GALAXY storage management system software manufactured by CommVault Systems, Inc. of Oceanport, N.J., storage policies are utilized to direct how data is to be stored. Storage policies present the user with logical buckets for directing their data storage operations such as backup and retrieval. Each client points to a storage policy that allows the user to define how, where, and for how long data should be stored at a higher level of abstraction, without requiring intimate knowledge or understanding of the underlying storage architecture and technology. The management details of the storage operations are transparent to the user.
Storage policies are thus a logical concept associated with one or more backup data sets with each backup data set being a self-contained unit of information. Each backup data set may contain data from multiple applications and from multiple clients. Within each backup data set are one or more archives which are discrete chunks or “blobs” of data generally relating to a particular application. For example, one archive might contain log files related to a data store and another archive in the same backup data set might contain the data store itself.
Backup systems often have various levels of storage. A primary backup data set, for example, indicates the default destination of storage operations for a particular set of data that the storage policy relates to and is tied to a particular set of drives. These drives are addressed independently of the library or media agent to which they are attached. The primary backup data set might, for example, contain data that is frequently accessed for a period of one to two weeks after it is stored. A storage administrator might find storing such data on a set of drives with fast access times preferable. On the other hand, such fast drives are expensive, and once the data is no longer accessed as frequently, the storage administrator might find it desirable to move and copy this data to an auxiliary or secondary backup data set on a less expensive tape library or other device with slower access times. Once the data from the primary backup data set is moved to the auxiliary backup data set, the data can be pruned from the primary backup data set, freeing up drive space for new data.
While existing data storage systems provide a capability to copy data from the primary backup data set to auxiliary backup data sets, this copying procedure is a synchronous operation, meaning generally all data from the primary backup data set must be copied to all auxiliary backup data sets. This process is also called synchronous data replication and is inefficient in terms of data management.
A backup data set will likely contain more than one full backup of data relating to a particular application in addition to incremental or differential backups taken between full backups. For example, a storage administrator might specify for a particular backup data set of a storage policy that a full backup occur once per week with incremental backups occurring daily. If the backup data set were retained for a period of two weeks before being pruned, the backup data set would contain a first full backup of data, F1, with incremental backups I1, I2, I3, I4, I5, I6, and a second full backup F2. If each full backup required one tape and each incremental required half a tape, then 5 tapes would be required to store the data of this exemplary primary backup data set. The auxiliary backup data set would also require 5 tapes when data is transferred from primary to auxiliary backup data set.
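The tape arithmetic in this example can be checked with a short calculation. The following is only an illustrative sketch: the names `TAPES_PER_BACKUP` and `tapes_required` are hypothetical and not part of any storage management product, and the per-backup tape costs are the ones assumed in the example above.

```python
from fractions import Fraction

# Assumed tape costs from the example: one tape per full backup,
# half a tape per incremental backup.
TAPES_PER_BACKUP = {"full": Fraction(1), "incremental": Fraction(1, 2)}


def tapes_required(backups):
    """Sum the tape usage of a list of (name, kind) backups."""
    return sum((TAPES_PER_BACKUP[kind] for _, kind in backups), Fraction(0))


# Two-week contents of the primary backup data set: F1, I1..I6, F2.
primary = (
    [("F1", "full")]
    + [(f"I{i}", "incremental") for i in range(1, 7)]
    + [("F2", "full")]
)

primary_tapes = tapes_required(primary)
# Synchronous replication copies every backup, so an auxiliary
# backup data set consumes the same number of tapes.
auxiliary_tapes = tapes_required(primary)

print(primary_tapes, auxiliary_tapes)
```

Under these assumptions, each additional auxiliary backup data set adds the full tape cost of the primary backup data set again.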
Thus, even though synchronous data replication allows the flexibility to promote any auxiliary backup data set to be the primary backup data set, since the auxiliary backup data set is a full copy of the primary backup data set, tape consumption is very high. If, for some reason, data cannot be copied to one auxiliary backup data set, tapes from the primary backup data set will not be rotated. Thus, users may want to copy only particular backups as their degree of required granularity changes. One prominent scheme in the field illustrating this principle is called “Grandfather, Father, Son” (GFS), in which each of the three represents a different period of time. For example, the son may represent a weekly degree of granularity, the father may represent a monthly degree of granularity, and the grandfather may represent a yearly degree of granularity.
Many users do not wish to copy all backups from the primary backup data set to all auxiliary backup data sets. Over time, the degree of granularity that users require changes, and while recent data might need to be restored from any given point in time, less precision is generally required when restoring older data. Consider an exemplary storage scheme where full backups are taken weekly, incremental backups are taken daily, data is pruned after two weeks, full backups require one tape, and incremental backups require half a tape. A storage administrator in this example might require that data stored in the past month be able to be restored at a level of granularity of one day, meaning the data can be restored from any given day in the past month. At this degree of granularity, the incremental backups would be necessary to restore data. If the backup data set contained a first full backup of data, F1, with incremental backups I1, I2, I3, I4, I5, I6, and a second full backup F2, then F1, I1, I2, I3, I4, I5, and I6 would be required. If incremental backup I6 is performed at the same time as full backup F2, the tape containing F2 would be unnecessary, since the full backup F2 could be reproduced from F1 and the incremental backups I1-I6. On the other hand, the storage administrator in this example might only require a degree of granularity of one week for data more than one month old. In that case, the incremental backups would not be required and the full backups would suffice: only the tapes containing the full backups F1 and F2 would be required, and the three tapes containing incremental backups I1, I2, I3, I4, I5, I6 would be unnecessary.
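The effect of choosing backups by required granularity can be sketched in a few lines. This is a hypothetical illustration only: `select_for_copy`, `tapes`, and the granularity labels "daily" and "weekly" are assumed names, and the daily case conservatively keeps every backup rather than modeling the case where F2 can be reconstructed from F1 and I1-I6.

```python
from fractions import Fraction

# Assumed tape costs from the example above.
TAPE_COST = {"full": Fraction(1), "incremental": Fraction(1, 2)}


def select_for_copy(backups, granularity):
    """Pick the backups needed to restore at the given granularity."""
    if granularity == "daily":
        # Restoring to any given day needs the incrementals as well.
        return list(backups)
    if granularity == "weekly":
        # Weekly restore points need only the full backups.
        return [b for b in backups if b[1] == "full"]
    raise ValueError(f"unknown granularity: {granularity}")


def tapes(backups):
    """Total tapes consumed by a list of (name, kind) backups."""
    return sum((TAPE_COST[kind] for _, kind in backups), Fraction(0))


backups = (
    [("F1", "full")]
    + [(f"I{i}", "incremental") for i in range(1, 7)]
    + [("F2", "full")]
)

weekly = select_for_copy(backups, "weekly")
tapes_kept = tapes(weekly)
tapes_saved = tapes(backups) - tapes_kept

print([name for name, _ in weekly], tapes_kept, tapes_saved)
```

With weekly granularity, only F1 and F2 are selected, and the tapes holding the six incremental backups are not needed in the copy.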
Another example is a storage policy with three backup data sets called Wkly, Mnthly, and Yrly, each with different retention criteria. The Wkly backup data set has a retention period of 15 days, the Mnthly backup data set has a retention period of 6 months, and the Yrly backup data set has a retention period of 7 years. Backups in this example are performed every day, with a full backup to the Wkly backup data set every Friday. In addition, a full backup is done at the end of each month to the Wkly backup data set. Only the full backup at the end of the week is copied to the Mnthly backup data set, and only the end-of-month full backup is copied to the Yrly backup data set. Under the assumption that every full backup uses 1 tape and each incremental backup requires ¼ of a tape, the Wkly backup data set takes up to 6 tapes, with at most 3 full backups and 12 incremental backups. These 6 tapes are recycled continually. The Mnthly backup data set takes 26 tapes that are constantly recycled, and the Yrly backup data set takes 1 tape per month for 7 years. Thus, the Yrly backup data set requires 84 tapes in total, which are recycled over a long period of time.
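The per-set tape counts in the Wkly, Mnthly, and Yrly example can be reproduced with a short calculation. This sketch uses hypothetical variable names and assumes, as the example does, one tape per full backup and a quarter tape per incremental backup; the figure of 26 weekly full backups over the 6-month retention period is taken from the example rather than derived from a calendar.

```python
from fractions import Fraction

# Assumed tape costs from the example.
FULL = Fraction(1)
INCREMENTAL = Fraction(1, 4)

# Wkly: at most 3 full backups and 12 incremental backups are retained.
wkly_tapes = 3 * FULL + 12 * INCREMENTAL

# Mnthly: one weekly full backup kept for 6 months (about 26 weeks).
mnthly_tapes = 26 * FULL

# Yrly: one end-of-month full backup kept for 7 years.
yrly_tapes = 12 * 7 * FULL

for name, count in [("Wkly", wkly_tapes), ("Mnthly", mnthly_tapes), ("Yrly", yrly_tapes)]:
    print(f"{name}: {count} tapes")
```

Changing the retention periods or the per-backup tape costs in this sketch shows how a different GFS variant would shift tape consumption across the three backup data sets.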
Also, problems sometimes occur with bad tapes or holes in data due to hardware or software problems. In these instances, data from the primary backup data set cannot be pruned unless all data is copied to all auxiliary backup data sets, which is a highly time-intensive process and also requires a large number of tapes.
There is thus a need for a system which enables selective copying of data from the primary backup data set to auxiliary backup data sets, promotes efficient tape rotation, provides the capability to configure any variant of GFS scheme, and which further allows selective pruning of data from the primary backup data set.