1. The Field of the Invention
The present invention relates to systems and methods for transferring related data objects across a distributed data storage environment. More specifically, the present invention relates to systems and methods for transferring a group of data objects between a source storage site and a target storage site using an identifier unique to the group of data objects and tracking the group of data objects with a listing within the target storage site so that only complete groups are transferred and retained at the target.
2. The Relevant Art
In a data processing system, a backup/restore subsystem, usually referred to as a backup subsystem, is typically used to save a recent copy or version of a file, plus some number of earlier versions of the same file, on some form of backup storage device such as a magnetic disk drive, tape, or optical storage device. The backup subsystem is used as a means of protecting against loss of data in a given data processing system. For example, if an on-line version of a file is destroyed or corrupted because of power failure, hardware or software error, user error, or some other type of problem, the latest version of that file that is stored in the backup subsystem can be restored, and the risk of loss of data is therefore minimized. Backup subsystems are also useful when no failure occurs but files or data are deleted or changed (either accidentally or intentionally): those files or data can be restored to their earlier state, thus minimizing the loss of data.
A closely related concept to the backup subsystem is an archive/retrieve system, usually referred to as an archive subsystem. Archiving refers to making copies of files on lower cost storage such as tape so that the files can be deleted from more expensive technology such as disk storage. Since disk storage is frequently being updated, an archival copy also helps in preserving the state of a collection of data at a particular point in time. Although the improved method of transferring grouped data objects disclosed in this application is primarily described for a backup system, it will be obvious to the person of ordinary skill in the art of data processing that the systems and methods described herein are also applicable to archive systems or other related data storage and storage management systems.
At the present time, the majority of backup systems run on host systems located in a data processing environment. Typically, a new version (also referred to as a changed version) of a file is backed up on a predetermined schedule, such as at the end of each day, or each time the file has been updated and saved.
Recently, the emergence of low cost local area networking, personal computer, and workstation technology has promoted a new type of data processing architecture known as the “client-server” system or environment. A client-server system 10, as shown in FIG. 1, typically consists of a plurality of client computers (also referred to as clients) 11, such as personal computers or workstations. The client computers 11 are preferably provided with a local storage medium 12 such as a disk storage device. The client computers 11 communicate over a network 13, such as an Ethernet or a Token Ring, which links the clients 11 to one or more network server computers 14.
The server computer 14 is generally a mainframe computer, a workstation, or other high end computer and is typically provided with one or more local storage mediums 15 such as a disk storage device, a tape storage device, and/or an optical storage device. The server computer 14 usually contains various programs or data which are shared by or otherwise accessible to the clients 11. Such a client-server system communicating over a network is often referred to as a “distributed” system.
The distributed client-server environment presents a number of major issues related to data processing, integrity, and backup of such data. One major concern in the client-server environment is that a substantial amount of critical data may be located on client subsystems which lack the security, reliability or care of administration that is typically applied to server computers. A further concern is that data may accidentally be lost from a client computer, as users of such computers often do not take the time and care necessary to back up the data on a regular basis. Another concern is that backing up large amounts of data from a client can require large amounts of network bandwidth and server storage space.
Recently a number of client-server backup systems have been developed to alleviate some of the concerns listed above. An example is IBM's Tivoli Storage Manager (TSM), formerly known as ADSM (ADSTAR Distributed Storage Manager). This technology overcomes some of the deficiencies mentioned above by making backup copies of the client data on a backup server. The client copies are made automatically without user involvement and are stored on storage devices which are administered by the backup server.
A client-server backup system such as TSM typically operates with a client application running in the client computer 11 and a server application running in the server computer 14. The client application, also known as a client backup program, is activated at pre-specified or periodic times and makes contact with the server application, also referred to as a server backup program. After establishing contact and performing authentication, the client application consults a user-configurable policy which instructs the client application regarding what sort of backup operation should occur and which files on the client computer will be the subjects of the current backup. The client application then searches all or a subset of the files on the client computer to determine which files should be backed up.
For example, a data file which has changed since the last backup was conducted may be selected for the backup operation. After selecting the files to be backed up, the client application transmits those files across the network to the server application. The server application then makes an entry in a listing such as a backup catalog for each file received and stores those files on a storage device attached to the backup server.
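The selection and cataloging steps described above can be sketched as follows. This is a minimal illustration only, not the implementation of TSM or of any particular product; the names `ClientFile`, `BackupServer`, `select_changed`, and the use of a modification timestamp as the selection criterion are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ClientFile:
    path: str
    modified_at: float  # hypothetical modification timestamp, in seconds


@dataclass
class BackupServer:
    # Hypothetical backup catalog: path -> timestamps of the versions received.
    catalog: dict = field(default_factory=dict)

    def receive(self, f: ClientFile) -> None:
        # Make a catalog entry for each file received from the client.
        self.catalog.setdefault(f.path, []).append(f.modified_at)


def select_changed(files, last_backup_at):
    """Select only the files modified since the last backup."""
    return [f for f in files if f.modified_at > last_backup_at]


files = [
    ClientFile("a.txt", 100.0),
    ClientFile("b.txt", 250.0),
    ClientFile("c.txt", 300.0),
]
server = BackupServer()
for f in select_changed(files, last_backup_at=200.0):
    server.receive(f)
```

In this sketch only `b.txt` and `c.txt` are transmitted and cataloged, since `a.txt` has not changed since the assumed previous backup at time 200.0.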
The backup may be conducted as an incremental backup and may utilize differencing. In systems using incremental backup, backups are performed only for those files which have been modified since the previous incremental or full backup. Differencing relies on comparisons between two versions of the same file, where multiple versions are saved as a “base file,” together with some number of “sub-files” which represent only the changes to the base file. These small files, also referred to as “delta files” or “difference files,” contain only the changed portions, typically bytes or blocks which have changed from the base file. Delta files are generated as a result of comparing the current version of a file with an earlier version of the same file, referred to as the base file. Differencing thus exploits redundancy between file versions, in order to achieve reductions in storage space and network traffic.
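By way of illustration, block-level differencing of the kind described above might be sketched as follows. The fixed block size, the delta layout (a length plus a map of changed blocks), and the function names `make_delta` and `apply_delta` are assumptions chosen for the example; actual differencing schemes vary.

```python
BLOCK = 4  # hypothetical block size in bytes, kept tiny for illustration


def make_delta(base: bytes, current: bytes, block: int = BLOCK):
    """Return (length of current, {block index: bytes}) for blocks differing from base."""
    changed = {}
    for i in range(0, len(current), block):
        chunk = current[i:i + block]
        if base[i:i + block] != chunk:
            changed[i // block] = chunk
    return len(current), changed


def apply_delta(base: bytes, delta, block: int = BLOCK) -> bytes:
    """Rebuild the current version from the base file and its delta file."""
    length, changed = delta
    out = bytearray()
    for i in range(0, length, block):
        # Use the changed block if one was recorded, else the base block.
        out += changed.get(i // block, base[i:i + block])
    return bytes(out[:length])


base = b"AAAABBBBCCCC"
current = b"AAAAXXXXCCCCDD"
delta = make_delta(base, current)
restored = apply_delta(base, delta)
```

Here the delta records only the second block and the appended partial block, so restoring the current version requires both the base file and the delta file to be available together.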
In order to manage data storage efficiently, the backup system may store data in storage devices organized in a storage hierarchy. A storage hierarchy provides a number of levels of storage devices, with data storage in devices at the top levels being more expensive but having shorter access times. Moving down the hierarchy, data storage becomes less expensive, but the access times are longer. Accordingly, frequently accessed data is stored at the higher levels, while the lower levels are more suitable for long-term data storage. Within the levels of the hierarchy, data is stored in storage pools. A storage pool is a collection of storage volumes with similar geometries that are capable of being used on a particular device. Examples of media stored in pools include tape, optical disks, magnetic disks, and other media having the same format.
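A storage hierarchy of this kind can be illustrated with the following simplified sketch. The `StoragePool` class, the capacity measured as an object count, and the policy of demoting the least recently accessed object are all assumptions made for the example and are not drawn from any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class StoragePool:
    name: str
    capacity: int  # maximum number of objects (a simplification)
    objects: dict = field(default_factory=dict)  # object name -> last access time


def store(hierarchy, name, accessed_at):
    """Place an object in the top pool; on overflow, demote the least
    recently accessed object to the next level down."""
    carry = (name, accessed_at)
    for pool in hierarchy:
        pool.objects[carry[0]] = carry[1]
        if len(pool.objects) <= pool.capacity:
            return
        oldest = min(pool.objects, key=pool.objects.get)
        carry = (oldest, pool.objects.pop(oldest))
    raise RuntimeError("storage hierarchy is full")


disk = StoragePool("disk", capacity=2)   # faster, more expensive level
tape = StoragePool("tape", capacity=10)  # slower, less expensive level
hierarchy = [disk, tape]
for obj, t in [("a", 1), ("b", 2), ("c", 3)]:
    store(hierarchy, obj, t)
```

After the three calls, the two most recently accessed objects remain in the disk pool and the oldest object has been moved to the tape pool.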
The backup system also carries out several other important operations. For instance, backup copies of files that were made previously may be moved from disk storage to tape storage in order to reduce storage costs. Another important function of the client-server backup system occurs when the user requests the restoration of a file. The client application contacts the server application, which consults its backup catalog to establish the location of the backup copy of the file. The server then returns that file across the network to the client computer which in turn makes it available to the user.
Examples of hardware which may be employed in a backup system in a distributed client-server environment include one or more server computers such as mainframes, workstations, and other high end computers and storage mediums such as the IBM 3390 magnetic storage system, IBM 3494 tape storage library or IBM 3595 optical library. Optical and tape storage libraries typically provide automated mechanical mounting and demounting of tape or optical cartridges into read/write drives. When several such devices are present, the server application is often configured to utilize the devices in a storage hierarchy in which the backup files most likely to be accessed are kept on faster access devices such as local non-volatile memory, and files less likely to be accessed are kept on less expensive, but slower access devices, such as tape or optical disks.
One challenge in such distributed backup systems is that a backup server may become outdated or of insufficient capacity. For this and other reasons, it may become necessary to transfer the files and other data objects stored on the server to another server. This transfer of data objects presents challenges due to the nature of the storage hierarchy on which the data objects are stored. Data objects may be distributed across many different volumes and different media types of the storage hierarchy. Data transfer is further complicated by the fact that some of the data objects may be related in a group, yet may not be contiguously stored on the storage media. For instance, in differencing backup systems, groups of files comprising a base file and the sub-files reflecting modifications to the base file must be available for restoration to a client. Any transfer of these grouped files must track the relationship of these files.
Transferring groups of related data objects between distributed backup systems therefore presents a dilemma. Transferring the files in a grouped order would consume an inordinate amount of time, because doing so would require excessive mounting and positioning of storage pool volumes. Yet transferring the files in the order in which they are stored (which may be based on the order in which they were received) makes it difficult to track the relationship between the files within the groups. Accordingly, a need exists in the art for a system and method capable of efficiently transferring groups of related files in a distributed data storage system while tracking the relationship of the grouped files so that the grouped files can be associated on the target storage site.
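The dilemma can be made concrete with a small sketch that counts volume mounts under each transfer order. The object names, the volume layout, the single-drive assumption, and the `count_mounts` function are hypothetical and serve only to illustrate the trade-off; tagging each object with a group identifier is shown as one possible way of preserving the relationship when objects are sent in storage order.

```python
def count_mounts(transfer_order, volume_of):
    """Count volume mounts, assuming a single drive: a mount occurs
    whenever the next object resides on a different volume."""
    mounts, current = 0, None
    for obj in transfer_order:
        if volume_of[obj] != current:
            current = volume_of[obj]
            mounts += 1
    return mounts


# Hypothetical layout: each group's base file and delta files are
# scattered across three volumes.
volume_of = {
    "base1": "V1", "delta1a": "V2", "delta1b": "V3",
    "base2": "V1", "delta2a": "V2", "delta2b": "V3",
}
groups = {
    "g1": ["base1", "delta1a", "delta1b"],
    "g2": ["base2", "delta2a", "delta2b"],
}

grouped_order = [obj for members in groups.values() for obj in members]
storage_order = sorted(volume_of, key=volume_of.get)  # volume by volume

# Sending in storage order loses the grouping unless each object carries
# a group identifier that the receiving site can use to reassemble groups.
group_of = {obj: gid for gid, members in groups.items() for obj in members}
received = {}
for obj in storage_order:
    received.setdefault(group_of[obj], []).append(obj)
```

Under these assumptions the grouped order requires six mounts while the storage order requires three, and the group identifiers allow the receiving side to rebuild both groups despite the interleaved arrival.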