1. The Field of the Invention
The present invention relates to systems and methods for relating files in a distributed data storage environment. More specifically, the present invention relates to systems and methods for relating groups of files transmitted to a remote storage site using an identifier unique to each group.
2. The Relevant Art
In a data processing system, a backup/restore subsystem, usually referred to as a backup subsystem, is typically used as a means to save a recent copy or version of a file, plus some number of earlier versions of the same file, on some form of backup storage devices such as magnetic disk drives, tapes, or optical storage devices. The backup subsystem is used as a means of protecting against loss of data in a given data processing system. For example, if an on-line version of a file is destroyed or corrupted because of power failure, hardware or software error, user error, or some other type of problem, the latest version of that file which is stored in a backup subsystem can be restored and therefore the risk of loss of data is minimized. Another important use of backup subsystems is that even if failures do not occur, but files or data are deleted or changed (either accidentally or intentionally), those files or data can be restored to their earlier state thus minimizing the loss of data.
A closely related concept to the backup subsystem is an archive/retrieve system, usually referred to as an archive subsystem. Archiving refers to making copies of files on lower cost storage such as tape so that the files can be deleted from more expensive technology such as disk storage. Since disk storage is frequently being updated, an archival copy also helps in preserving the state of a collection of data at a particular point in time.
Although the improved method of carrying out the backup disclosed in this application is primarily described for a backup system, it will be obvious to the person of ordinary skill in the art of data processing that the systems and methods described herein are also applicable to archive systems or other related data storage and storage management systems.
At the present time, the majority of backup systems run on host systems located in a data processing environment. Typically, a new version (also referred to as a changed version) of a file is backed up based on a predetermined schedule, such as at the end of each day, or each time that a file has been updated and saved.
Backup systems generally consume large amounts of storage media, because multiple versions of large amounts of data are backed up on a regular basis. The transmission of the large amounts of data that prior art backup systems necessarily store also consumes large amounts of network bandwidth. Therefore, those engaged in the field of data processing, and especially in the field of backup/restore systems, are continuously striving to find improved methods and systems to reduce the storage demand in backup systems. Previously, a full backup was conducted for each file in a system. More recently, an incremental backup method has been employed to enable the storage and retrieval of multiple versions of a given file while consuming less storage space.
The full backup method is the most basic method used and requires the backup of an entire collection of files, or a file system, regardless of whether individual files in that collection have been updated or not. Furthermore, in the full backup method, multiple full versions of each file are maintained on a storage device. Since maintaining multiple full copies of many files consumes a substantial amount of storage, compression techniques are sometimes used to reduce the amount of data stored. Compression techniques basically rely on the presence of redundancy within the file, so-called intra-file redundancy, in order to achieve this reduction. The most common method of file compression is the Lempel-Ziv method (also known as Adaptive Dictionary Encoder or LZ coding) described in a book by T. C. Bell et al., titled Text Compression, pp. 206-235. The essence of Lempel-Ziv coding is that redundant phrases are replaced with an alias, thereby saving the storage space associated with multiple occurrences of any given phrase. This is a general method which can be applied to any file and typically results in compression ratios on the order of between 2 and 3.
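By way of non-limiting illustration, the dependence of compression on intra-file redundancy may be sketched as follows. The sketch assumes Python's standard zlib module, whose DEFLATE format combines an LZ77 (Lempel-Ziv) stage with Huffman coding; it stands in for the LZ coding referred to above and is not the specific encoder of the cited text. The sample data and the helper name compression_ratio are hypothetical.

```python
import os
import zlib

def compression_ratio(data: bytes) -> float:
    """Original size divided by compressed size (larger means better)."""
    return len(data) / len(zlib.compress(data))

# A file with heavy intra-file redundancy: one phrase repeated many times.
redundant = b"backup the file; " * 200

# Random bytes of the same length contain essentially no repeated phrases.
random_like = os.urandom(len(redundant))

redundant_ratio = compression_ratio(redundant)
random_ratio = compression_ratio(random_like)

# Compression is lossless: decompressing returns the original bytes.
round_trip = zlib.decompress(zlib.compress(redundant))
```

The repeated phrase compresses to a small fraction of its size, while the random data does not shrink at all, which is why compression alone offers limited relief for files lacking internal redundancy.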
Incremental backup is an alternative to full backup. In systems using incremental backup, backups are performed only for those files which have been modified since the previous incremental or full backup.
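By way of non-limiting illustration, the selection step of an incremental backup may be sketched as follows. The sketch assumes that "modified since the previous backup" is judged by file modification time, which is one common criterion and not necessarily the one used by any particular backup system; the function names and the sample files are hypothetical.

```python
import os
import tempfile
from pathlib import Path
from typing import List

def select_for_incremental(root: Path, last_backup_time: float) -> List[Path]:
    """Return files under `root` modified after `last_backup_time` (epoch seconds)."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.stat().st_mtime > last_backup_time
    )

def demo() -> List[str]:
    """Create one stale and one freshly modified file, then select candidates."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        old, new = root / "old.txt", root / "new.txt"
        old.write_text("unchanged since the last backup")
        new.write_text("edited after the last backup")
        last_backup = 1_000_000_000.0
        # Force the modification times to fall on either side of the backup.
        os.utime(old, (last_backup - 60, last_backup - 60))
        os.utime(new, (last_backup + 60, last_backup + 60))
        return [p.name for p in select_for_incremental(root, last_backup)]
```

Only the file whose modification time falls after the previous backup is selected; the unchanged file is skipped entirely.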
In any given backup system, the higher the backup frequency, the more accurately the backup copy will represent the present state of data within a file. Considering the large volume of data maintained and continuously generated in a typical data processing system, the amount of storage, time, and other resources associated with backing up data are very substantial. Thus, those skilled in the art are continuously engaged in searching for better alternatives and more storage and time efficient systems and methods for backing up data.
Aside from the compression technique which is heavily utilized to reduce storage requirement in a backup system, there exists a quite different method of achieving reduction in backup file size. This method is known as delta versioning or “differencing.”
Differencing relies on comparisons between two versions of the same file, where multiple versions are saved as a “base file,” together with some number of “sub-files” which represent only the changes to the base file. These small files, also referred to as “delta files” or “difference files,” contain only the changed portions, typically bytes or blocks which have changed from the base file. Delta files are generated as a result of comparing the current version of a file with an earlier version of the same file, referred to as the base file. Differencing thus exploits redundancy between file versions, in order to achieve reductions in storage space and network traffic.
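By way of non-limiting illustration, block-level differencing may be sketched as follows. The fixed block size, the dictionary layout of the delta, and the function names make_delta and apply_delta are assumptions made for the example; practical differencing formats differ.

```python
BLOCK = 4  # hypothetical block size in bytes, kept tiny for readability

def make_delta(base: bytes, current: bytes, block: int = BLOCK) -> dict:
    """Record only the blocks of `current` that differ from `base`."""
    changed = {}
    for offset in range(0, len(current), block):
        chunk = current[offset:offset + block]
        if base[offset:offset + block] != chunk:
            changed[offset] = chunk
    return {"length": len(current), "blocks": changed}

def apply_delta(base: bytes, delta: dict) -> bytes:
    """Rebuild the current version from the base file and its delta."""
    length = delta["length"]
    # Start from the base, truncated or zero-padded to the new length.
    out = bytearray(base[:length].ljust(length, b"\0"))
    for offset, chunk in delta["blocks"].items():
        out[offset:offset + len(chunk)] = chunk
    return bytes(out)

base_file = b"AAAABBBBCCCCDDDD"
current_file = b"AAAAXBBBCCCCDDDDEE"
delta_file = make_delta(base_file, current_file)
rebuilt = apply_delta(base_file, delta_file)
```

Here only two blocks (the one containing the changed byte and the appended tail) are recorded in the delta, yet the current version can be rebuilt exactly from the base. A fixed-offset comparison of this kind handles in-place changes well but not insertions that shift later data, which is one reason practical schemes are more elaborate.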
Substantial storage savings in backup systems may result from the adoption of differencing techniques, since frequently the selection of a file for incremental backup occurs after a small change has been made to that file. Therefore, since many versions of a file that differ only slightly from one another may be backed up, differencing offers great potential for substantial reductions in the amount of data that must be transferred to and stored in the backup server.
Recently, the emergence of low cost local area networking, personal computer, and workstation technology has promoted a new type of data processing architecture known as the “client-server” system or environment. A client-server system 10, as shown in FIG. 1, typically consists of a plurality of client computers (also referred to as clients) 11, such as personal computers or workstations. The client computers 11 are preferably provided with a local storage medium 12 such as a disk storage device. The client computers 11 communicate over a network 13, such as an Ethernet or a Token Ring, which links the clients 11 to one or more network server computers 14.
The server computer 14 is generally a mainframe computer, a workstation, or other high end computer and is typically provided with one or more local storage mediums 15 such as a disk storage device, a tape storage device, and/or an optical storage device. The server computer 14 usually contains various programs or data which are shared by or otherwise accessible to the clients 11. Such a client-server system communicating over a network is often referred to as a “distributed” system or network.
The distributed client-server environment presents a number of major issues related to data processing, integrity, and backup of such data. One major concern in the client-server environment is that a substantial amount of critical data may be located on client subsystems which lack the security, reliability or care of administration that is typically applied to server computers. A further concern is that data may accidentally be lost from a client computer, as users of such computers often do not take the time and care necessary to back up the data on a regular basis. Another concern is that backing up large amounts of data from a client can require large amounts of network bandwidth and server storage space.
Recently a number of client-server backup systems have been developed to alleviate some of the concerns listed above. An example is IBM's Tivoli Storage Manager (TSM), formerly known as ADSM (ADSTAR Distributed Storage Manager). This technology overcomes some of the deficiencies mentioned above by making backup copies of the client data on a backup server. The client copies are made automatically without user involvement and are stored on storage devices which are administered by the backup server.
A client-server backup system such as TSM typically operates with a client application operating in the client computer 11 and a server application operating in the server computer 14. The client application, also known as a client backup program, is activated at pre-specified or periodic times and makes contact with the server application, also referred to as a server backup program. After establishing contact and performing authentication, the client application consults a user-configurable policy which instructs the client application regarding which sort of backup operation should occur and which files on the client computer will be the subjects of the current backup. It then searches all or a subset of files on the client computer, determining which files should be backed up.
For example, a data file which has changed since the last backup was conducted may be selected for the backup operation. After selecting the files to be backed up, the client application transmits those files across the network to the server application. The server application then makes an entry in a listing such as a backup catalog for each file received and stores those files on storage devices attached to the backup server.
In order to manage data storage efficiently, the backup system may store data in storage devices organized in a storage hierarchy. A storage hierarchy provides a number of levels of storage devices, with data storage in devices at the top levels being more expensive but having shorter access times. Moving down the hierarchy, data storage becomes less expensive, but the access times are longer. Accordingly, frequently accessed data is stored at the higher levels, while the lower levels are more suitable for long-term data storage. Among the levels of the hierarchy, data is stored in storage pools. A storage pool is a collection of storage volumes with similar geometries. Pools are collections of volumes capable of being used on a particular device. Examples of media stored in pools include tape, optical disks, magnetic disks, and other media having the same format.
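By way of non-limiting illustration, a storage hierarchy of pools may be modeled as sketched below. The pool names, the idle-time thresholds, and the rule that files migrate downward once they have been idle too long are all assumptions chosen for the example; they are not taken from any particular storage manager.

```python
from dataclasses import dataclass, field

@dataclass
class StoragePool:
    name: str             # e.g. "disk", "optical", "tape"
    max_idle_days: float  # hypothetical threshold: idle longer means move down
    files: dict = field(default_factory=dict)  # file name -> days since last access

def build_hierarchy() -> list:
    """Pools ordered from the top (fast, expensive) to the bottom (slow, cheap)."""
    return [
        StoragePool("disk", 30),
        StoragePool("optical", 365),
        StoragePool("tape", float("inf")),
    ]

def migrate(hierarchy: list) -> None:
    """Move files down the hierarchy until their idle time fits their pool."""
    for upper, lower in zip(hierarchy, hierarchy[1:]):
        for name, idle_days in list(upper.files.items()):
            if idle_days > upper.max_idle_days:
                lower.files[name] = upper.files.pop(name)

hierarchy = build_hierarchy()
hierarchy[0].files = {"recent.doc": 1, "quarterly.xls": 90, "archive.dat": 400}
migrate(hierarchy)
```

After migration the recently used file remains on disk, the moderately idle file rests in the optical pool, and the long-idle file has passed through to tape.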
The backup system also carries out several other important operations. For instance, backup copies of files that were made many months ago may be moved from disk storage to tape storage in order to reduce storage costs. Another important function of the client-server backup system occurs when the user requests the restoration of a file. The client application contacts the server application, which consults its backup catalog to establish the location of the backup copy of the file. It then returns that file across the network to the client computer which in turn makes it available to the user.
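By way of non-limiting illustration, the role of the backup catalog in storing and restoring files may be sketched as follows. The class name BackupServer, the catalog keyed by client and path, and the location string format are hypothetical and serve only to show the lookup.

```python
class BackupServer:
    """Toy server application: a catalog that maps files to storage locations."""

    def __init__(self):
        self.catalog = {}   # (client, path) -> list of locations, oldest first
        self.storage = {}   # location -> stored bytes
        self._next_id = 0

    def store(self, client: str, path: str, data: bytes) -> str:
        """Store a backup copy and make a catalog entry for it."""
        location = f"vol0/obj{self._next_id}"  # hypothetical location format
        self._next_id += 1
        self.storage[location] = data
        self.catalog.setdefault((client, path), []).append(location)
        return location

    def restore(self, client: str, path: str, version: int = -1) -> bytes:
        """Consult the catalog for the copy's location and return the copy."""
        locations = self.catalog.get((client, path))
        if not locations:
            raise KeyError(f"no backup copy of {path!r} for client {client!r}")
        return self.storage[locations[version]]

server = BackupServer()
server.store("client11", "/home/report.txt", b"first draft")
server.store("client11", "/home/report.txt", b"final draft")
latest = server.restore("client11", "/home/report.txt")
earliest = server.restore("client11", "/home/report.txt", version=0)
```

By default the most recent version is returned, while an earlier version remains retrievable through the same catalog entry.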
Examples of hardware which may be employed in a backup system in a distributed client-server environment include one or more server computers such as mainframes, workstations, and other high end computers and storage mediums such as the IBM 3390 magnetic storage system, IBM 3494 tape storage library or IBM 3595 optical library. Optical and tape storage libraries typically provide automated mechanical mounting and demounting of tape or optical cartridges into read/write drives. When several such devices are present, the server application is often configured to utilize the devices in a storage hierarchy in which the most likely to be accessed backup files are kept on faster access devices such as local non-volatile memory, and files less likely to be accessed are kept on less expensive, but slower access devices, such as tape or optical disks.
Despite the recent improvements made in the field of distributed client-server backup systems, certain shortcomings remain in currently available systems. Primary among these shortcomings is that the very large amounts of data on the clients now being regularly backed up tend to require large amounts of network bandwidth and to require high quantities of server storage space, which can be quite costly. Although storage management systems such as TSM may compress this data on the storage devices, the amount of data remains very large. Differencing is thought to be a solution to this problem, but differencing poses certain problems in itself.
For instance, in a differencing backup system, once a base file is stored in the storage devices, the base file may not be available for immediate inspection. Often, the backup server is configured with a plurality of storage devices, such as optical devices, tape backups, and non-volatile memory (such as hard disk drives) organized in the above-described storage hierarchy. Within the storage hierarchy, the particular optical disks or tapes are frequently swapped out, and the only copy of a base file may be on a disk or tape that is not currently mounted. In addition, even when the base files are immediately available on such devices, accessing the base files and scanning the devices for the base files is a relatively slow process.
Current backup systems using the differencing method of backup typically store information about the files previously backed up to the server. This information helps determine the current state of backed up files and whether these files are still available. Nevertheless, for one reason or another, the versions of the backed up files may have changed between the client and the server. For instance, either the client's record of the files or the server's version of the files may have been deleted or inadvertently altered.
Accordingly, when a sub-file is transferred to the server, a reliable method is necessary to identify or “relate” the sub-file with the base file from which it was derived in order to later be able to combine the sub-file with its base file during a restore operation. If a sub-file is not restored with the correct corresponding base file, it is not possible to correctly reconstruct the original file, and a data integrity error occurs.
Certain additional challenges exist in relating sub-files to base files in a distributed environment. These stem from the fact that the elapsed time between backups of the base file and a dependent sub-file could be highly variable. Additionally, the client's record of base file information could be invalid. For instance, the sub-file backup algorithm may have been disabled either on the client or on the server. Additionally, a client may back up data to multiple servers, causing the client's knowledge of the base file to be invalid relative to one or more of the different servers. Furthermore, the server database may have been regressed to an earlier point in time in the interim between storing the base file and generating a sub-file. This might occur, for instance, as a result of the database becoming corrupted and being restored from an older version. Accordingly, as discussed, the base files the server knows about may not match those which the client has tracked.
It is apparent now that implementation of an efficient backup subsystem in a computer processing environment is a formidable task and implementing such a system in a distributed client-server environment poses significant challenges. Therefore, there is a need for an improved backup system and method in a client-server environment that not only substantially reduces the storage and network bandwidth requirements of current backup systems, but also minimizes the burden in communicating the relationships between groups of files, such as base files and their sub-files, between a client and a server. The present invention addresses these deficiencies currently present in prior art client-server backup systems by providing alternative methods and systems which can be used to reduce the amount of data storage and network bandwidth required in a client-server backup system while maintaining the integrity of the system through positive identification of the relationships between groups of files transmitted between the client and server.
The data storage management system and method of the present invention has been developed in response to the present state of the art, and in particular, in response to the problems and needs in the art that have not yet been fully solved by currently available storage management systems. Accordingly, it is an overall object of the present invention to provide a data storage management system and method that overcomes many or all of the above-discussed shortcomings in the art.
To achieve the foregoing object, and in accordance with the invention as embodied and broadly described herein in the preferred embodiment, an improved storage management system and method is provided. The data storage management system is preferably adapted to relate groups of files in a distributed data storage management system having a primary storage site such as a client computer and a remote storage site such as a server computer.
In one embodiment, the data storage management system comprises a primary storage site; a remote storage site communicating over a network with the primary storage site; a token generation module located within the primary storage site and configured to generate tokens uniquely identifying groups of files of the primary storage site; and a token listing readily available within the remote storage site and a token comparison module located within the remote storage site. The token comparison module is preferably configured to receive tokens passed together with a file from the primary storage site to the remote storage site and compare the tokens to one or more tokens within the token listing to establish a relationship of the file with other files previously transmitted from the primary storage site to the remote storage site.
The system may also comprise a plurality of base files resident within the storage devices of the remote storage site and a unique token for each of the base files stored within the token listing. A plurality of tokens may be stored within the token listing, and each of the plurality of tokens preferably uniquely identifies a base file resident within the storage devices of the remote storage site.
A backup determination module is preferably resident within the primary storage site and is preferably configured to select files for storage on the remote storage site, and determine whether the files should be stored as base files or sub-files. If the files are to be stored as sub-files, a sub-file generation module generates the sub-files by comparing the current file with a previously backed up base file. Thus, a plurality of sub-files is also preferably stored within the storage devices of the remote storage site, and each of the plurality of sub-files is preferably cross-linked with a base file resident within the storage devices.
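By way of non-limiting illustration, the decision between storing a file as a base file or as a sub-file may be sketched as follows. The criteria shown (no known base file, or too large a fraction of the file changed) and the 0.5 threshold are assumptions, since the text above does not specify them; the function name and the use of full base contents in the repository are likewise hypothetical simplifications.

```python
def decide_backup_kind(path: str, current: bytes, repository: dict,
                       max_changed_fraction: float = 0.5) -> str:
    """Return "base" or "sub-file" for the file at `path`.

    `repository` maps a path to the contents of its previously backed up
    base file (an assumption; a real repository may hold only a
    representation of the base file).
    """
    base = repository.get(path)
    if base is None:
        return "base"  # nothing to difference against
    # Count byte positions that differ, plus any change in length.
    changed = sum(a != b for a, b in zip(base, current))
    changed += abs(len(base) - len(current))
    if changed / max(len(current), 1) > max_changed_fraction:
        return "base"  # the delta would be nearly as large as the file
    return "sub-file"

repository = {"/home/notes.txt": b"hello world"}
first_backup = decide_backup_kind("/home/new.txt", b"hello", repository)
small_change = decide_backup_kind("/home/notes.txt", b"hello World", repository)
large_change = decide_backup_kind("/home/notes.txt", b"XXXXXXXXXXX", repository)
```

A file never backed up before, or one that has changed almost entirely, is sent as a new base file, while a lightly edited file is sent as a sub-file.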
The system also preferably comprises a repository located within the primary storage site. The repository preferably contains a representation of each of a plurality of base files stored on the remote storage site and also preferably stores a token unique to each of the base files together with the base files.
The token generation module is preferably configured to generate tokens at least partially indicative of the contents of the base files and may be configured to generate tokens comprising two components, a file identifier comprising attributes of a base file and an identification key derived from the contents of a base file.
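By way of non-limiting illustration, a two-component token may be sketched as follows. The choice of path, size, and modification time as the file attributes, and the use of a SHA-256 digest as the identification key, are assumptions made for the example; the text above requires only that the key be derived from the contents of the base file.

```python
import hashlib
from typing import NamedTuple

class Token(NamedTuple):
    file_id: str  # file identifier built from attributes of the base file
    id_key: str   # identification key derived from the file contents

def generate_token(path: str, size: int, mtime: int, contents: bytes) -> Token:
    """Build a token from file attributes and a digest of the contents."""
    file_id = f"{path}|{size}|{mtime}"                 # hypothetical attribute set
    id_key = hashlib.sha256(contents).hexdigest()      # assumed key derivation
    return Token(file_id, id_key)

token_a = generate_token("/home/notes.txt", 3, 100, b"abc")
token_b = generate_token("/home/notes.txt", 3, 100, b"abc")
token_c = generate_token("/home/notes.txt", 3, 100, b"abd")
```

Identical attributes and contents yield identical tokens, whereas two files that share every attribute but differ in a single byte receive different identification keys, so the token distinguishes them.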
Accompanying the data storage management system of the present invention may be a method for relating groups of files in a distributed data storage management system. In one embodiment, the method comprises a step of assigning a token to a base file of the primary storage site. The token uniquely identifies the base file and may be comprised of two components, a file identifier comprising attributes of the base file and an identification key derived from the contents of the base file.
In further steps, a copy of the base file is preferably passed from the primary storage site to the remote storage site, where the base file is preferably stored on a storage medium of the remote storage site. A copy of the token assigned to the base file is preferably passed together with the base file from the primary storage site to the remote storage site. The token is preferably stored in a token listing of the remote storage site.
A sub-file is preferably derived from the base file and the current file. A second token copied from or based upon the token of the base file is preferably associated with the sub-file and passed together with the sub-file to the remote storage site.
The remote storage site relates the sub-file to the base file by comparing the second token to the token listing and matching the token of the base file. Thereafter, a cross-linking between the sub-file and the base file is preferably generated, and the sub-file is stored in the storage hierarchy. Consequently, in response to a restore request from the primary storage site, the sub-file and the base file can be returned together to the primary storage site from the remote storage site.
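By way of non-limiting illustration, the relating of a sub-file to its base file at the remote storage site may be sketched as follows. The class name RemoteSite, the in-memory dictionaries, and the choice to reject a sub-file whose token has no match in the token listing are assumptions for the example; the token is treated as any hashable value.

```python
class RemoteSite:
    """Toy remote storage site holding a token listing and cross-links."""

    def __init__(self):
        self.token_listing = {}  # token -> stored base file
        self.cross_links = {}    # token -> sub-files derived from that base

    def store_base(self, token, base_file: bytes) -> None:
        """Store a base file and record its token in the token listing."""
        self.token_listing[token] = base_file
        self.cross_links[token] = []

    def store_sub_file(self, second_token, sub_file: bytes) -> bool:
        """Relate a sub-file to its base by matching tokens.

        Returns False when no base file with a matching token is known,
        for example after the server database has been regressed.
        """
        if second_token not in self.token_listing:
            return False
        self.cross_links[second_token].append(sub_file)
        return True

    def restore(self, token):
        """Return the base file together with its most recent sub-file."""
        sub_files = self.cross_links[token]
        return self.token_listing[token], (sub_files[-1] if sub_files else None)

site = RemoteSite()
base_token = ("/home/notes.txt|11|100", "key-1")   # hypothetical token value
site.store_base(base_token, b"hello world")
accepted = site.store_sub_file(base_token, b"delta-1")
rejected = site.store_sub_file(("/home/notes.txt|11|100", "key-2"), b"delta-x")
restored = site.restore(base_token)
```

A sub-file carrying a token that matches the listing is cross-linked to its base file and returned with it on restore; a sub-file whose token finds no match is refused, so that it is never combined with the wrong base file.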