In many computing environments, large amounts of data are written to and retrieved from storage devices connected to one or more computers. For example, many large enterprises maintain local area networks (“LANs”) comprising multiple personal computers (“PCs”) which are used on a daily basis by employees. Typically, the employees regularly store data on the local disk drives within the PCs. As the amount of data stored on such local disk drives increases, the aggregate value of that data to the organization also increases. Consequently, it is a common practice to back up locally stored data by storing copies of the data on one or more remote, backup storage devices.
In many enterprises, the need to preserve data in backup storage systems generates a large and continuously increasing quantity of data. The increasing quantity of data can represent an ongoing challenge for an enterprise, because storage space requirements typically increase as a function of the quantity of stored data. Accordingly, there is a continuing need for effective and efficient methods for backing up data.
One well-known approach to backing up data is to generate a copy of data stored on a local storage device periodically and transmit the copy to a remote backup storage device. For example, in a large enterprise, such as that described above, data stored on one or more PCs in the network may be copied and transmitted via the network to a dedicated backup storage device located elsewhere on the network (or located outside the network). This procedure (referred to herein as a “backup session”) may be performed once per day, for example, or at any other specified interval.
In accordance with one backup strategy, selected data files are designated to be backed up, and a full copy of each designated file is transmitted to the backup storage device during each backup session. Another well-known approach to backing up data is to use an “incremental-and-full” strategy. During relatively frequent “incremental” backup sessions, which may be performed once per day, for example, each designated file is examined and any incremental changes made to the file since the most recent backup session are recorded in the backup storage device. In addition to the incremental backup sessions, “full” backup sessions are performed regularly—once per week, for example. During each “full” backup session, a full copy of each file is transmitted to and stored in the backup storage device.
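By way of illustration only, the scheduling logic of an incremental-and-full strategy may be sketched as follows. The function name `plan_backup_session`, the `full_interval` parameter, and the use of modification times to detect changed files are assumptions made for this sketch and are not taken from any particular backup product. As a simplification, the sketch selects whole files that have changed, whereas an incremental session as described above may record only the changes within each file.

```python
def plan_backup_session(session_index, mtimes, last_session_time, full_interval=7):
    """Decide which designated files a backup session should handle.

    session_index     -- 0-based count of sessions (hypothetical: one per day)
    mtimes            -- dict mapping file name -> last modification time
    last_session_time -- time at which the most recent session ran
    full_interval     -- every full_interval-th session is a full backup
    """
    if session_index % full_interval == 0:
        # Full session: every designated file is copied in its entirety.
        return "full", sorted(mtimes)
    # Incremental session: only files modified since the last session.
    changed = sorted(name for name, t in mtimes.items() if t > last_session_time)
    return "incremental", changed


if __name__ == "__main__":
    files = {"a.txt": 10, "b.txt": 25}
    print(plan_backup_session(0, files, last_session_time=20))
    print(plan_backup_session(1, files, last_session_time=20))
```

In this sketch, a file that changes only slightly between full sessions is still copied in its entirety at every full session.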
Regardless of which approach is used to back up data, a typical backup storage system generates a large and increasing amount of data containing a large number of redundancies. In many cases, a file is changed only slightly between full backup sessions. Nevertheless, during each full backup session the entire file may be stored in a new memory location in the backup storage device. As a result, identical copies of the unchanged portions of these files may be stored multiple times in different locations within the backup storage system. The existence of redundancies in stored data within a backup storage system represents an undesirable and inefficient use of resources.
Accordingly, there is a need to reduce or eliminate redundancies in stored data within storage systems. If the format in which data is stored in a storage system is known, and an accurate directory system for the stored data is accessible, redundancies can be identified by using the directory, for example.
However, in some instances the format of the data stored in a storage system may not be known. Because there is no universally accepted format for storing data, a variety of different formats have been developed and are used by vendors of storage systems in their respective products. For example, there exist differences between disk formats used in storage systems offered by Hitachi Data Systems, located in Santa Clara, Calif., and those offered by EMC, located in Hopkinton, Mass. The formatting and organization of stored data may also be affected by the file system used. For example, there exist differences between the format used by the Microsoft NTFS file system and that used by the Linux ext3 file system.
The multiplicity of formats in existing storage systems poses a challenge when a party, or a software application, that is not familiar with the format used in a given storage system attempts to perform a desired data processing operation on the data stored in the system. For example, if a software application that is selected to reduce or eliminate redundancies within a backup storage system is not familiar with the format used by the system to store data, it will have difficulty performing its designated task. Although the software application may have access to the bits of data stored in the backup storage system, it may have no way to determine where data files begin and end. Even if a desired data file is found, the application may not be able to distinguish the various sections (the header section, the data section, etc.) of the data file.
Without knowledge of the format used by a storage system to store data, it can be challenging to identify and reduce redundancies within the stored data. One solution used in some backup storage systems is to employ a brute force method to locate multiple occurrences of a selected data block within the stored data, and delete all but one (or a few) of the copies. A “sliding window” technique is one such brute force approach. A sliding window is defined to be equal in length to the length of the data block in question. The window is applied to a selected location within the stored data to define a data segment equal in length to the data block. The data block in question is compared to the defined data segment. If the two do not match, the window is shifted by one byte, and another data segment (equal in length to the data block in question) is defined. This new data segment is compared to the data block. If the two do not match, the window is again shifted by one byte, and yet another data segment is defined. This method may be repeated multiple times until the data block is located within the stored data, and may be further repeated to identify additional occurrences of the data block. If multiple occurrences of the data block are found in the stored data, a mechanism to identify and register the duplicate blocks may be applied, and one or more of the copies may be deleted. Because every one-byte shift requires a fresh comparison of up to the full length of the data block, this method can be very time consuming and inefficient, particularly for large volumes of stored data.
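The sliding window technique described above may be sketched as follows. This is a minimal illustration only; the function name `find_occurrences` and the representation of the stored data as an in-memory byte string are assumptions made for this sketch, and a practical system would read the stored data from a storage device.

```python
from typing import List


def find_occurrences(stored: bytes, block: bytes) -> List[int]:
    """Return every byte offset at which `block` occurs within `stored`.

    Brute force: a window equal in length to the block is compared against
    the stored data, then shifted by one byte, until the data is exhausted.
    """
    window = len(block)
    if window == 0:
        return []  # an empty block has no meaningful occurrences
    offsets = []
    # The last valid window start is len(stored) - window.
    for start in range(len(stored) - window + 1):
        segment = stored[start:start + window]  # data segment under the window
        if segment == block:
            offsets.append(start)  # register this occurrence
    return offsets


if __name__ == "__main__":
    data = b"abcXXabcYYabc"
    print(find_occurrences(data, b"abc"))
```

For stored data of N bytes and a block of M bytes, the loop examines up to N − M + 1 window positions and compares up to M bytes at each, which is the source of the inefficiency noted above.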
Tape Libraries and Virtual Tape Libraries (VTLs)
Tape libraries have long been used in backup storage systems to store data. A tape library typically comprises one or more tapes and a mechanism, such as a tape drive, for reading and writing data on the tape(s). In addition, a backup software application manages the storage of data in the tape library. The backup software handles read and write requests received from client computers in a network and directs the requests to the tape library, for example.
Today, large amounts of data are stored in tape libraries. However, due to the inherent limitations of tape libraries, reading or writing data on a tape is often cumbersome and restrictive. Tape is a sequential medium; consequently it requires more time to access a desired data file stored on a tape than to access a file stored on a random-access medium, such as a disk drive. In addition, many tape libraries comprise mechanical parts used to load tapes, etc., and sometimes require human intervention to identify a desired tape or perform other tasks. Therefore, in many cases, performing a data processing operation on data stored on tape is slower than performing the corresponding operation on a random access medium such as a disk drive. As a result, virtual tape libraries (“VTLs”), which typically use one or more disk drives to store data, are sometimes installed in backup storage systems to replace mechanical tape libraries.
When a VTL is added to a tape library system, read and write requests received after the installation of the VTL are typically directed by the backup software to the VTL for storage. Accordingly, any new data is stored in the VTL. Data stored in the VTL is sometimes stored using the same format used by the original, mechanical tape library. Adopting the same format allows a VTL to replace a mechanical tape library and continue to work with the existing backup software seamlessly, thereby avoiding costly changes to an enterprise's IT infrastructure.
In some cases, however, a backup software application used to store data in a VTL is not familiar with, or is incompatible with, the format used to store data on tapes in the original tape library. In these instances, the inability of the backup software application to recognize data in the tape library can be inconvenient and problematic. For example, migrating data from a tape library to a VTL can be challenging when the backup software application used to store data in the VTL is not familiar with the format of the data stored in the tape library.
Use of Digests
In a variety of applications relating to the transmission and storage of data (including data security systems, data encryption systems, etc.), an ongoing need exists to represent data in an alternate form in such a way that the original data may be recovered. One approach that is commonly used involves the use of a known function to generate, for a respective data block, a value (often referred to as a “digest”) that represents the contents of the data block. The digest may be stored or transmitted and subsequently retrieved and processed to recover the original data block.
To be practical, a digest should be substantially smaller than the original data block. Ideally, each digest is uniquely associated with the respective data block from which it is derived. A function which generates a unique digest for each data block is said to be “collision-free.” In practice, it is sometimes acceptable to utilize a function that is substantially, but less than 100%, collision-free. A digest-generating function is referred to herein as a D-G function.
Any one of a wide variety of functions can be used to generate a digest. For example, one well-known D-G function is the cyclic redundancy check (CRC). Cryptographically strong hash functions are also often used for this purpose. A hash function performs a transformation on an input and returns a number having a fixed length, referred to as a hash value. Several well-known hash functions have the ability to (1) take a variable-sized input and generate a fixed-size output, (2) compute the hash relatively easily and quickly for any input value, and (3) be substantially (or “strongly”) collision-free. Examples of hash functions satisfying these criteria include, but are not limited to, the Message Digest 5 (MD5) algorithm and the Secure Hash Algorithm 1 (SHA-1).
The MD5 algorithm generates a 16-byte (128-bit) hash value. It is designed to run on 32-bit computers. MD5 is substantially collision-free. Using MD5, hash values may typically be generated at high speed. The SHA-1 algorithm generates a 20-byte (160-bit) hash value. The maximum input length of a data block to the SHA-1 algorithm is 2^64 bits (approximately 1.8×10^19 bits). The design of SHA-1 is similar to that of MD5, but because its output is larger, it is slightly slower than MD5, although it is more resistant to collisions.
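As an illustration of the D-G functions named above, the following sketch computes a CRC, an MD5 digest, and a SHA-1 digest for a data block using the Python standard library modules `zlib` and `hashlib`. The helper name `compute_digests` is hypothetical, and the 32-bit CRC variant provided by `zlib.crc32` is chosen here only as one example of a cyclic redundancy check.

```python
import hashlib
import zlib


def compute_digests(block: bytes) -> dict:
    """Compute three example digests for a data block."""
    return {
        "crc32": zlib.crc32(block),            # 32-bit integer checksum
        "md5": hashlib.md5(block).digest(),    # 16 bytes (128 bits)
        "sha1": hashlib.sha1(block).digest(),  # 20 bytes (160 bits)
    }


if __name__ == "__main__":
    small = compute_digests(b"example data block")
    large = compute_digests(b"example data block" * 10000)
    # The digest length is fixed regardless of the input length.
    print(len(small["md5"]), len(large["md5"]))
    print(len(small["sha1"]), len(large["sha1"]))
```

Whatever the size of the input block, the MD5 and SHA-1 outputs remain 16 and 20 bytes respectively, which is what makes a digest substantially smaller than a large data block.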