In many computing environments, large amounts of data are written to and retrieved from storage devices connected to one or more computers. For example, many large enterprises maintain local area networks (“LANs”) comprising multiple personal computers (“PCs”) which are used on a daily basis by employees. Typically, the employees regularly store data on the local disk drives within the PCs. As the amount of data stored on such local disk drives increases, the aggregate value of that data to the organization also increases. Consequently, it is a common practice to back up locally stored data by storing copies of the data on one or more remote, backup storage devices.
In many enterprises, the need to preserve data in backup storage systems generates a large and continuously increasing quantity of data. The increasing quantity of data can represent an ongoing challenge for an enterprise, because storage space requirements typically increase as a function of the quantity of stored data. Accordingly, there is a continuing need for effective and efficient methods for backing up data.
One well-known approach to backing up data is to generate a copy of data stored on a local storage device periodically and transmit the copy to a remote backup storage device. For example, in a large enterprise, such as that described above, data stored on one or more PCs in the network may be copied and transmitted via the network to a dedicated backup storage device located elsewhere on the network (or located outside the network). This procedure (referred to herein as a “backup session”) may be performed once per day, for example, or at any other specified interval.
In accordance with one backup strategy, selected data files are designated to be backed up, and a full copy of each designated file is transmitted to the backup storage device during each backup session. Another well-known approach to backing up data is to use an “incremental-and-full” strategy. During relatively frequent “incremental” backup sessions, which may be performed once per day, for example, each designated file is examined and any incremental changes made to the file since the most recent backup session are recorded in the backup storage device. In addition to the incremental backup sessions, “full” backup sessions are performed regularly—once per week, for example. During each “full” backup session, a full copy of each file is transmitted to and stored in the backup storage device.
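The incremental-and-full strategy described above can be sketched as follows. This is a minimal illustration, not a definitive implementation. The names `FULL_INTERVAL_DAYS`, `session_type`, and `files_to_transmit` are hypothetical, and so are the file names and the weekly interval. For simplicity the sketch tracks changes per file, whereas an actual incremental session may record only the changed portions of each file.

```python
from typing import Dict, List

FULL_INTERVAL_DAYS = 7  # assumption: one "full" session per week


def session_type(day: int) -> str:
    """Return 'full' on every FULL_INTERVAL_DAYS-th day, else 'incremental'."""
    return "full" if day % FULL_INTERVAL_DAYS == 0 else "incremental"


def files_to_transmit(day: int,
                      last_modified: Dict[str, int],
                      last_session_day: int) -> List[str]:
    """Select the designated files to send during the session on `day`.

    `last_modified` maps each designated file name to the day on which
    it last changed (a hypothetical bookkeeping structure).
    """
    if session_type(day) == "full":
        # Full session: a full copy of every designated file is sent.
        return sorted(last_modified)
    # Incremental session: only files changed since the most recent session.
    return sorted(name for name, changed in last_modified.items()
                  if changed > last_session_day)


designated = {"report.doc": 8, "budget.xls": 3, "notes.txt": 9}
print(session_type(9), files_to_transmit(9, designated, last_session_day=8))
print(session_type(14), files_to_transmit(14, designated, last_session_day=13))
```

In this sketch, the session on day 9 is incremental and selects only the file changed since day 8, while the session on day 14 is full and selects every designated file regardless of whether it changed.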
Regardless of which approach is used to back up data, a typical backup storage system generates a large and increasing amount of data containing a large number of redundancies. In many cases, a file is changed only slightly between full backup sessions. Nevertheless, during each full backup session the entire file may be stored in a new memory location in the backup storage device. As a result, identical copies of the unchanged portions of these files may be stored multiple times in different locations within the backup storage system. The existence of redundancies in stored data within a backup storage system represents an undesirable and inefficient use of resources.
Accordingly, there is a need to reduce or eliminate redundancies in stored data within storage systems. If the format in which data is stored in a storage system is known, and an accurate directory system for the stored data is accessible, redundancies can be identified by using the directory, for example.
However, in some instances the format of the data stored in a storage system may not be known. Because there is no universally accepted format for storing data, a variety of different formats have been developed, and vendors of storage systems use different formats in their respective products. For example, there exist differences between disk formats used in storage systems offered by Hitachi Data Systems, located in Santa Clara, Calif., and those offered by EMC, located in Hopkinton, Mass. The formatting and organization of stored data may also be affected by the file system used. For example, there exist differences between the formats used by the Microsoft NTFS file system and the Linux ext3 file system.
The multiplicity of formats in existing storage systems poses a challenge when a party or a software application that is not familiar with the format used in a given storage system attempts to perform a desired data processing operation on the data stored in the system. For example, if a software application selected to reduce or eliminate redundancies within a backup storage system is not familiar with the format used by the system to store data, it will have difficulty performing its designated task. Although the software application may have access to the bits of data stored in the backup storage system, it may have no way to determine where data files begin and end. Even if a desired data file is found, the application may not be able to distinguish the various sections (the header section, the data section, etc.) of the data file.
Without knowledge of the format used by a storage system to store data, it can be challenging to identify and reduce redundancies within the stored data. One solution used in some backup storage systems is to employ a brute force method to locate multiple occurrences of a selected data block within the stored data, and delete all but one (or a few) of the copies. A “sliding window” technique is one such brute force approach. A sliding window is defined with a length equal to that of the data block in question. The window is applied to a selected location within the stored data to define a data segment equal in length to the data block. The data block in question is compared to the defined data segment. If the two do not match, the window is shifted by one byte, and another data segment (equal in length to the data block in question) is defined. This new data segment is compared to the data block. If the two do not match, the window is again shifted by one byte, and yet another data segment is defined. This method may be repeated multiple times until the data block is located within the stored data, and may be further repeated to identify additional occurrences of the data block. If multiple occurrences of the data block are found in the stored data, a mechanism to identify and register the duplicate blocks may be applied, and one or more of the copies may be deleted. This method can be very time consuming and inefficient, because in the worst case a comparison must be performed at nearly every byte offset of the stored data for each data block sought.
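The sliding window technique described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used by any particular backup storage system. The function name `find_block_occurrences` and the sample data are hypothetical, and the stored data is assumed to be available as one in-memory byte string.

```python
from typing import List


def find_block_occurrences(stored: bytes, block: bytes) -> List[int]:
    """Return every byte offset at which `block` occurs within `stored`."""
    window = len(block)  # the window is as long as the data block
    if window == 0 or window > len(stored):
        return []
    offsets = []
    # Apply the window at each location, shifting it by one byte per step.
    for start in range(len(stored) - window + 1):
        segment = stored[start:start + window]  # segment defined by the window
        if segment == block:                    # compare the block to the segment
            offsets.append(start)               # register this occurrence
    return offsets


stored = b"ABCDxxABCDyyABCD"
occurrences = find_block_occurrences(stored, b"ABCD")
print(occurrences)  # [0, 6, 12]
```

Each step compares up to `len(block)` bytes, and there is one step for nearly every byte offset of the stored data, so the work grows roughly with the product of the two lengths. Once the offsets are registered, all but one of the occurrences could be treated as duplicates.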