1. Field of the Invention
The present invention relates to backups, and, more particularly, to optimizing backup overhead for defragmented disks.
2. Related Art
One of the problems with conventional backup schemes relates to defragmentation of the disk drive. As a practical matter, most large files are not stored on a disk sequentially. As files are added to and deleted from the disk drive, free blocks become available, and the operating system uses them to store pieces of a file wherever space is available. Thus, a single file can be broken up into a number of blocks stored at different locations on the disk drive. When the file is accessed, those blocks are collected and put back together into the original file. This operation involves overhead, and it is therefore desirable to have files whose blocks are stored sequentially wherever possible. The process that rearranges the stored blocks on the disk so that the blocks of each file are stored sequentially, to the extent possible, is called “defragmentation.”
Defragmentation can be performed relatively often, particularly in a server environment, where the server maintains a large number of files that constantly change. The problem with defragmentation and backups is that, as far as the backup software is concerned, after the disk drive has been defragmented it essentially needs to be backed up all over again. From the perspective of the backup software, it is no longer possible to do an incremental backup, since such a large number of files have “changed.” This is despite the fact that the actual content of the files does not change at all; only the locations of the blocks that make up the files change. Therefore, an unnecessary complete (or near complete) backup needs to be performed after defragmentation, incurring considerable additional overhead due to the backup process.
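The effect described above can be illustrated with a minimal sketch. It assumes a hypothetical eight-block disk modeled as a list of byte strings, and compares two ways of deciding which blocks have "changed" between snapshots: by block location, and by block content. The function names, the block layout, and the use of SHA-256 hashes are assumptions made for illustration only and are not taken from any particular backup product.

```python
import hashlib

def changed_by_location(old, new):
    """Return block numbers whose contents differ at the same position."""
    return [i for i, (a, b) in enumerate(zip(old, new)) if a != b]

def changed_by_content(old, new):
    """Return block numbers in `new` whose contents appear nowhere in `old`."""
    seen = {hashlib.sha256(block).digest() for block in old}
    return [i for i, block in enumerate(new)
            if hashlib.sha256(block).digest() not in seen]

# Hypothetical disk before defragmentation: file A (A1..A3) is interleaved
# with file B (B1, B2); an empty byte string marks a free block.
before = [b"A1", b"B1", b"A2", b"", b"B2", b"A3", b"", b""]

# The same disk after defragmentation: each file's blocks are contiguous.
after = [b"A1", b"A2", b"A3", b"B1", b"B2", b"", b"", b""]

location_changes = changed_by_location(before, after)
content_changes = changed_by_content(before, after)

print(location_changes)  # [1, 2, 3, 5]
print(content_changes)   # []
```

In this sketch, a location-based comparison reports four of the eight blocks as changed even though no file content was modified, while a content-based comparison reports none.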
Hierarchical Storage Management (HSM) systems are known for backing up the content of storage devices on different storage media. There are also conventional methods for detecting and using patterns in data processing. For example, the algorithms of Ziv, Lempel and Welch, used for data compression, detect exact repetitions of data strings and store only a single instance of each repeated string. These methods use limited space for storing the content of repeated blocks, and cannot be used with acceptable performance to identify the contents of long random data sequences, such as the blocks of storage devices.
Accordingly, there is a need in the art for a method of backing up large amounts of storage device data with high performance and reliability.