In recent years, the number of stored electronic documents that have identical or virtually identical content has grown steadily. For example, the Microsoft Outlook™ electronic mail system ordinarily results in multiple copies of an attachment being kept in data storage of a business enterprise when a document is sent by electronic mail to multiple recipients in the business enterprise.
In an attempt to solve the problem of multiple copies of a file being kept in a storage volume, Microsoft Corporation introduced a Single Instance Storage (SIS) feature in its Microsoft Windows® 2000 server. See William J. Bolosky, “Single Instance Storage in Windows® 2000,” USENIX Technical Program, WinSys, Aug. 3-4, 2000, Seattle, Wash., USENIX, Berkeley, Calif. SIS uses links to the duplicate file content and applies copy-on-close semantics to these links. SIS is structured as a file system filter driver that implements the links and a user level service that detects duplicate files and reports them to the filter for conversion into links.
SIS, however, will not reduce the data storage requirements or performance degradation due to virtually identical files. For example, an E-mail application such as the Microsoft Outlook™ electronic mail system may produce virtually identical files in a business enterprise when an E-mail is sent to multiple recipients in the business enterprise.
Data de-duplication techniques similar to SIS have been developed for reducing the data storage requirements of virtually identical files. These data de-duplication techniques determine file segments that are identical among virtually identical files, so that the data content of each shared file segment need be stored only once for the virtually identical files. The shared data content is placed in a common storage area, and each identical segment is removed from each of the virtually identical files and replaced with a corresponding link to the shared data content.
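The segment-level sharing described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the method of any particular product: the fixed segment size, the use of SHA-256 digests as links, and the names `deduplicate` and `rebuild` are all hypothetical choices made for the example.

```python
import hashlib

SEGMENT_SIZE = 8  # hypothetical segment size, kept tiny for illustration


def deduplicate(files, segment_size=SEGMENT_SIZE):
    """Split each file into fixed-size segments and store each distinct
    segment once. Returns (common_store, links_per_file)."""
    common_store = {}    # digest -> segment bytes (the common storage area)
    links_per_file = {}  # file name -> ordered list of digests (the links)
    for name, data in files.items():
        links = []
        for offset in range(0, len(data), segment_size):
            segment = data[offset:offset + segment_size]
            digest = hashlib.sha256(segment).hexdigest()
            # Store the segment only if no identical segment is present yet.
            common_store.setdefault(digest, segment)
            links.append(digest)
        links_per_file[name] = links
    return common_store, links_per_file


def rebuild(common_store, links):
    """Reassemble a file by following its links into the common store."""
    return b"".join(common_store[digest] for digest in links)


# Two virtually identical files: same first segment, different second segment.
files = {"a": b"HEADER__body-one", "b": b"HEADER__body-two"}
common_store, links_per_file = deduplicate(files)
```

In this example the two files together hold four segments, but the common store keeps only three, because the shared first segment is stored once and both files link to it.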
In a file server having a redundant data elimination (RDE) store, data de-duplication is applied to a file when the file is migrated into the file server or when new data is written to the file. For example, the migration process creates a new baseline version of the file in the file server, and copies data to the baseline version from a source external to the file server. The baseline version does not share file segments with other files in the file server. Then the baseline version is space reduced by applying data de-duplication.
For example, the migration process copies the data from the source external to the file server to newly allocated extents of logical data blocks in the data storage of the file server. Then the data de-duplication process converts the baseline version into a stub version that may reference shared extents of logical data blocks in the data storage of the file server. For example, the data de-duplication process copies the inode and indirect blocks of the baseline version to create the stub version. Initially, an attribute of the file is set to indicate that the de-duplication process is in progress. Then the data de-duplication process searches the RDE store for a copy of the data in each extent of the baseline version. If a copy of the data is found in the RDE store, then the pointer in the stub version is changed to point to the extent containing the copy of the data, and a reference counter in the RDE store for the extent containing the copy is incremented. Once the data de-duplication process has been applied to all of the extents of the baseline version, the attribute of the file is set to indicate that the de-duplication process is finished. The stub version is then substituted for the baseline version, the original inode and indirect blocks of the baseline version are deleted, and any extents of the baseline version not shared with the stub version are deallocated.
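The conversion of a baseline version into a stub version can be illustrated with a small in-memory model. This is a hedged sketch, not the file server's implementation. The names `RdeStore`, `FileVersion`, and `deduplicate_file` are hypothetical, a list of extent identifiers stands in for the inode and indirect blocks, and copies are located by SHA-256 digest. Registering an unmatched extent in the store is also an assumption, since the description above does not say what happens when no copy is found.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class RdeStore:
    index: dict = field(default_factory=dict)     # digest -> shared extent id
    refcount: dict = field(default_factory=dict)  # extent id -> reference count


@dataclass
class FileVersion:
    pointers: list             # extent ids; stands in for inode and indirect blocks
    dedup_state: str = "none"  # the file attribute tracking de-duplication


def deduplicate_file(f, storage, rde):
    """Convert the baseline version of f into a stub version in place."""
    f.dedup_state = "in_progress"
    baseline = f.pointers
    stub = list(baseline)  # copy of the baseline pointers
    for i, extent in enumerate(baseline):
        digest = hashlib.sha256(storage[extent]).hexdigest()
        shared = rde.index.get(digest)
        if shared is not None:
            # A copy exists: repoint the stub and count the new reference.
            stub[i] = shared
            rde.refcount[shared] += 1
        else:
            # Assumption: an unmatched extent is registered for later sharing.
            rde.index[digest] = extent
            rde.refcount[extent] = 1
    f.dedup_state = "finished"
    f.pointers = stub  # substitute the stub for the baseline
    # Deallocate baseline extents that the stub no longer references.
    for extent in set(baseline) - set(stub):
        del storage[extent]


# Extents 1 and 3 hold the same data, so the second file ends up sharing extent 1.
storage = {1: b"AAAA", 2: b"BBBB", 3: b"AAAA", 4: b"CCCC"}
rde = RdeStore()
first, second = FileVersion([1, 2]), FileVersion([3, 4])
deduplicate_file(first, storage, rde)
deduplicate_file(second, storage, rde)
```

After both calls, the second file points at extents 1 and 4, extent 3 has been deallocated, and the reference count for extent 1 is 2. A real file server would compare or otherwise verify the extent data rather than rely on this simplified digest lookup alone.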