In recent years, the number of stored electronic documents having identical or virtually identical content has grown steadily. For example, when a document is sent by electronic mail to multiple recipients in a business enterprise, the Microsoft Outlook™ electronic mail system ordinarily causes multiple copies of the attachment to be kept in the data storage of the business enterprise.
In an attempt to solve the problem of multiple copies of a file being kept in a storage volume, Microsoft Corporation introduced a Single Instance Storage (SIS) feature in its Microsoft Windows® 2000 server. See William J. Bolosky, “Single Instance Storage in Windows® 2000,” USENIX Technical Program, WinsSys, Aug. 3-4, 2000, Seattle, Wash., USENIX, Berkeley, Calif. SIS uses links to the duplicate file content and copy-on-close semantics upon these links. SIS is structured as a file system filter driver that implements the links and a user level service that detects duplicate files and reports them to the filter for conversion into links.
In a file server having a file de-duplication facility, each file in a de-dupe file system has a de-dupe attribute indicating whether or not the file has undergone a de-duplication task. When a file is migrated into the file server, the de-dupe attribute is initially cleared. The de-duplication task computes a hash of the data in the file to be de-duplicated, and compares this hash to the hash values of the de-duplicated files in the de-dupe file system. If the hash of the file to be de-duplicated matches the hash of a de-duplicated file in the de-dupe file system, then this match indicates a high probability that the data will also match. The file data may also be compared to further validate a match of the hash values. If a match of the file data is found, then the file to be de-duplicated is replaced with a stub inode linked to the indirect blocks and data blocks of the matching de-duplicated file.
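By way of illustration only, the de-duplication task described above can be sketched as follows. The sketch is a simplified in-memory model, not the file server's implementation: the names `FileEntry`, `DedupeFileSystem`, `migrate_in`, and `deduplicate` are hypothetical, SHA-256 is an assumed choice of hash, and a stub inode linked to shared blocks is modeled as a reference to the name of the matching file.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileEntry:
    data: Optional[bytes]               # None once the file has become a stub
    deduped: bool = False               # de-dupe attribute, cleared on migration
    shared_with: Optional[str] = None   # stand-in for the stub's link to shared blocks


class DedupeFileSystem:
    """Toy model of a de-dupe file system (hypothetical, for illustration)."""

    def __init__(self):
        self.files = {}     # file name -> FileEntry
        self.by_hash = {}   # hash -> name of a de-duplicated file holding the data

    def migrate_in(self, name, data):
        # A newly migrated file starts with its de-dupe attribute cleared.
        self.files[name] = FileEntry(data=data)

    def read(self, name):
        entry = self.files[name]
        if entry.shared_with is not None:
            return self.files[entry.shared_with].data
        return entry.data

    def deduplicate(self, name):
        entry = self.files[name]
        if entry.deduped:
            return
        digest = hashlib.sha256(entry.data).hexdigest()  # assumed hash function
        match = self.by_hash.get(digest)
        # A hash match only indicates a probable match, so compare the data too.
        if match is not None and self.files[match].data == entry.data:
            entry.data = None
            entry.shared_with = match
        else:
            # Keep the first file recorded for this hash as the shared copy.
            self.by_hash.setdefault(digest, name)
        entry.deduped = True
```

In this sketch, de-duplicating two migrated files with identical content leaves the first as the holder of the data and turns the second into a stub that reads through to it.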
Migration of files between file servers typically occurs in a hierarchical storage system, or in a distributed network storage system employing namespace virtualization, or in a wide-area network for distribution of read-only copies to remote mirror sites.
In a hierarchical storage system, frequently accessed files are kept in a primary file server having relatively fast but expensive storage, and less frequently accessed files are kept in a secondary file server having relatively inexpensive and slow storage. If a file stored in the primary file server is not accessed over a certain duration of time, the file is automatically migrated from the primary file server to the secondary file server. Client workstations request access to the file from the primary file server, and if the file is not found in the primary file server, then the primary file server requests access to the file from the secondary file server, and the file is migrated from the secondary file server back to the primary file server.
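The migration policy of such a hierarchical storage system can be sketched, under simplifying assumptions, as follows. The class `TieredStore`, its method names, and the `idle_limit` parameter are hypothetical; the two file servers are modeled as dictionaries, and time is passed in explicitly so the behavior is deterministic.

```python
class TieredStore:
    """Toy model of a primary/secondary file server pair (hypothetical)."""

    def __init__(self, idle_limit):
        self.idle_limit = idle_limit  # idle duration after which a file is demoted
        self.primary = {}             # name -> (data, time of last access)
        self.secondary = {}           # name -> data

    def write(self, name, data, now):
        self.primary[name] = (data, now)

    def read(self, name, now):
        # Clients always ask the primary; on a miss the file is recalled
        # from the secondary and migrated back to the primary.
        if name not in self.primary:
            data = self.secondary.pop(name)  # raises KeyError if the file is unknown
        else:
            data, _ = self.primary[name]
        self.primary[name] = (data, now)     # record the access time
        return data

    def demote_idle(self, now):
        # Migrate files not accessed within idle_limit to the secondary.
        for name, (data, last_access) in list(self.primary.items()):
            if now - last_access > self.idle_limit:
                self.secondary[name] = data
                del self.primary[name]
```

For example, a file written at time 0 with `idle_limit=10` is moved to the secondary by `demote_idle(now=11)`, and a later `read` brings it back to the primary.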
In a distributed network storage system employing namespace virtualization, the client workstations send file access requests to a namespace server. The namespace server maps the user-visible file names to pathnames in a storage network namespace, and functions as a proxy server by forwarding the translated client file access requests to back-end servers in the storage network. The namespace server may migrate files between back-end servers to balance the load across the back-end servers, and to use different classes of storage more efficiently by moving infrequently accessed files to slower and less expensive storage and by moving frequently accessed files to faster and more expensive storage.
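The mapping and proxy roles of the namespace server can be illustrated with the following minimal sketch. The classes `BackendServer` and `NamespaceServer` and their methods are hypothetical names introduced for illustration; network transport, access control, and the policy that decides when to migrate are omitted.

```python
class BackendServer:
    """Stand-in for a back-end file server (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.files = {}  # pathname -> data

    def read(self, path):
        return self.files[path]

    def write(self, path, data):
        self.files[path] = data

    def delete(self, path):
        del self.files[path]


class NamespaceServer:
    """Maps user-visible names to (back-end server, pathname) and proxies requests."""

    def __init__(self, backends):
        self.backends = {b.name: b for b in backends}
        self.mapping = {}  # user-visible name -> (back-end name, pathname)

    def create(self, visible_name, backend, path, data):
        self.backends[backend].write(path, data)
        self.mapping[visible_name] = (backend, path)

    def read(self, visible_name):
        # Translate the client's name, then forward the request.
        backend, path = self.mapping[visible_name]
        return self.backends[backend].read(path)

    def migrate(self, visible_name, target):
        # Copy to the target, switch the mapping, then remove the source copy.
        source, path = self.mapping[visible_name]
        if source == target:
            return
        data = self.backends[source].read(path)
        self.backends[target].write(path, data)
        self.mapping[visible_name] = (target, path)
        self.backends[source].delete(path)
```

Because clients address files only by the user-visible name, a migration in this sketch changes the mapping without changing what the client requests.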
In a wide-area network, read-only files, such as web pages and document or program downloads, are often distributed from a local site to one or more geographically remote sites for servicing users in different geographic regions or countries. The remote sites are maintained as least-recently-accessed caches of the read-only files. A file copy is migrated from the local site to a remote site in response to an access request from a remote user when the requested file is not found at the remote site in the user's region or country.
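A remote site of this kind can be sketched as a small cache that is filled on demand. The sketch assumes that "least-recently-accessed" means the entry accessed longest ago is discarded when the cache is full; that reading, the class name `RemoteSiteCache`, and the `capacity` parameter are assumptions made for illustration, and the local site is modeled as a plain dictionary.

```python
from collections import OrderedDict


class RemoteSiteCache:
    """Toy model of a remote mirror site holding read-only copies (hypothetical)."""

    def __init__(self, local_site, capacity):
        self.local_site = local_site  # name -> data, stands in for the local site
        self.capacity = capacity      # maximum number of cached files (assumed)
        self.cache = OrderedDict()    # ordered from least to most recently accessed
        self.migrations = 0           # number of copies pulled from the local site

    def get(self, name):
        if name in self.cache:
            self.cache.move_to_end(name)  # mark as most recently accessed
            return self.cache[name]
        # Miss: migrate a copy from the local site to this remote site.
        data = self.local_site[name]
        self.migrations += 1
        self.cache[name] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # discard the least recently accessed file
        return data
```

With a capacity of two, requesting files `a`, `b`, `a`, and `c` in that order leaves `a` and `c` cached, since `b` was the least recently accessed when `c` arrived.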