In recent years, the data storage industry has been immensely successful in providing ever-increasing amounts of data storage at decreased cost. This has permitted customers to keep vast numbers of business and technical documents in data storage for convenient access. Unfortunately, it has also permitted software vendors to sell applications that generate many electronic documents that have identical or virtually identical content. These applications not only waste storage space but also waste processing time and reduce primary cache efficiency when multiple copies are recalled from storage pursuant to a search. For example, the Microsoft Outlook™ electronic mail system ordinarily results in multiple copies of an attachment being kept in data storage of a business enterprise when a document is sent by electronic mail to multiple recipients in the business enterprise.
In an attempt to solve the problem of multiple copies of a file being kept in a storage volume, Microsoft Corporation introduced a Single Instance Storage (SIS) feature in its Microsoft Windows® 2000 server. See William J. Bolosky, “Single Instance Storage in Windows® 2000,” USENIX Technical Program, WinSys, Aug. 3-4, 2000, Seattle, Wash., USENIX, Berkeley, Calif. SIS uses links to the duplicate file content and copy-on-close semantics upon these links. SIS is structured as a file system filter driver that implements the links and a user-level service that detects duplicate files and reports them to the filter for conversion into links.
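The general idea of detecting whole-file duplicates and converting them into links can be sketched as follows. This is a minimal illustration in Python, not the SIS implementation: the function names, the in-memory dictionaries standing in for a storage volume, and the use of SHA-256 digests to recognize identical content are all assumptions made for the example, and copy-on-close semantics are not modeled.

```python
import hashlib
from collections import defaultdict


def find_duplicate_groups(files):
    """Group file names whose contents are byte-for-byte identical.

    `files` is a hypothetical in-memory volume: file name -> content bytes.
    """
    by_digest = defaultdict(list)
    for name, content in files.items():
        by_digest[hashlib.sha256(content).hexdigest()].append(name)
    # Only groups with more than one member are duplicates worth linking.
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


def convert_to_links(files):
    """Keep each distinct content once and replace every file with a link."""
    store = {}  # digest -> content, stored a single time
    links = {}  # file name -> digest, playing the role of the link
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        store.setdefault(digest, content)
        links[name] = digest
    return store, links


volume = {"a.doc": b"report", "b.doc": b"report", "c.doc": b"other"}
groups = find_duplicate_groups(volume)
store, links = convert_to_links(volume)
```

In this sketch, `a.doc` and `b.doc` end up linked to the same stored content. A production system would likely also compare the candidate files byte by byte before linking them, rather than relying on a digest alone.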
SIS, however, will not reduce the data storage requirements or performance degradation due to virtually identical files. For example, an E-mail application such as the Microsoft Outlook™ electronic mail system may produce virtually identical files in a business enterprise when an E-mail is sent to multiple recipients in the business enterprise. Data de-duplication techniques similar to SIS have been developed for reducing the data storage requirements of virtually identical files. These data de-duplication techniques determine file segments that are identical among virtually identical files, so that the data content of each shared file segment need be stored only once for the virtually identical files. The shared data content is placed in a common storage area, and each identical segment is removed from each of the virtually identical files and replaced with a corresponding link to the shared data content.
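The segment-sharing approach described above can be illustrated with a short Python sketch. It is only one possible arrangement under stated assumptions: the fixed segment size, the function names, the dictionary used as the common storage area, and the SHA-256 digests used as links are all hypothetical choices for the example and are not taken from any particular product.

```python
import hashlib

SEGMENT_SIZE = 8  # illustrative only; real systems use far larger segments


def deduplicate(files, segment_size=SEGMENT_SIZE):
    """Split each file into fixed-size segments and store each distinct one once.

    Returns (shared, recipes): `shared` maps digest -> segment bytes (the
    common storage area), and `recipes` maps file name -> ordered list of
    digests (the links that replace the file's segments).
    """
    shared = {}
    recipes = {}
    for name, content in files.items():
        segment_links = []
        for offset in range(0, len(content), segment_size):
            segment = content[offset:offset + segment_size]
            digest = hashlib.sha256(segment).hexdigest()
            shared.setdefault(digest, segment)  # stored only on first sight
            segment_links.append(digest)
        recipes[name] = segment_links
    return shared, recipes


def restore(name, shared, recipes):
    """Rebuild a file by following its links into the common storage area."""
    return b"".join(shared[digest] for digest in recipes[name])


# Two virtually identical messages that differ only in the final segment.
mail = {
    "m1": b"HEADER01" + b"BODYBODY" + b"to:alice",
    "m2": b"HEADER01" + b"BODYBODY" + b"to:bob!!",
}
shared, recipes = deduplicate(mail)
```

Here the two messages contain six segments in total, but only four distinct segments are stored, and `restore` reproduces each original message from its links. Fixed-size segmentation is the simplest choice for a sketch; it fails to find shared content that is shifted by an insertion, which is why practical systems often choose segment boundaries based on the content itself.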
Because customers have kept vast numbers of business and technical documents in data storage for convenient access, many customers have been surprised by the cost of producing their electronic documents for regulatory compliance and for responding to discovery requests in litigation. For regulatory compliance, electronic document retention techniques have been developed so that critical documents are retained in disk storage until a specified expiration time.
SIS, de-duplication, and electronic document retention techniques are specific examples of information lifecycle management (ILM) strategies to facilitate efficient storage and selective recall of electronic business and technical documents. Many of these techniques involve classification and indexing of information in the electronic documents.