Corporations and other organizations routinely copy data produced and/or stored by their computer systems in order to provide additional protection for the data, to comply with regulatory requirements, or for other business reasons. For example, a company might retain data from computing systems related to e-commerce, such as databases, file servers, web servers, and so on. The company may also retain data from computing systems used by employees, such as those used by an accounting department, marketing department, engineering, and so on. The data may include, for example, personal data, financial data, customer/client/patient data, audio/visual data, textual data, and other types of data. Organizations may also retain data related to the correct operation of their computer systems, such as operating system files, application files, user settings, and so on.
Current storage management systems employ a number of different methods to retain and archive data. For example, data can be stored in primary storage as a primary copy that includes production data, or in secondary storage as various types of secondary copies, including a recovery copy, a continuous data protection (“CDP”) copy, a backup copy, a snapshot copy, a hierarchical storage management (“HSM”) copy, an archive copy, and other types of secondary copies.
A primary, or active, copy of data is generally a production copy or other “live” version of the data which is used by a software application and is generally in the native format of that application. Primary copy data may be maintained in a local memory, disk, or other high-speed storage device that allows for relatively fast data access if necessary. Such primary copy data is typically intended for short-term retention (e.g., several hours or days) before some or all of the data is stored as one or more secondary copies, for example to prevent loss of data in the event a problem occurs with the data stored in primary storage.
Secondary, or passive, copies include point-in-time data and are typically intended for longer-term retention (e.g., weeks, months or years depending on retention criteria, for example as specified in a storage policy or other policies as further described herein) before some or all of the data is moved to other storage or discarded. Secondary copies may be indexed so users can browse, search and restore the data at another point in time. A secondary copy may be stored on disk, tape, or other types of media. After certain primary copy data is backed up, a pointer or other location indicia such as a stub may be placed in the primary copy to indicate the current location of that data. Further details may be found in the assignee's U.S. Pat. No. 7,107,298, filed Sep. 30, 2002, entitled SYSTEM AND METHOD FOR ARCHIVING OBJECTS IN AN INFORMATION STORE.
One type of secondary copy is a backup copy. A backup copy is generally a point-in-time copy of the primary copy data stored in a backup format as opposed to in native application format. For example, a backup copy may be stored in a backup format that is optimized for compression and efficient long-term storage. Backup copies generally have relatively long retention periods and may be stored on media with slower retrieval times than other types of secondary copies and media. In some cases, backup copies may be stored at an offsite location.
Another form of secondary copy is a snapshot copy. From an end-user viewpoint, a snapshot may be thought of as an instant image of the primary copy data at a given point in time. A snapshot may capture the directory structure of a primary copy volume at a particular moment in time, and may also preserve file attributes and contents. In some embodiments, a snapshot may exist as a virtual file system, parallel to the actual file system. Users may gain read-only access to the record of files and directories of the snapshot. By electing to restore primary copy data from a snapshot taken at a given point in time, users may also return the current file system to the prior state of the file system that existed when the snapshot was taken.
A snapshot may be created nearly instantly, using a minimum of file space, but may still function as a conventional file system backup. A snapshot may not actually create another physical copy of all the data, but may simply create a table of pointers that are able to map files and directories to specific disk blocks. The table of pointers may indicate which blocks are unchanged, and if a block has changed, the table may point to a location where the previous, unchanged version of the block has been stored (copy-on-write).
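The copy-on-write behavior described above can be illustrated with a minimal sketch. The `Volume` class and its method names are hypothetical and chosen only for illustration; an actual snapshot implementation operates on disk blocks and file system metadata rather than an in-memory list.

```python
class Volume:
    """Toy block store with copy-on-write snapshots (illustrative only)."""

    def __init__(self, blocks):
        self.blocks = list(blocks)  # live primary copy blocks
        self.snapshots = []         # one pointer table per snapshot

    def take_snapshot(self):
        # No data is copied at creation time: the table starts empty,
        # meaning every block is still unchanged since the snapshot.
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, index, data):
        # Copy-on-write: before overwriting a block, preserve the old
        # version in each snapshot that has not already saved that block.
        for table in self.snapshots:
            if index not in table:
                table[index] = self.blocks[index]
        self.blocks[index] = data

    def read_snapshot(self, snap_id, index):
        # Changed blocks are read from the preserved version; unchanged
        # blocks are read directly from the live volume.
        table = self.snapshots[snap_id]
        if index in table:
            return table[index]
        return self.blocks[index]


vol = Volume(["a", "b", "c"])
snap = vol.take_snapshot()
vol.write(1, "B")
print(vol.blocks[1], vol.read_snapshot(snap, 1))
```

In this sketch a snapshot consumes space only for blocks that are overwritten after it is taken, which is consistent with the near-instant, low-overhead creation described above.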
An HSM copy is generally a copy of the primary copy data, but typically includes only a subset of the primary copy data that meets certain criteria and is usually stored in a format other than the native application format (e.g., compressed, deduplicated, and converted to a generic format). For example, an HSM copy might include only that data from the primary copy that is larger than a given size threshold or older than a given age threshold and that is stored in a backup format. Often, HSM data is removed from the primary copy, and a stub is stored in the primary copy to indicate its new location. When a user requests access to the HSM data that has been removed or migrated, systems use the stub to locate the data and often make recovery of the data appear transparent even though the HSM data may be stored at a location different from the remaining primary copy data.
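The threshold-based migration and stub-based recall described above might be sketched as follows. The names `FileEntry`, `Stub`, `migrate`, and `read`, the dictionary-based stores, and the use of zlib compression as a stand-in for a backup format are all assumptions made for illustration and do not describe any particular product.

```python
import zlib
from dataclasses import dataclass


@dataclass
class FileEntry:
    size_bytes: int
    age_days: int
    data: bytes


@dataclass
class Stub:
    location: str  # where the migrated data now resides in secondary storage


def migrate(primary, secondary, size_threshold, age_threshold):
    """Move files exceeding either threshold to secondary storage, leaving stubs."""
    for name, entry in list(primary.items()):
        if isinstance(entry, Stub):
            continue  # already migrated
        if entry.size_bytes > size_threshold or entry.age_days > age_threshold:
            location = "hsm/" + name
            # Store in a non-native (here, compressed) format.
            secondary[location] = zlib.compress(entry.data)
            primary[name] = Stub(location)


def read(primary, secondary, name):
    """Return file contents, following a stub transparently if one is present."""
    entry = primary[name]
    if isinstance(entry, Stub):
        return zlib.decompress(secondary[entry.location])
    return entry.data


primary = {
    "big.log": FileEntry(5000, 1, b"x" * 5000),
    "new.txt": FileEntry(5, 2, b"hello"),
}
secondary = {}
migrate(primary, secondary, size_threshold=1000, age_threshold=30)
print(type(primary["big.log"]).__name__, len(read(primary, secondary, "big.log")))
```

The caller of `read` does not need to know whether the data was migrated, which mirrors the transparent recall described above.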
An archive copy is generally similar to an HSM copy; however, the data satisfying criteria for removal from the primary copy is generally completely removed with no stub left in the primary copy to indicate the new location (i.e., where it has been moved to). Archive copies of data are generally stored in a backup format or other non-native application format. In addition, archive copies are generally retained for very long periods of time (e.g., years) and in some cases are never deleted. Such archive copies may be made and kept for extended periods in order to meet compliance regulations or for other permanent storage applications.
Typical data storage systems create a first secondary backup copy from a production copy for short-term data recovery and, after a certain time, send a copy to an archive for long-term storage, e.g., to comply with regulatory retention requirements. Thus, organizations are storing large amounts of data in their data archives at great expense.
A copy of data may be “online” or “offline” with respect to an organization. An online copy is a copy whose entire contents are readily accessible over the organization's network, without the need to physically recall or retrieve the physical storage media that stores the copy, e.g., from an off-site location, and without the need for manual human intervention. An online copy may include, for example, a production copy stored on the organization's mail server, a snapshot copy stored on magnetic storage media that is connected to the organization's storage area network, an archive copy stored on a cloud storage site that is accessible by the organization's network, and a backup copy stored on a tape that is managed by a tape autoloader in the organization's tape library. An offline copy is one whose entire contents are not readily accessible over an organization's network because the physical storage media that stores the copy must be physically recalled or retrieved, e.g., from an off-site location, or because human intervention is required to retrieve the copy. An example of an offline copy is an archive copy that has been stored on tape media that has been sent to a secure off-site location. In such an example, in order to access the entire contents of the archive copy, the tape must be retrieved from the off-site location (e.g., via a physical shipment) and then reloaded into an on-site tape library. Another example of an offline copy is a secondary copy stored on a USB flash memory device that an employee has stored in her desk drawer. In such a scenario, the employee must physically reconnect the USB memory device to her computer before it will be accessible over the organization's network.
Companies are often required to retain documents in archive files and/or implement various data management tasks in order to comply with various data regulations and avoid enforcement actions. For example, when a company is in litigation, the company may be required to retain documents related to the litigation. Employees are often asked not to delete any correspondence, emails, or other documents related to the subject matter of the litigation. Recently enacted amendments to the Federal Rules of Civil Procedure (FRCP) place additional document retention burdens on a company. According to Gartner, “Several legal commentators believe that the heart of the proposed changes to FRCP is the formal codification of ‘electronically stored information’ (ESI) and the recognition that the traditional discovery framework dealing with paper-based documents is no longer adequate.” Legal discovery of electronic information has emerged as a key requirement for today's enterprise in recent years, and the new federal rules both strengthen and expand those requirements.
As another example, regulatory authorities all over the world are intensifying the monitoring and enforcement of specific electronic recordkeeping requirements. Such enforcement may relate to EU Data Protection, EU Data Privacy, Environmental Protection Act, Employment Act, Health & Safety Executive, enforcement in relation to standards including the British Standards Institution (BSI) BIP 0008, ISO/IEC 17799, ISO 15489, BS 7799, ISO 9000, as well as specific legislation on company records and anti-terrorism. There are also other concerns for organizations that must comply with U.S.-specific legislation such as the Sarbanes-Oxley Act, SEC Rule 17a-4, NASD 3110 and 3111, Health Information Technology for Economic and Clinical Health Act (HITECH) & Health Insurance Portability and Accountability Act (HIPAA) 1996, Obstruct Terrorism Act of 2001, Gramm-Leach-Bliley Act (GLBA), Financial Institution Privacy Protection Act of 2001 and the Financial Institution Privacy Protection Act of 2003. For federal, government and military institutions, compliance means conforming to the retention schemas and guidelines imposed by records management authorities, the recommendations of national archives or specific record statutes. For example, the US Department of Defense records-keeping requirements DoD 5015.2 and DoD 8320.02 impose a formal taxonomy, where compliance is about aligning records to specific retention and disposition policies based on the way they are classified and assigned metadata. European local government classification schemas, including the UK LGCS, as well as countless local US state classification models offer guidelines for the management of paper and electronic, structured and unstructured information records.
In summary, a single organization may be subject to several data-management regulations, each of which may:
relate to a different subset of the organization's data (e.g., financial records versus patient records),
require different data retention and security schemes (e.g., seven years of encrypted off-site archival storage versus indefinite storage on local fast media),
impose different roles and responsibilities on an organization's IT, compliance, and/or legal personnel (e.g., proactive and ongoing review versus regular annual reporting versus reactive on-demand review and reporting), and
require different outcomes (e.g., the production of a dedicated litigation archive file versus a compliance report).
It may prove difficult for an organization to comply with all of the myriad regulations related to document retention, particularly when many employees may have relevant documents stored under their control that are subject to regulation. Penalties for violation of data regulations can be steep, and executives and business managers want confidence that employees are taking appropriate steps to comply with the regulations. Employees may forget about requests to retain documents, or may not think that a particular document is relevant when others would disagree.
In order to comply with regulations, companies also need provisions for finding retained documents. Traditional search engines accept a search query from a user, and generate a list of search results. The user typically views one or two of the results and then discards the results. However, some queries are part of a longer-term, collaborative process. For example, when a company receives a legal discovery request or other type of compliance request, the company is often required to mine all of the company's data for documents responsive to the request. This typically involves queries of different bodies of documents lasting days or even years. Many people are often part of the query, such as company employees, law firm associates, and law firm partners. The search results must often be viewed by more than one of these people in a well-defined set of steps (i.e., a workflow). For example, company employees may provide documents to a law firm, and associates at the law firm may perform an initial reading of the documents to determine if the documents contain relevant information. The associates may flag documents with descriptive classifications such as “relevant” or “privileged.” Then, the flagged documents may go to a law firm partner who will review each of the results and ultimately respond to the compliance request with the set of documents that satisfies the request.
Some regulations might also require that a query be applied on an ongoing basis to new data as it is created within an organization. For example, when subject to a legal discovery request, a company might be required to continue to mine new emails as they are created to determine if they are responsive to the discovery request. In such a scenario, the company may need to then direct any new responsive emails into the workflow described above. As another example, an organization might be required to continuously monitor its officers' correspondence and documents for indications of insider trading.
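An ongoing query of this kind can be pictured as a stored query that is evaluated against each new document as it arrives, with matches routed into a review queue. The following is a minimal sketch only; the `StandingQuery` class, its simple keyword matching, and the "unreviewed" flag are hypothetical and stand in for whatever search and workflow facilities an organization actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class StandingQuery:
    keywords: set  # lowercase terms that make a document responsive
    review_queue: list = field(default_factory=list)

    def matches(self, text):
        # Naive whitespace tokenization; real systems would use a search index.
        words = set(text.lower().split())
        return bool(self.keywords & words)

    def on_new_document(self, doc_id, text):
        """Evaluate the stored query against a newly created document."""
        if self.matches(text):
            # Route the responsive document into the review workflow.
            self.review_queue.append({"id": doc_id, "flag": "unreviewed"})
            return True
        return False


query = StandingQuery(keywords={"merger", "acme"})
query.on_new_document("mail-1", "Notes on the Acme merger")
query.on_new_document("mail-2", "Lunch on Friday?")
print([item["id"] for item in query.review_queue])
```

Reviewers could later replace the "unreviewed" flag with classifications such as “relevant” or “privileged” as documents move through the workflow.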
Collaborative document management systems exist for allowing multiple users to participate in the creation and revision of content, such as documents. Many collaborative document management systems provide an intuitive user interface that acts as a gathering place for collaborative participants. For example, Microsoft SharePoint Server provides a web portal front end that allows collaborative participants to find shared content and to participate in the creation of new content and the revision of content created by others. In addition to directly modifying the content of a document, collaborative participants can add supplemental information, such as comments, to the document. Many collaborative document management systems also provide workflows for defining sets of steps to be completed by one or more collaborative participants. For example, a collaborative document management system may provide a set of templates for performing common tasks, and a collaborative participant may be guided through a wizard-like interface that asks interview-style questions for completing a particular workflow.
The foregoing examples of some existing problems with data storage, archiving, and restoration are intended to be illustrative and not exclusive. Other limitations will become apparent to those of skill in the art upon a reading of the Detailed Description below.