There are a number of regulations that require that a variety of data records be available for retrieval, for a specified period of time, from non-modifiable, non-erasable archives. For example, Securities and Exchange Commission (SEC) Rule 17a-4 (i.e., 17 C.F.R. §240.17a-4) requires that certain stock exchange members, brokers, and dealers maintain certain records for a period of time (typically three or seven years). Rule 17a-4 (hereinafter “the Rule,” which may also encompass any other data permanence regulation) encompasses computerized records, such as e-mail and application documents (such as documents produced by Microsoft® Word® or Excel®). This data must therefore be archived for the period of time specified by the Rule in order to comply with the Rule.
Compliant storage is used to store the data required by the Rule. “Compliant” storage refers to data storage that complies with the Rule. “Non-compliant” storage refers to data storage that does not comply with the Rule. Compliance generally requires that the data be archived and that it cannot be deleted or modified until the end of the period for which it must be retained. The data must also be retrievable in a reasonable period of time.
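The retention requirement described above can be illustrated with a minimal sketch. The class and method names below (`CompliantStore`, `archive`, `retrieve`, `delete`, `RetentionError`) are hypothetical and are not taken from the Rule or from any product; the sketch models only the stated behavior, namely that an archived record cannot be modified at all and cannot be deleted until its retention period has ended.

```python
from datetime import datetime, timedelta


class RetentionError(Exception):
    """Raised when an operation would violate the retention requirement."""


class CompliantStore:
    """Hypothetical in-memory model of non-modifiable, non-erasable storage."""

    def __init__(self, retention: timedelta):
        self.retention = retention
        self._records = {}  # record_id -> (payload, archived_at)

    def archive(self, record_id: str, payload: bytes, now: datetime) -> None:
        # An archived record may never be overwritten (non-modifiable).
        if record_id in self._records:
            raise RetentionError(f"{record_id} is already archived")
        self._records[record_id] = (payload, now)

    def retrieve(self, record_id: str) -> bytes:
        # Retrieval is always permitted.
        return self._records[record_id][0]

    def delete(self, record_id: str, now: datetime) -> None:
        # Deletion is refused until the retention period has elapsed.
        _, archived_at = self._records[record_id]
        if now < archived_at + self.retention:
            raise RetentionError(f"{record_id} is still under retention")
        del self._records[record_id]
```

In this sketch the caller supplies the current time explicitly, which keeps the example deterministic; a real system would presumably rely on a trusted clock and on media that physically enforce the same constraints.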
There are generally two types of computer records: those that are static and those that are frequently modified. E-mail is an example of a static computer record. An e-mail sender composes and distributes an e-mail to one or more e-mail recipients. The recipients can either retain or delete the original e-mail, but cannot modify it. One or more recipients may reply to the e-mail, but the reply constitutes a discrete new record. Microsoft® Word® or Excel® documents are examples of frequently modified data. A user may work on the same document over a period of time and may choose to rewrite entire sections of the document. Over the course of the document's existence, the user may create hundreds or thousands of unique versions of that file that can be printed, viewed, or analyzed. The primary difference between static and modifiable documents is the notion of publication. An e-mail recipient has received a published record, one that has been distributed in a completed form. A Microsoft® Word® document, however, does not undergo such a publication event. Therefore, since a large percentage of computer data can easily be modified, several different versions of each document may need to be archived in order to comply with the Rule. Compliant storage therefore generates a copy of each document at a predetermined “reasonable” interval, for example, once a day. Compliance also requires that no copy can be deleted before the expiration of the period for which the copy must be maintained. Typical compliant storage includes optical and magnetic (tape) media.
FIG. 1 illustrates compliant storage of e-mail data. A system 100 includes an exchange server 102, an application 104, and compliant storage 106. The exchange server 102 includes a database 108 containing e-mail data. The application 104 extracts the e-mail data from the database 108 and stores the data on the compliant storage 106. The application may be software produced by Legato Systems, KVS, etc. The compliant storage 106 typically includes optical or tape media, and is stored for the period of time required by the Rule.
Databases are considered structured data, and e-mail data is considered semi-structured. Databases and other structured data can easily be stored using the application 104 and the compliant storage 106. The system 100 searches for new e-mail messages and archives them. Since the e-mail database is semi-structured, and the e-mail data can easily be organized by date of creation, the application 104 can easily determine the changes made since the last archive was created. Unstructured data, such as application files including word processing or spreadsheet files, cannot be archived using the system 100, because the application 104 is unable to determine what changes have been made to the documents. As a result, unstructured data is typically archived in a compliant manner by performing full system backups to the compliant storage 106 on a regular basis.
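The way creation dates let an archiving application find what has changed since the last archive can be sketched as follows. This is an illustration only, not the behavior of any particular archiving product; the `Message` type and the `select_new_messages` function are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Message:
    """Hypothetical e-mail record; frozen because e-mail is static once sent."""
    message_id: str
    created_at: datetime
    body: str


def select_new_messages(messages, last_archived_at):
    """Return the messages created after the previous archive run, oldest first."""
    new = [m for m in messages if m.created_at > last_archived_at]
    return sorted(new, key=lambda m: m.created_at)
```

Because each message carries a creation date and never changes afterward, a single timestamp comparison is enough to identify the new records. An unstructured document that is edited in place offers no equivalent marker for which of its contents changed, which is why the same approach does not carry over to word processing or spreadsheet files.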
Performing and maintaining frequent full system backups for several years is both resource intensive and expensive. A storage server that requires archival may include tens of terabytes (TB) of data. If compliant backups are performed daily, then, for a three-year retention period, more than one thousand multi-TB backups would have to be maintained concurrently to comply with the Rule. These backups consume thousands of discrete pieces of media, all of which must be tracked, maintained, and kept available for retrieval.
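The scale described above can be checked with simple arithmetic. The sketch below assumes a 365-day year and one full backup per day; the function names are hypothetical, and the 10 TB backup size in the usage note is only an illustrative stand-in for “tens of terabytes.”

```python
def concurrent_backups(retention_years: int, backups_per_day: int = 1) -> int:
    """Number of full backups that must be held at once (assumes 365-day years)."""
    return retention_years * 365 * backups_per_day


def total_terabytes(retention_years: int, backup_size_tb: int,
                    backups_per_day: int = 1) -> int:
    """Total retained capacity, in TB, if every backup is a full copy."""
    return concurrent_backups(retention_years, backups_per_day) * backup_size_tb
```

Under these assumptions, `concurrent_backups(3)` gives 1095 daily backups for a three-year retention period, and `total_terabytes(3, 10)` gives 10950 TB of retained media for a hypothetical 10 TB server.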
Furthermore, during an investigation, the SEC may require the stock exchange member, broker, or dealer under investigation to produce the records made on a specific date. System administrators managing the backups must search through the compliant media to locate the requisite data. The scale of this search makes it difficult to fulfill the requirement that the data be produced within a short period of time.
What is needed is a method and apparatus for easily storing and retrieving archival documents to comply with the Rule.