Many computer users, and particularly enterprises having complex data processing and warehousing needs, may have multiple data processing and storage systems, such as databases or DBMSs. These systems may be used to process or access databases or other data sources that are relevant to the business for an extended period of time, whether for active and current business process needs, or for regulatory and other document retention purposes. Accordingly, enterprises frequently maintain older “legacy” systems for some time in order to access and manipulate data, even when that data is used infrequently. Generally, costs and inconvenience are associated with the maintenance of these legacy systems, for example because of ongoing maintenance costs, or ongoing license fees when the legacy system, or the operating system it runs on, is licensed to the enterprise from a third party. In other situations, a legacy system may be unsupported by the third-party vendor, so there is significant risk to the enterprise of problems arising with the data system, even if dedicated support personnel are retained by the enterprise, which in itself increases costs.
In addition, servers dedicated to these legacy business applications involve costs on the hardware side, with hardware operating costs being incurred for a data system that may be used only rarely, if at all. Like the data application running on the servers, the hardware itself may be outdated and unsupported. However, the enterprise may still need the ability to access the underlying data accessed and served by the legacy system, even if simply for compliance or other legal reasons. Accordingly, the enterprise may be unwilling to bear the risk that the data may become effectively unavailable should the supporting hardware or software fail for some reason.
Accordingly, enterprises may wish to implement a system where data from a legacy system will be stored in a manner that doesn't require the legacy system. For example, the enterprise may implement an application decommissioning or application retirement project. In such implementations, the enterprise may convert existing data from a database, such as a relational database, into a data structure such as XML that can be archived and accessed without the need of a particular RDBMS or other application designed specifically to manage the archived data, but at the same time preserving the required information from the legacy system. If a centralized archive system, such as an XML database, is maintained by the enterprise, the legacy application can be decommissioned when the data served by the application is converted to XML and validated. If XML is used, the data can be stored in a self-describing fashion, that is, the data itself contains a description of the structure applicable to the data. In this way, the data will be generally available in the future on an application-neutral basis, without reliance on any existing application or platform technology.
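By way of illustration only, the conversion of relational data into a self-describing XML structure may be sketched as follows. The sketch assumes a SQLite source database and Python's standard library. The function name `table_to_xml` and the element names (`table`, `schema`, `column`, `rows`, `row`, `field`) are hypothetical choices made for this example and are not drawn from any particular archiving product.

```python
import sqlite3
import xml.etree.ElementTree as ET


def table_to_xml(conn, table):
    """Export every row of `table` as XML that carries its own column description.

    `table` is assumed to be a trusted identifier, since it is interpolated
    directly into the SQL text.
    """
    # PRAGMA table_info is SQLite-specific; each result row is
    # (cid, name, type, notnull, dflt_value, pk).
    columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()

    root = ET.Element("table", name=table)

    # The schema element is what makes the archive self-describing:
    # column names and declared types travel with the data.
    schema = ET.SubElement(root, "schema")
    for col in columns:
        ET.SubElement(schema, "column", name=col[1], type=col[2])

    names = [col[1] for col in columns]
    rows = ET.SubElement(root, "rows")
    for record in conn.execute(f'SELECT * FROM "{table}"'):
        row = ET.SubElement(rows, "row")
        for name, value in zip(names, record):
            field = ET.SubElement(row, "field", name=name)
            if value is None:
                # Mark SQL NULL explicitly so it is not confused with an empty string.
                field.set("null", "true")
            else:
                field.text = str(value)
    return root


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE invoices (id INTEGER, customer TEXT, total REAL)")
    conn.execute("INSERT INTO invoices VALUES (1, 'Acme', 99.5)")
    print(ET.tostring(table_to_xml(conn, "invoices"), encoding="unicode"))
```

Because the column names and types are written into the output alongside the rows, a future reader may interpret the archive without access to the originating RDBMS. A production conversion would additionally need to validate the output against the source (for example, by comparing row counts) before the legacy application is decommissioned.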
In the area of unstructured data, for example data stored in application files such as word processing or spreadsheet documents, or in email messages or user mailboxes, a similar archiving process may be followed in order to reduce the need for legacy applications. For example, the documents may be virtually “printed,” that is, converted to a standardized format like PDF or TIFF.
While enterprises have the ability to archive data in a way that can be expected to be available indefinitely, this archiving will frequently result in the archiving of all production data. That is, an enterprise may frequently decide, given the balancing of costs, risks, processing time, and regulatory compliance, to simply archive all data without making any assessment of whether archiving is necessary. This not only increases the costs associated with processing and storing the enterprise's archived data, but also greatly increases costs in the event the archived data should need to be retrieved, for example, in litigation or in response to a regulatory investigation. The scope of the original archiving will commonly be expected to result in a vast body of archived material which must be searched and processed. For many enterprises, it may not prove feasible to simply keep all production data, both structured and unstructured, indefinitely, particularly while the amount of data being processed, transmitted, and stored by the enterprise continues to grow.
While conversion to a standards-based format, such as XML, will likely make information available for a much longer time than archiving data in native form (i.e., in the format used directly by the legacy or other application), it can generally be expected that conversion to XML will result in some loss of fidelity from the original data, particularly in the case of unstructured data such as word processing documents, web pages, and emails. For unstructured data, a standards-based format such as PDF/A may increase the fidelity of the archived data to the original native data in comparison to XML, but the PDF/A format is not without limitations that will often cause at least some variation between archived data and the original native format data. Even if there is no loss of fidelity, or the data can be losslessly converted between the long-term archive format and the native format, once the data has been converted out of the native format, converting archived data back to the native format will generally add overhead and latency. This processing and delay occurs as the archiving system determines what the native application is for the data object, determines whether the archived content can be converted back to native format, and then, if possible, performs the conversion. If such conversion back to native format is impossible for any reason, the user may be faced with a frustrating situation in which even recently archived content is available only in a format with significant loss of fidelity, and in a format which cannot be manipulated by a native application still in use.
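The three steps described above (identifying the native application, determining whether conversion back is possible, and performing the conversion) may be sketched as follows. This is a minimal illustration under stated assumptions: the `ArchivedObject` type, the `restore_to_native` function, and the registry of converter functions are all hypothetical names introduced for this example, and no particular archiving system is implied.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical: a converter takes archived bytes and returns native-format bytes.
Converter = Callable[[bytes], bytes]


@dataclass
class ArchivedObject:
    object_id: str
    native_type: Optional[str]  # e.g. "docx"; None if not recorded at archive time
    archived_bytes: bytes


def restore_to_native(obj: ArchivedObject,
                      converters: Dict[str, Converter]) -> Optional[bytes]:
    """Return native-format bytes, or None when conversion back is not possible."""
    # Step 1: determine the native application for the data object.
    if obj.native_type is None:
        return None

    # Step 2: determine whether the archived content can be converted back.
    converter = converters.get(obj.native_type)
    if converter is None:
        return None

    # Step 3: perform the conversion. This is where the overhead and
    # latency discussed above are incurred on every retrieval.
    return converter(obj.archived_bytes)


if __name__ == "__main__":
    # Stand-in converter for demonstration only; a real one would parse the
    # archive format and re-emit the native file format.
    registry = {"txt": lambda data: data.upper()}
    doc = ArchivedObject("doc-1", "txt", b"hello")
    print(restore_to_native(doc, registry))
```

Each `None` return corresponds to the situation described above in which the user is left with only the archived rendition, regardless of how recently the content was archived.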
There is a need, therefore, for an improved method, article of manufacture, and apparatus for archiving data while limiting the loss of fidelity for data that is more likely to be accessed from archives, and while limiting system downtime and maximizing throughput.