In order to clarify the scope of the present invention, it is first useful to distinguish between the terms “data archiving” and “data preservation” as used in this application. Conventional approaches to digital data archiving, also termed digital data storage, use a variety of storage media such as magnetic tape or disk and optical tape or disk media, and may employ techniques such as periodic tape backup, redundant disk storage, and the like. Use of these storage media and techniques provides some level of assurance that a digital data file can be reliably retrieved for at least a few years after it is initially created and stored. In contrast to digital data archiving, digital data preservation is a relatively new concept. Only recently has it become apparent that there is considerable need for workable solutions that allow long-term retention of digital data for periods exceeding those provided by established data archiving methods. Conventional data storage and archiving systems provide limited term solutions that allow reliable retrieval of backed-up digital data for a period of approximately 5–10 years. Data preservation systems, on the other hand, must provide solutions that not only allow retrieval of digital data for much longer periods, but also are capable of allowing usability of the data for periods extending decades or even hundreds of years into the future. This life-span is conditioned in large part by the projected life-span of preservation media, expected to last for hundreds of years when stored under suitable conditions.
In contrast with digital data archiving, digital data preservation offers a number of added advantages. For example, in order to be readable and usable years hence, archived digital data requires some type of migration, such as from one media type to another or from an earlier data format to a later data format. Without migration of some kind, archived data, over time, gradually becomes unreadable and therefore loses its value. In stages, the archived data first becomes unusable, as data formats and application software are revised or replaced. Then, as reading and processing hardware become obsolete, the archived data simply becomes unrecoverable. The task of maintaining archived data through migration can be daunting, requiring, over a period of years, that the archived data be translated from one data format to another or transferred from one storage medium to another. With repeated migration operations, there is increased likelihood of error and of loss of interpretable data. According to some industry estimates, as much as 5% of stored data can be lost during a typical migration operation. Thus, maintaining archived digital data for long periods of time may be costly and labor-intensive and can involve risk of data loss.
In contrast to such well-known difficulties with digital data archiving, digital data preservation would allow digital data to be retrievable in a readable state for many years. Ideally, digital data preservation would eliminate, or at least alleviate, any need for data migration and its concomitant costs and risks. Users of digital data preservation systems would thus enjoy the benefits of minimal risk for data loss or obsolescence, even in the event of severe infrastructure disruption.
Digitally created documents, created using some sort of logic processor and maintained in file form, are often shared among multiple users in digital form, some only rarely being written to paper. Typically, digitally created documents are stored and transferred as files in open data formats, such as TIFF, HTML, JPEG, XML, or .txt, for example. By design, some of these open data formats can be routinely interpreted by software running on a number of different computer platforms. Alternately, other common data formats are designed to be proprietary, interpretable only using specific application software. A goal of digital preservation is to retain the usability and original intention of the data without requiring migration of data format or of data storage mechanisms, allowing files to be certifiably unaltered in their interpreted form, able to be used for purposes such as legal evidence, for example.
In order to have preserved records considered as “certifiably unalterable”, so that, for example, such records could even be considered as legal evidence, a preservation system would need to provide a “Write-Once/Read-Many-Times/Erase-Once” function. Write-Once capability would disallow alteration of preserved data and unauthorized addition of records to preservation media. Read-Many-Times capability would allow retrieval of preserved data from the media with consistent accuracy. Erase-Once capability would assure complete expungement of specific data records as needed.
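By way of illustration only, the “Write-Once/Read-Many-Times/Erase-Once” behavior described above can be sketched as a small in-memory model. The class name WriteOnceStore, its method names, and the choice of exceptions are hypothetical and are not part of any disclosed system; authorization checks and the physical characteristics of a preservation medium are not modeled.

```python
class WriteOnceStore:
    """Toy model of Write-Once/Read-Many-Times/Erase-Once semantics.

    Hypothetical sketch: real preservation media enforce these rules
    physically, not in software.
    """

    def __init__(self):
        self._records = {}    # record_id -> immutable bytes
        self._erased = set()  # ids of records that have been expunged

    def write(self, record_id, data):
        # Write-Once: an existing or expunged record can never be rewritten.
        if record_id in self._records or record_id in self._erased:
            raise PermissionError(f"record {record_id!r} cannot be altered")
        self._records[record_id] = bytes(data)

    def read(self, record_id):
        # Read-Many-Times: reads never change the stored record.
        if record_id in self._erased:
            raise KeyError(f"record {record_id!r} has been expunged")
        return self._records[record_id]

    def erase(self, record_id):
        # Erase-Once: a record can be expunged exactly one time.
        if record_id in self._erased:
            raise PermissionError(f"record {record_id!r} already expunged")
        del self._records[record_id]  # raises KeyError if never written
        self._erased.add(record_id)
```

In this sketch, a second write to the same identifier is rejected rather than silently ignored, so that any attempted alteration is visible to the caller.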
Current archiving methods for digital data, allowing access to data only in digital format, have a number of shortcomings. Among problems well known by those skilled in the data archiving arts are aging of equipment, limitations in the useful life of magnetic and optical storage media, and inevitable obsolescence of data formats, particularly where data formats are closely associated with specific hardware or with specific versions of operating systems or programming languages.
Long term preservation of digital data requires both that the original data be faithfully preserved and that this data can be interpreted in some form at any time in the future. This requirement means that the organization that stores the digital data can provide, at some future time, access not only to screen displays, printouts, and other system output, but also to the original data used to generate such output. To achieve this goal, methods for retrieving preserved digital data must be, insofar as is possible, independent of specific equipment. While there may have been various attempts at developing universally accepted data formats for different types of files, few standards have been developed or are likely to be adopted.
Human-readability has not been considered as a meaningful or useful characteristic for data preservation. However, the encoding of data in human-readable form may provide advantages that have been overlooked in any scheme for data encoding and archival. For example, there are baseline advantages for verifying authenticity of a document encoded in human-readable form, and thus for irrefutably validating the fidelity of the document to its original source. Future users of a document would then be assured that a preserved version would be a valid and true copy of an original document.
FIG. 1 illustrates the conventional approach to digital data archiving. Digital data is processed by a central processing unit (CPU) 200 running some type of operating system 204. An application 202, using utilities available from operating system 204, provides digital data output in some binary, machine-readable form. This digital data output is usable only by the originating application 202, or by another software application compatible with application 202. The digital data output has value only when interpreted and presented by application 202 in some form, such as that of a static display of text or images, interactive calculation, web page with dynamic links, or multimedia presentation, for example. In the conventional model of FIG. 1, a binary storage hardware apparatus 206 stores the digital data output from application 202 onto binary storage media 208, such as magnetic tape, disk, or optical disk. With the arrangement of FIG. 1, the archived data is in an application-dependent form and therefore becomes unusable if the originating application 202 or operating system 204 becomes obsolete. Archived data also becomes unusable as binary storage media 208 degrades over time.
Technology development, by which early systems and software become obsolete, replaced by increasingly more capable tools, is also an important factor for consideration with respect to a digital data preservation system. Anticipated developments in data networking technology, in data interface methods, and in imaging technologies for storage and retrieval are likely to bring about corresponding changes in system hardware, with various components of a system becoming obsolete over time. Inherent to the design of a digital data preservation system solution must be a clear-cut strategy for allowing continuous upgrade, component by component, without jeopardizing the integrity of the preserved digital data.
Analog preservation media, such as microfilm, have been widely used for long-term retention of documents, drawings, and flat ASCII files, where data is encoded visually as black and white images. Among proven benefits of such media are long lifetimes, capability for very high resolution, and inherent human readability. These analog preservation media have traditionally been used in systems employing optical cameras for recording and storing analog data, typically images of documents. With the growing need for retention of computer data, these analog media have also been employed in digital document archiving systems, such as the Document Archive Writer, Model 4800, manufactured by Eastman Kodak Company, Rochester, N.Y. Other Computer-Output-Microfilm (COM) recording systems have used similar analog media for long-term retention of processed and displayed data, in printout form. It is significant to note that existing systems use these types of analog preservation media solely for storing black and white images of documents that may be output by a typical application 202 (FIG. 1). Storage of digital data from application 202 is performed using conventional, magnetic or optical binary storage media 208.
A digital data file for preservation by a digital preservation system can originate from any of a number of sources and could comprise any of a number of types of data. As just a few examples, digital data files could be generated from scanned documents or scanned images, where the original source for the data was prepared or handled manually. Digital data files may comprise encodings of bitonal images, grayscale images, or even color images, such as the halftone separations used in color printing. Digital data files could be computer-generated files, such as spreadsheets, CAD drawings, forms created on-line, Web pages, or computer-generated artwork. Interactive and sensory stimuli such as sound and animation can also be stored as digital data files. Digital data files might even contain computer software, in source code or binary code format. In summary, there can be a need for long-range preservation of any type of digital data file, whether the actual file content is meaningful to an observer, such as when the file contains a document of some kind, or to a computer, such as when the file consists only of encoded computer program instructions.
Preservation of a digital data file typically requires that the data file be packaged in some standard fashion, so that at least some amount of metadata, that is, data about the file itself, can be stored with the data. For example, metadata associated with a CAD file might identify the originating software and revision, date of creation and revision of the data, designer name, departmental and project-related identifiers, delivery or completion date, workflow listing, access permission levels, and the like. Metadata content can include not only basic information such as file ID and look-up information, but also information that optimizes subsequent data retrieval and interpretation, such as image quality metrics and media/writer characteristics.
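As a minimal sketch of such packaging, the following example stores a metadata header together with the file data in a single package and recovers both on retrieval. The field names, the JSON header encoding, and the four-byte length prefix are assumptions chosen for illustration; they do not represent a standard packaging format or the format of any particular preservation system.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class FileMetadata:
    # Hypothetical fields, modeled on the CAD-file example in the text.
    file_id: str
    originating_software: str
    software_revision: str
    created: str
    designer: str
    project_id: str
    access_level: str


def package(metadata: FileMetadata, payload: bytes) -> bytes:
    """Prefix the payload with a length-delimited JSON metadata header."""
    header = json.dumps(asdict(metadata), sort_keys=True).encode("utf-8")
    return len(header).to_bytes(4, "big") + header + payload


def unpackage(blob: bytes):
    """Recover the metadata and the untouched payload from a package."""
    header_len = int.from_bytes(blob[:4], "big")
    header = json.loads(blob[4:4 + header_len].decode("utf-8"))
    return FileMetadata(**header), blob[4 + header_len:]
```

Keeping the header in a plain-text encoding means that the metadata remains legible even if the payload format itself can no longer be interpreted.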
The likely obsolescence of specific data formats over time confounds the problem of data preservation. Depending upon the type of data source and upon factors such as the specific nature of a data file, many data formats can be expected to fade from use, thereby jeopardizing possible recall of data content at some future time. A number of organizations have already encountered this problem, acknowledging that sizable amounts of stored data have become very costly or even impossible to retrieve reliably.
Meanwhile, there have been some promising solutions proposed for providing data in a form that will continue to be readable in the future. One method intended to achieve this goal is the extensible markup language (XML) initiative. XML, document type definition (DTD), and XML Schema constructs provide a degree of self-definition, inherently open structure, and computer platform portability and provide tools for data formatting by which definitions of data components can themselves be stored as metadata associated with a data file. However, there has been no attempt thus far to provide solutions using extensible markup languages and techniques that support long-term preservation and retrieval of data.
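The degree of self-definition noted above can be illustrated with a short sketch in which the definitions of the data components travel in the same XML record as the data. The element names (record, definitions, field, data), the sample values, and the interpret function are hypothetical; the sketch uses only the Python standard library and performs no DTD or XML Schema validation.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the <definitions> block describes each data element.
RECORD = """<record>
  <definitions>
    <field name="temp" type="float" unit="celsius"/>
    <field name="site" type="str" unit=""/>
  </definitions>
  <data>
    <temp>21.5</temp>
    <site>Rochester</site>
  </data>
</record>"""

# Map the type names used in the record to Python conversions.
CASTS = {"float": float, "int": int, "str": str}


def interpret(xml_text):
    """Decode each data element using the definition stored alongside it."""
    root = ET.fromstring(xml_text)
    definitions = {f.get("name"): f for f in root.find("definitions")}
    result = {}
    for elem in root.find("data"):
        definition = definitions[elem.tag]
        value = CASTS[definition.get("type")](elem.text)
        result[elem.tag] = (value, definition.get("unit"))
    return result
```

Because the definitions are stored with the data, a future reader needs only a generic XML parser, rather than the originating application, to recover typed values from the record.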
There have been methods disclosed for storing documents in a machine-readable format that is perceptible to a human observer. PCT application WO 00/28726 (Smith, Leonhardt, Frary) discloses storage of a two-dimensional document on a laser-writeable optical storage medium, wherein an image of the document is written onto the media along with the binary data representing the digital record. However, the solution disclosed in application WO 00/28726 is limited to storage of document data, which is merely a subset of the complete set of data types that may need to be preserved. A significant drawback of the PCT application WO 00/28726 system is that it employs a conventional optical storage medium, namely optical disk or tape written using a laser, thus limiting the lifetime of stored data. Furthermore, the Write-Many-Times characteristic of the system disclosed in PCT application WO 00/28726 makes the system unsuitable for preserving data records that are certifiably unaltered over time. Data written using the system disclosed in PCT application WO 00/28726 may be marginally “human-perceptible” in the sense that the visible effects of marking the optical medium under varying laser intensities could be perceived and interpreted by a human observer trained to interpret the resultant markings as binary 1s and 0s. However, this encoding method is inefficient in providing truly “human-readable” data that would be directly readable using a scanner or could even be read from the media by a human observer. Without intervening hardware, with its attendant system dependencies, the binary data stored on the optical medium as disclosed in PCT application WO 00/28726 would be extremely difficult to obtain.
Copending, commonly-assigned U.S. patent application Ser. No. 09/703,059, filed Oct. 31, 2000, entitled “A Method and Apparatus for Long Term Document Preservation” discloses long term preservation methods for document data stored in virtual folders, utilizing an analog medium such as film. As with other solutions, this system does not provide the full set of possible preservation functions for a digital file. Significantly, the method noted in application Ser. No. 09/703,059 is limited to preserving the image of the document only, with no attempt to preserve either the digitally created document data itself or the metadata associated with the document in human-readable form.
The above-mentioned solutions, focusing more narrowly on saving documents and images for a time, have provided only “single point” solutions that are not adequate for addressing the larger data preservation problem. Documents themselves make up only a small subset of digital data that must be preserved. Typical forms of digital data other than documents that may require preservation include grayscale and color pictures and diagnostic images; spreadsheet data; satellite data and other instrumentation readings; audio, video and multimedia presentation data; software; HTML content; and database records, for example. It can be appreciated that preservation and retrieval of this broader base of digital data types requires alternate approaches beyond what may be needed for document preservation. For example, with digital data in this broader category, there may be a greater need for retention and retrieval of other underlying, related data, such as source data associated with or used to generate some part of an image or document.
Users of conventional systems for archival of documents and images on microfilm are familiar with the level of image quality obtained from such systems, based on long-term experience with optical recording methods. Even with the advent of digital archival writers, the basic model established with earlier optical recording methods has substantially been maintained. For example, with respect to overall image quality for the archived document or image, there are few options available with conventional monochrome image archival systems. Hence, there would be no need for viewing the results of an archival operation when a document is initially stored. However, the capability for storing color image data, encoded in a monochrome medium, provides a new model for document and image archival and preservation and makes some options available to users of archival and preservation systems.
It is well known that a document, when output on different printing apparatus using different print driver software, not only has obvious differences due to characteristics such as printer resolution and media response, but can also be formatted differently. To give an approximate idea of the final appearance of printed output, manufacturers of various software packages often provide a print preview function. Using print preview, a user can get a good idea of the final appearance of a document or image from a “soft copy” displayed on the computer screen. Gross differences in pagination, font use, and other characteristics of the final output are faithfully represented, allowing a user to verify that an output print will have the intended appearance. This type of preview function, however, has not been made available for archival or preservation systems. Nor has it been possible for a user of an archival system to view and select from possible options for image quality characteristics when a document or image is stored or retrieved. This has been due, in large part, to the processing time and cost required in order to show the results of an archival or preservation operation.
While independent archival services exist, the problem of digital archival has largely been a problem to be solved by the company, governmental unit, or other organization needing such a service. It is widely held that outsourcing archival services can be beneficial, lowering the actual cost of such service and improving the overall quality of the archival operation. When archival and preservation of data must be performed by a company, governmental unit, or other organization, the process of archival and preservation remains closely bound to the information content itself. From a business perspective, it can be beneficial to effectively separate ownership of the archival and preservation process from ownership of the information content, thereby allowing an independent vendor to provide archival services to any number of client organizations, while still reserving control and approval of the content to these client organizations.
A difficulty faced by vendors of digital archival and preservation services relates to operational cost in handling each document to be archived or preserved and in performing rigorous quality checks. It would be time-consuming and costly to provide customers with representative images of archived documents for their approval. It is well appreciated that tools for automating customer approval cycles and quality audits help to reduce cost and improve overall operating efficiency and long-term customer satisfaction for document preservation services.
A related difficulty is presented by the cost of sales. Typically, batch processing provides the most economical arrangement for document archival or preservation, with images on rolls of photosensitive media that require chemical processing following exposure. Given conventional workflow constraints, the task of processing and providing demonstration samples to prospective customers interrupts the cost-efficient workflow used for day-to-day operation. Thus, it can be appreciated that there is value in facilitating the process for demonstrating digital preservation system capabilities to prospective customers.
Thus, it can be seen that there is a demand for a digital data preservation system having a preview function that enables a user to view the results of a preservation operation for assessment and approval.