1. Field of Invention
The present invention relates generally to the field of archiving digital information. More specifically, the present invention relates to creating and storing a model of a universal virtual computer that enables recovery of long-term archived digital information.
2. Discussion of Prior Art
The report of the Task Force on Archiving of Digital Information, commissioned by the Commission on Preservation and Access and the Research Libraries Group, states: "The digital information is still relatively uncultivated at this stage; but the need is urgent, the time is opportune and the conditions are fertile for a strong, far-sighted set of actions to plant the appropriate seeds to help ensure that the digital record ultimately matures and flourishes." The same opinion is also voiced by the industrial sector, which sees more and more of its vital data generated and stored in digital form.
There is currently a very limited amount of related activity in the computer science community. This is probably due to the inherent long-term aspect of the problem when so many short-term issues may offer a more rapid pay-off.
The following describes some of the technical challenges and prior art solutions.
The problem that libraries are facing today is well known. For centuries, paper has been used as the medium of choice for storing text and images. As shown in FIG. 1, a "paper" document has the advantages of: being a physical object with permanency, remaining readable with a slow degradation rate, remaining understandable (i.e., its structure is known), and being readily available to the reader.
Today, some of the archived objects (books, newspapers, pictures, etc.) are in danger of destruction. What should be done to protect their contents? They could essentially be copied (on paper or microfilm) or digitized. Digitization through a digital camera or a scanner replaces the image by a bit stream. This offers many advantages. First, the object can be copied repeatedly without degradation; its contents can be sent remotely and can be accessed at will. Finally, the physical space needed to store the object becomes smaller and smaller as storage density increases.
Another argument for digitization is that a high percentage of the data to be preserved is, today, generated directly in digital form. Musical CDs or DVD movies are obvious examples. But the same is true of many engineering designs which were described as blueprints in the past and now exist as digital information in a Computer-Aided-Design system with multimedia, relational database, and virtual reality. And what about all the electronically sent messages that have replaced the memos and letters?
FIG. 2 illustrates an electronic conversion 213 of existing paper text 202 and images 204 (e.g. books 200) and recorded media comprising sound 208 (e.g. records) and/or video 210 (e.g. films) to digital data 216. In addition to converted physical or analog sources, data is created directly by electronic processes 214, such as e-mail, word processors, digital cameras, etc.
In the future, the volume of the digital information will increase exponentially and dwarf the volume of the existing paper information. Thus, it makes sense to digitize what needs to be saved of the past, and concentrate on the single problem of preserving digital information for posterity.
FIG. 3 illustrates some of the problems with the storage of information as digital data. A particular storage medium 300, such as a disk, will have a limited physical lifetime. At a later time in the future it is unknown if a machine reader 302 will still be compatible or if the data bit string 304 will remain readable. As technology changes, no guarantees exist for a proper interpretation of bit strings to produce the information they originally represented 306. FIG. 4 illustrates the steps needed to decode the data.
Suppose we use a computer (identified as M2000) to create and manipulate digital information today. For the purpose of archiving the data for preservation, the digital information is stored on a removable medium, say D2000 (most probably some kind of disk). Suppose that, in 2100, somebody (the client) wants to access the data saved today. What mechanism should exist to be able to satisfy the request?
Four conditions must be met:
1. The particular D2000 disk must be found.
2. D2000 must be physically intact.
3. A machine must be available to read the raw contents (bit stream) of D2000.
4. The bit stream must be correctly interpreted.
Condition 1: this is not a new problem; any digital object must be "published" under a certain name, catalogued, and stored in a safe place; some attributes may also be stored, such as date, author, title, etc. All this is not different from the data maintained by current libraries.
Condition 2: some researchers predict very long lifetimes for certain types of media, but others are much less optimistic. Anyway, if a medium is good for N years, what about preservation for N+1 years? Whatever N is, the problem does not go away. There really seems to be only one solution to this problem: to copy the information periodically to rejuvenate the medium.
Condition 3: machines that are technologically obsolete are hard to keep in working order for a long time. Actually, this condition is more stringent than the previous one. Here also, rejuvenation is needed, moving the information onto the new medium that can be read by the latest generation of devices. Thus, conditions 2 and 3 go hand-in-hand. It must be noted that rejuvenation is not simply an overhead for preservation; it also allows for using the latest storage technology.
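The rejuvenation step described above can be sketched as a copy followed by a verification of the copied bit stream. This is only an illustrative sketch: the function name `rejuvenate`, the choice of SHA-256 as the check, the name `d2020`, and the use of in-memory buffers to stand in for the old and new media are all assumptions and do not come from the disclosure.

```python
import hashlib
import io


def rejuvenate(old_medium: io.BytesIO, new_medium: io.BytesIO,
               chunk_size: int = 4096) -> str:
    """Copy the bit stream from old_medium to new_medium and verify the copy.

    Returns the SHA-256 hex digest of the copied bit stream.
    Raises IOError if the copy does not match the original.
    """
    old_medium.seek(0)
    source_digest = hashlib.sha256()
    while True:
        chunk = old_medium.read(chunk_size)
        if not chunk:
            break
        source_digest.update(chunk)
        new_medium.write(chunk)

    # Re-read the new medium to confirm the copy is bit-for-bit identical.
    copy_digest = hashlib.sha256(new_medium.getvalue())
    if copy_digest.digest() != source_digest.digest():
        raise IOError("rejuvenated copy differs from the original bit stream")
    return copy_digest.hexdigest()


# Hypothetical media: d2000 is the old disk, d2020 a newer medium.
d2000 = io.BytesIO(b"archived bit stream")
d2020 = io.BytesIO()
digest = rejuvenate(d2000, d2020)
```

Repeating such a copy each time a new generation of storage devices appears addresses conditions 2 and 3 together, since the bit stream always resides on a medium that current machines can read.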
The three conditions above ensure that a bit stream saved today will be readable, as a bit stream, in the future. But there still remains one additional condition.
Condition 4: one must be able to decode the bit stream to recover the information in all its meaning. This is quite a challenging problem.
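A small illustration of why condition 4 is separate from the first three: the same intact, readable bit stream yields different text depending on the decoding rules applied. The byte values below are an assumed example, not data from the disclosure; `cp037` and `latin-1` are standard Python codec names for an EBCDIC code page and ISO 8859-1.

```python
# The same five bytes, read under two different decoding rules.
bit_stream = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6])

as_ebcdic = bit_stream.decode("cp037")    # EBCDIC, code page 037
as_latin1 = bit_stream.decode("latin-1")  # ISO 8859-1

print(as_ebcdic)   # HELLO
print(as_latin1)   # five accented letters, not the intended word
```

Without a record of which rule was used when the data was written, a future reader has no reliable way to choose between the two interpretations.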
Digital objects can vary greatly in complexity. A digital object generally corresponds to what we designate as a file today. It contains either data or an executable program. We identify the following three types:
Type 1. A data object may be readily understandable by a human reader, or it may have to be decoded in some way by the reader or by a machine (assuming one knows the decoding rules). In the latter case, a program must be written in 2100 to decode the data, based on the stored description. A text in ASCII, an image, a digital video clip, a table with ASCII fields, are all examples of simple data objects.
Type 2. If the encoding of the data becomes more complex (example: an image compressed by a JPEG algorithm), the best way to describe the algorithm is to store with the data a program that can be used to decode the data.
Type 3. Going a step further, we may also be interested in archiving a computer program or system for its own sake. In this case, it is the ability to run that program that must be preserved by the archiving mechanism. Not only must the bit stream that constitutes the program be archived, but we must also make sure that the code can be executed at restore time. If you want to preserve the look and feel of Windows 95 or the Mac, or the user interface of a Computer Aided Design system, the only solution is to archive the whole body of code used during the execution, and enough information on how to run the code at restore time.
Below, we lump together types 1 and 2 under the heading of data archiving: this is because the same technique applies to both types. Type 3 is referred to as program archiving.
In Avoiding Technological Quicksand: Finding a Viable Technical Foundation for Digital Preservation, a report to the Council on Library and Information Resources (January 1999), J. Rothenberg sketched out an overall system organization based on encapsulating everything needed to decode the information when needed.
In summary, he proposes to store in an encapsulated object 500:
A. a description of the alphabet used to store text 502;
B. a mostly textual description of the metadata 503 (the semantics of the stored data);
C. the data as a bit stream 505;
D. the program, also as a bit stream, that was used to store and manipulate the data (this program runs on M2000), including, if needed, the operating system and other necessary components 504;
E. the detailed description of the M2000 architecture 504.
In 2100, the client will have to read the metadata B to understand the meaning of the archived information and to know how to run the program D. However, before being able to run D, an M2000 emulator for the M2100 machine will have to be written, based on the description E of the M2000 architecture.
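The encapsulated object and the restore sequence just described can be pictured as a simple record plus an ordered list of steps. This is only a sketch for orientation: the class name, field names, and the `restore_steps` helper are hypothetical and are not taken from Rothenberg's report.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EncapsulatedObject:
    alphabet_description: str      # A. alphabet used to store text
    metadata: str                  # B. mostly textual description of the data
    data: bytes                    # C. the data as a bit stream
    program: bytes                 # D. program (plus OS, if needed) for M2000
    architecture_description: str  # E. description of the M2000 architecture


def restore_steps(obj: EncapsulatedObject) -> List[str]:
    """List, in order, what the client must do in 2100 (hypothetical helper)."""
    return [
        "read metadata B to learn the meaning of the data and how to run D",
        "write an M2000 emulator for M2100 from architecture description E",
        f"run program D ({len(obj.program)} bytes) on the emulator "
        f"against data C ({len(obj.data)} bytes)",
    ]


example = EncapsulatedObject(
    alphabet_description="ASCII",
    metadata="A memo: title followed by year.",
    data=b"Annual memo 2000",
    program=b"\x00\x01\x02",  # placeholder bytes, not a real executable
    architecture_description="(description of M2000 would go here)",
)
steps = restore_steps(example)
```

The second step is the costly one: nothing can be restored until an emulator has been written from description E.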
Although we subscribe to the overall idea of encapsulation, we identify three drawbacks of its proposed embodiment:
a. The emphasis on archiving the original executable bit stream of the application program that created or displayed the document (including the operating system). This may be justifiable for program archiving but is mostly overkill for data archiving. In order to archive a collection of pictures, is it necessary to save the full system that enables the original user to create, modify, and retouch pictures when only the final result is of interest for posterity? If Lotus Notes® is used to send an e-mail message in the year 2000, is it necessary to save the whole Lotus Notes environment and reactivate it in 2100 in order to restore the note contents? But there may be an even worse drawback: the system may display the data that it manages but not necessarily have an export facility. In that case, it would be impossible to get the data out of the old system and into a new one. Actually, what is needed is a program that knows how to get the data of an object, maybe with the needed formatting information, so that it can be transferred to a newer system (a kind of generalized export facility).
b. The need for writing an emulator of an M2000 machine in 2100. First, this is a very complex operation. Second, it has to be done in 2100 for all possible pairs of machines <M2000, M2100>. Third, it can be done only if the description of the M2000 architecture is perfect and complete. But even then, how do we know the emulator works correctly, since no M2000 machine exists for comparison?
c. The absence of a model for the metadata. Using a textual description of what the data mean and how they are organized requires that the metadata be read before a program may be written to decode the data.
The present invention recognizes that, if the metadata follows a specific model, a general purpose program can query the metadata and automatically decode the data according to the information found in the metadata. In other words, it becomes possible to browse through the data without having to develop a specific program for each data type.
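To illustrate the idea of metadata that follows a specific model, the sketch below uses a deliberately simple, assumed model in which each field is described by a name, an offset, a length, and a type. One general-purpose function can then decode any record described this way. The model, the field names, and the sample data are hypothetical; the disclosure does not prescribe this particular layout.

```python
from typing import Dict, List, Union

# Hypothetical metadata model: one descriptor per field.
FieldDescriptor = Dict[str, Union[str, int]]


def decode_record(data: bytes,
                  metadata: List[FieldDescriptor]) -> Dict[str, Union[str, int]]:
    """Decode fixed-width ASCII fields using only what the metadata states."""
    record: Dict[str, Union[str, int]] = {}
    for field in metadata:
        start = int(field["offset"])
        end = start + int(field["length"])
        text = data[start:end].decode("ascii").strip()
        # The declared type selects the conversion; no per-format code is needed.
        record[str(field["name"])] = int(text) if field["type"] == "int" else text
    return record


metadata = [
    {"name": "title", "offset": 0, "length": 12, "type": "text"},
    {"name": "year", "offset": 12, "length": 4, "type": "int"},
]
data = b"Annual memo 2000"

record = decode_record(data, metadata)
print(record)  # {'title': 'Annual memo', 'year': 2000}
```

Because `decode_record` consults only the metadata, a different data type needs a different set of descriptors rather than a newly written program.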
Other prior art includes:
Gilheany (INSPEC: "Preserving Information Forever and a Call for Emulators", Records Management Bulletin, no. 88, pp. 23-31, October 1998) discusses the need for preserving information forever. Long-term preservation must be able to preserve metadata as well as data and use emulators to permanently preserve the essence of the machines that execute the algorithms that convert abstract data into viewable images. The emulators must reproduce chronologically accurate images printed from common word processing programs.
Giguere (INTERNET: "Automating Electronic Records Management in a Transactional Environment: The Philadelphia Story", http://www.asis.org/Bulletin/Jun-97, 6/97) discloses the need for records management for the long-term archiving of electronic records. One approach requires that certain information be preserved with electronic files to make them meaningful, creating a self-contained, self-sufficient electronic record packaged into a uniform electronic record data structure. The contextual-information-binding RDR record encapsulation approach will gather the required contextual information from a variety of locations (e.g., operating system, application/platform interface, specifically coded system "traps"), reformat this information into a standardized data structure and create an electronic record.
The patent to Chan et al. (U.S. Pat. No. 5,339,419) discloses the prior art ANDF approach of using tagged executable code. The software distribution format contains two parts: the executable code in the native computer platform's machine language and information covering the native computer platform's machine language (the key).
The patents to Demers et al. (U.S. Pat. No. 5,278,978) and Adair et al. (U.S. Pat. No. 5,416,917) disclose preserving and understanding the data exchanged between dissimilar relational database management systems. The system establishes layers of descriptive information to isolate machine characteristics, levels of support software, and user data descriptions. A different-type database contains predefined descriptions of the machine environments and database language structures for each database with which it can perform distributed database processing.
The patent to Boegge et al. (DE 19613666) discloses a processing system having a data server for both short and long-term archives. An exchange archive connected to the data server holds data models describing the plant process.
Bowdidge et al. (INSPEC: "Automated Support for Encapsulating Abstract Data Types", SIGSOFT Engineering Notes, v.19, n.5, pp. 97-110, December 1994) discloses using a meaning-preserving program restructuring tool that creates a new abstract data type by encapsulating an existing data structure. Data encapsulation simplifies modification by isolating changes to the implementation and behavior of an abstract data type.
Miles (INSPEC: "Structural Realizations of Program Schemata", Michigan State Univ., 206 pp.) discloses using finite state theory to synthesize and detect common program structures ("controls" or "schemata") identified as sequential machines. The program computation is described as an interpretation or mapping on these structures.
Nijssen (INSPEC: "Storage and Document Servers", Second International Summer School on the Digital Library, pp. 77-92, 1997) discusses aspects of long-term archiving of document collections in a digital library for access by specialized historians. Three implementations are discussed: Webdoc, developed by Pica; Science Server by Orion; and Decomate.
As described above, many problems exist with prior art solutions to the long-term storage of digital data and future recovery thereof. Whatever the precise merits, features and advantages of the above solutions, none of them achieves or fulfills the purposes of the present invention.
Digital data is preserved by archiving on a removable medium. In the long term, the saved data bit stream must be correctly interpreted. For a computer program or system to be archived, the bit stream constituting the program must be archived and the code must be executable at restore time. The program that restores the data does not "see" the contents of the data itself, but accesses it by issuing function calls to an executor. A description of which methods are available to restore the information hidden in the data, and what they return, is available in the metadata. A text tells the client which functions are available and what their purposes are.
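The relationship between the restore program and the executor can be sketched as follows. This is a minimal illustration under stated assumptions: the class name `Executor`, the method names `get_title` and `get_year`, and the `|`-separated sample data are all hypothetical. What the sketch preserves from the description is that the restore program reads the metadata text to learn which functions exist and then obtains information only through function calls, never by inspecting the bit stream itself.

```python
from typing import Callable, Dict


class Executor:
    """Holds the archived bit stream and exposes it only through named methods."""

    def __init__(self, data: bytes,
                 methods: Dict[str, Callable[[bytes], object]],
                 descriptions: Dict[str, str]) -> None:
        self._data = data                  # never handed to the caller directly
        self._methods = methods
        self._descriptions = descriptions  # the text stored in the metadata

    def describe(self) -> Dict[str, str]:
        """Return the metadata text: available functions and their purposes."""
        return dict(self._descriptions)

    def call(self, name: str) -> object:
        """Invoke one of the archived methods on the hidden data."""
        return self._methods[name](self._data)


# Hypothetical archived data and decoding methods.
archived = b"Annual memo|2000"
executor = Executor(
    data=archived,
    methods={
        "get_title": lambda d: d.split(b"|")[0].decode("ascii"),
        "get_year": lambda d: int(d.split(b"|")[1]),
    },
    descriptions={
        "get_title": "returns the title of the memo as text",
        "get_year": "returns the year of the memo as an integer",
    },
)

# The restore program discovers the functions, then calls them.
results = {name: executor.call(name) for name in executor.describe()}
```

Here `results` holds the restored title and year, obtained without the restore program knowing how the fields are laid out in the bit stream.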
The archiving method is based on using a virtual computer instruction set and saving the algorithm that decodes the data (the method) as a program in that virtual machine language. For machine instructions to be executed many years later, for example 100 years, an emulator of the original machine would be written on the future hardware. The manufacturer of any machine in the originating year would develop, for each architecture, a Universal Virtual Computer (UVC) description of the machine. Each originating instruction would be mapped into a small program of UVC instructions. All manufacturers of new architectures would then have to write a UVC executor that would be able to execute UVC instructions on the machine running 100 years in the future. Any invocation of the methods returns data in a certain format. That format must be natural and simple so that it remains relevant in the future. A simple data model is used to describe that format to the future user.
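The division of labor between an archived method and a future executor can be illustrated with a toy virtual machine. The instruction set below (`LOADI`, `LEN`, `LOADB`, `ADD`, `SUB`, `OUT`, `JGE`, `JMP`, `HALT`) is invented purely for this sketch and is not the UVC instruction set, which this section does not specify. The stored data is assumed to use a trivial encoding in which every byte was incremented by one; the decoding method is archived as a program for the toy machine, and `run_uvc` plays the role of the executor that a future manufacturer would write.

```python
from typing import List, Tuple

Instruction = Tuple  # (opcode, operand, ...), hypothetical format

# Archived method: emit each input byte minus one.
DECODER: List[Instruction] = [
    ("LOADI", 0, 0),   # 0: r0 = 0            (index i)
    ("LEN", 1),        # 1: r1 = len(data)
    ("LOADI", 2, 1),   # 2: r2 = 1            (constant)
    ("JGE", 0, 1, 9),  # 3: if r0 >= r1 goto 9
    ("LOADB", 3, 0),   # 4: r3 = data[r0]
    ("SUB", 3, 3, 2),  # 5: r3 = r3 - r2
    ("OUT", 3),        # 6: append r3 to the output
    ("ADD", 0, 0, 2),  # 7: r0 = r0 + r2
    ("JMP", 3),        # 8: goto 3
    ("HALT",),         # 9: stop
]


def run_uvc(program: List[Instruction], data: bytes) -> bytes:
    """Toy executor: interpret the program against the archived bit stream."""
    reg = [0, 0, 0, 0]
    out = bytearray()
    pc = 0
    while True:
        op, *args = program[pc]
        pc += 1
        if op == "LOADI":
            reg[args[0]] = args[1]
        elif op == "LEN":
            reg[args[0]] = len(data)
        elif op == "LOADB":
            reg[args[0]] = data[reg[args[1]]]
        elif op == "ADD":
            reg[args[0]] = reg[args[1]] + reg[args[2]]
        elif op == "SUB":
            reg[args[0]] = reg[args[1]] - reg[args[2]]
        elif op == "OUT":
            out.append(reg[args[0]])
        elif op == "JGE":
            if reg[args[0]] >= reg[args[1]]:
                pc = args[2]
        elif op == "JMP":
            pc = args[0]
        elif op == "HALT":
            return bytes(out)
        else:
            raise ValueError(f"unknown instruction: {op}")


archived = bytes(b + 1 for b in b"UVC")  # the encoded bit stream on the medium
restored = run_uvc(DECODER, archived)
```

Only `run_uvc` depends on the future machine. The decoder travels with the data and needs no rewriting, which is the property that motivates defining a single virtual instruction set rather than emulating each original machine separately.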