1. Field of the Invention
The invention relates to systems for transforming data, such as, for example, a file or document, from one data format to another data format and, in particular, to such systems for use in a heterogeneous computer system. The invention also relates to methods for transforming data.
2. Background Information
Most computer systems are a heterogeneous environment of software, hardware and communication networks. Electronic data is generated by a user employing software known as applications (e.g., word processors) or by devices (e.g., digital cameras, electrocardiogram machines). The electronic data is often referred to as a document or a file. This electronic data may be displayed (e.g., word-processing document, image file, video file), played (e.g., voice file, music file), processed or sent to another point in the computer system to be displayed, played, processed, printed, or transmitted (e.g., by facsimile).
Regardless of its purpose, the electronic data undergoes numerous transformations using a variety of software and hardware. Often, during each of these steps, the language (or the format) of the electronic data is interpreted and then transformed to a different format, which is suitable for the next step in the transformation. Sometimes, interpretation of the data format may not be necessary, as when the data is simply encoded or encapsulated, perhaps for transmission across the network, or encrypted for security. Quite often, the transformation may be accomplished by an expert in interpreting and translating the electronic data, such as, for example, in voice transcription and in filling out standard forms in an office environment. In essence, electronic document transformation is essential and ubiquitous in a networked world. The required transformations may be completely automatic, as is the case, for example, when an e-mail message/attachment undergoes numerous transformations in a sequence decided by pre-configured sequencing of software and hardware tools. The transformation may also be initiated by a user, for example, as in the transformation of word processing documents to PDF (Portable Document Format) or through printing of such documents by a printer.
Transformations are not always straightforward or even feasible. Rapid innovations in information technology have resulted in the proliferation of newer and better representations and delivery of data. Usually, each new representation of data requires yet another set of software and/or hardware solutions to transform the data. Hence, there may be a substantial delay in acquiring such solutions and updating a computer system. This problem is further complicated by the heterogeneity of the computer infrastructure. This heterogeneity is the result of varied computing platforms based on different architectures (e.g., x86 architecture of most PCs, PowerPC architecture of Macs, a variety of RISC and CISC architectures of servers from Sun and IBM), a variety of operating systems for these architectures (e.g., Windows, MacOS, SunOS, Linux, IBM AIX) and a variety of applications using these operating systems. The same heterogeneity exists within input and/or output devices such as printers, facsimiles, medical instruments and wireless devices.
The imposition of standards on data representations and formats can alleviate some of this problem. However, standards are few and far between, mainly because most information technology manufacturers deem it advantageous to develop proprietary formats. Also, no single standard can effectively deal with all possible combinations of formats, even for a limited application area such as word-processing.
Using a standard intermediate format in a two-stage transformation system for electronic documents is widely known. For example, in an Electronic Data Interchange (EDI) system, where the creator knows the formats, the transformation system can be a two-step system. See U.S. Pat. Nos. 5,202,977 and 5,701,423. The reliability of such systems can be ensured if and only if all the input file formats of interest can be reliably transformed to one known format, which, in turn, can be transformed to the desired output format. In some systems where the formats are known, or publicly known, this is a feasible scheme, although it requires a centralized transformation system with a completely defined software/hardware environment. Centralized systems, however, are vulnerable to failure and are hard to scale and manage with increasing load.
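By way of illustration only, the two-stage scheme described above may be sketched as follows. The format names ("plain", "upper", "reversed") and the toy converters are hypothetical and are not taken from the cited patents; the sketch merely shows that every input format is first brought to one known intermediate format, which is then transformed to the desired output format.

```python
from typing import Callable, Dict

Converter = Callable[[str], str]

# Hypothetical intermediate format; all names here are illustrative assumptions.
INTERMEDIATE = "plain"

# Stage 1: one converter per input format, each producing the intermediate format.
to_intermediate: Dict[str, Converter] = {
    "upper": lambda data: data.lower(),
    "reversed": lambda data: data[::-1],
}

# Stage 2: one converter per output format, each consuming the intermediate format.
from_intermediate: Dict[str, Converter] = {
    "upper": lambda data: data.upper(),
    "reversed": lambda data: data[::-1],
}


def transform(data: str, src: str, dst: str) -> str:
    """Transform data from src to dst by way of the intermediate format."""
    if src != INTERMEDIATE:
        if src not in to_intermediate:
            raise ValueError(f"no converter from {src!r} to {INTERMEDIATE!r}")
        data = to_intermediate[src](data)
    if dst != INTERMEDIATE:
        if dst not in from_intermediate:
            raise ValueError(f"no converter from {INTERMEDIATE!r} to {dst!r}")
        data = from_intermediate[dst](data)
    return data
```

With n formats, such a scheme needs on the order of 2n converters rather than one converter for each ordered pair of formats, but it fails for any format that cannot be reliably brought to the intermediate format.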
The application which generates a file is often best suited for interpreting it and, in many cases, transforming the file to other standard formats (e.g., files created by Microsoft Word (MS Word) can best be read by MS Word and translated into Rich Text Format (RTF)). Thus, the use of native applications may be a better solution than the use of third-party software, which may cause some loss or modification of content during interpretation and translation. While best suited for the task, native or first-party software may be prohibitively expensive, especially if a majority of the users are only using the software to view files in a specific format. In such a case, using an instance of the software as a service to convert files to a common readable format provides a cost-effective technique for sharing information.
In addition, third-party software may be faster, cheaper and/or provide a wider range of document formats than native applications. Further, the loss/modification of content may be acceptable depending upon the purpose of the final document. U.S. Pat. Nos. 6,092,114 and 5,283,887 disclose the use of native and third-party applications in a centralized transformation system in an e-mail environment, in which the transformation process is a single-step process using a known application. The process of discovering any change in availability of new transformation applications is through static configuration files (i.e., files that cannot be updated while the server is operational) at a central server (i.e., one master controlling server).
One-step transformations based upon static configuration files are a problem because software versions and electronic file formats are being rapidly upgraded. This results in incompatibility between versions of the same application. For example, a PowerPoint version 4 document is not compatible with the latest PowerPoint version in Office2000. One has to first use Office97 to convert the former document to an intermediate format that can then be converted to the latter format. Thus, there is an ever-increasing volume of legacy documents that become unreadable unless one has access to the original applications that created them. Also, some applications preclude the simultaneous use of multiple versions on the same computer.
U.S. Pat. No. 5,251,314 discloses multiple application versions, in which a single application performs the same transformation. The substeps required for a full transformation are pre-determined and stored in centralized static configuration files. The available transformations are also centrally stored. Recovery of the original file is provided.
U.S. Pat. Nos. 5,251,314; 5,299,304; and 5,513,323 disclose the use of plural applications or stages to perform a more complex transformation through a series of conversions. The system of U.S. Pat. No. 5,251,314 determines whether a requested transformation can actually be performed and in how many stages. This determination is used to decide a priori whether to proceed with the requested transformation. In a real system, the number of stages of transformation is not the only issue in deciding which sequence of transformations is more efficient or less expensive. U.S. Pat. No. 5,513,323 addresses a similar problem, except that the goal is to find the sequence with the least cost. A fixed cost is associated with each primitive transformation and a sequence of primitive transformations is chosen.
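For purposes of illustration, the selection of a least-cost sequence of primitive transformations, each having a fixed cost, may be sketched as a shortest-path search. The sketch below uses Dijkstra's algorithm, which is an assumption made here and is not stated to be the method of the cited patents; the formats and costs in PRIMITIVES are likewise hypothetical.

```python
import heapq
from typing import Dict, List, Optional, Tuple

# Hypothetical primitive transformations: (source, target) -> fixed cost.
PRIMITIVES: Dict[Tuple[str, str], float] = {
    ("doc", "rtf"): 1.0,
    ("rtf", "html"): 1.0,
    ("doc", "ps"): 2.0,
    ("ps", "pdf"): 1.0,
    ("html", "pdf"): 3.0,
    ("doc", "pdf"): 5.0,
}


def cheapest_sequence(
    src: str,
    dst: str,
    primitives: Dict[Tuple[str, str], float] = PRIMITIVES,
) -> Optional[Tuple[float, List[str]]]:
    """Return (total cost, list of formats visited), or None if dst is unreachable."""
    # Build an adjacency list from the table of primitive transformations.
    graph: Dict[str, List[Tuple[str, float]]] = {}
    for (a, b), cost in primitives.items():
        graph.setdefault(a, []).append((b, cost))

    # Dijkstra's algorithm over formats; costs are assumed non-negative.
    heap: List[Tuple[float, str, List[str]]] = [(0.0, src, [src])]
    settled = set()
    while heap:
        cost, fmt, path = heapq.heappop(heap)
        if fmt == dst:
            return cost, path
        if fmt in settled:
            continue
        settled.add(fmt)
        for nxt, step_cost in graph.get(fmt, []):
            if nxt not in settled:
                heapq.heappush(heap, (cost + step_cost, nxt, path + [nxt]))
    return None
```

With the hypothetical costs above, the single-stage transformation from "doc" to "pdf" costs 5.0, whereas the two-stage sequence through "ps" costs 3.0, so a choice based on the number of stages alone would differ from a choice based on cost.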
Ockerbloom, J., “Mediating Among Diverse Data Formats,” Carnegie Mellon University, pp. 1–145 (1998), discloses a conversion system including clients, servers, and mediators called type brokers. The servers provide data and perform operations on the data, such as conversions, method executions and attribute fetches. The clients retrieve data and request operations on the data. The type brokers take client requests and find servers that return data and operation results that clients seek.
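The division of roles among clients, servers and type brokers may be sketched, under stated assumptions, as follows. The class names, method names and toy conversions are hypothetical and are not drawn from the cited reference; the sketch shows only a broker that accepts a client request and finds a registered server able to perform the requested conversion.

```python
from typing import Callable, Dict, List, Tuple

Conversion = Callable[[str], str]


class Server:
    """Hypothetical server that offers a fixed set of conversions."""

    def __init__(self, name: str, conversions: Dict[Tuple[str, str], Conversion]):
        self.name = name
        self._conversions = conversions

    def can_convert(self, src: str, dst: str) -> bool:
        return (src, dst) in self._conversions

    def convert(self, data: str, src: str, dst: str) -> str:
        return self._conversions[(src, dst)](data)


class TypeBroker:
    """Hypothetical mediator that routes client requests to a capable server."""

    def __init__(self) -> None:
        self._servers: List[Server] = []

    def register(self, server: Server) -> None:
        self._servers.append(server)

    def request(self, data: str, src: str, dst: str) -> str:
        # Use the first registered server that offers the conversion.
        for server in self._servers:
            if server.can_convert(src, dst):
                return server.convert(data, src, dst)
        raise LookupError(f"no server converts {src!r} to {dst!r}")


# A client holds only a reference to the broker, not to any particular server.
broker = TypeBroker()
broker.register(Server("a", {("text", "shout"): str.upper}))
broker.register(Server("b", {("text", "mirror"): lambda s: s[::-1]}))
```

In this arrangement a client need not know which server performs a given conversion; it submits the request to the broker, for example broker.request("hello", "text", "mirror").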
There is room for improvement in methods and systems for transforming data in heterogeneous computer systems.
In summary, the task of transforming documents is costly, tedious and time-consuming. Many people regularly face this problem. This is especially true inside an organization where the transformation needs may be specific to the type of documents that are handled daily. There is no known solution that systematically manages this problem inside a networked organization or for loosely connected communities on the Internet. Ideally, such a system should be able to grow as the transformation needs of an organization increase, new document formats and software are introduced, and existing software versions change.