Most large organizations today have a class of document information that drives critical business processes and is of the highest value. For example, the maintenance of large airplanes is driven by document information of 100,000 pages. There is no doubt that the management of such document information is a key to success. Document information is typically organized for particular purposes, namely, driving presupposed business processes. However, the same information can, if reorganized, be used for different purposes. For example, Northwest may want to reorganize document information provided by Boeing and McDonnell-Douglas so as to maintain its airplanes efficiently in its own manner.
Many document models, notably SGML, introduce tree structures to documents. Document reorganization, the generation of new documents by assembling components of existing documents, can thus be computerized by writing tree transformation programs.
Furthermore, some document models, including SGML, introduce schemas of documents. Schemas describe which types of nodes may appear in documents and which hierarchical relationships such nodes may have. Such described information helps programmers to write tree transformation programs. A schema is typically represented by an extended context-free grammar. A tree is an instance of this schema if it is a parse tree of that grammar.
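The relationship between a schema and its instances can be sketched as follows. This is only an illustrative sketch, not the representation used by SGML or by any system discussed here: the `Node` class, the `SCHEMA` table, the node type names, and the encoding of each production's right-hand side as a Python regular expression over child type names are all assumptions made for the example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """A document tree node: a type name plus an ordered list of children."""
    type: str
    children: list["Node"] = field(default_factory=list)


# Hypothetical schema (an extended context-free grammar). Each production maps
# a node type to a regular expression over the types of its children. Every
# child type is written with a trailing space so that names cannot run together.
SCHEMA = {
    "doc": r"(section )+",
    "section": r"title (para |figure )*",
    "figure": r"caption ",
    "title": r"",
    "para": r"",
    "caption": r"",
}


def conforms(node: Node, schema: dict[str, str]) -> bool:
    """Check the production for `node`, then recurse into its children."""
    if node.type not in schema:
        return False
    child_types = "".join(child.type + " " for child in node.children)
    if re.fullmatch(schema[node.type], child_types) is None:
        return False
    return all(conforms(child, schema) for child in node.children)


def is_instance(root: Node, schema: dict[str, str], start: str = "doc") -> bool:
    """True if the tree is a parse tree of the grammar with start symbol `start`."""
    return root.type == start and conforms(root, schema)
```

Under these assumptions, a `doc` containing one `section` with a `title`, a `para`, and a `figure` (with its `caption`) is an instance of the schema, whereas a `section` lacking a `title` is not.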
Some programming languages have been proposed to handle SGML documents. Scrimshaw (Arnon, "Scrimshaw, A language for document queries and transformation," Electronic Publishing--Origination, Dissemination, and Design 6(4):385-396, December 1993) and Omnimark (Exoterica) are examples. An operator in such languages takes a pattern and/or a contextual condition as parameters (although Scrimshaw has patterns only, and Omnimark has contextual conditions only). The nodes to be handled by that operator are located by the pattern and/or contextual condition, as shown in FIG. 5.
As used herein, a pattern is a collection of conditions on a node and its directly-or-indirectly subordinate nodes. Each condition concerns the type, content, or attributes of a node. As an example, consider a pattern: figures whose captions contain the string "foo". It can be more formally stated as a node A such that
1) A is a figure, and
2) A has a subordinate node B such that
2-1) B is a caption, and
2-2) B contains the string "foo".
In this example, 1) is a condition on nodes and 2) is a pair of conditions on subordinate nodes.
A contextual condition is a collection of conditions on superior nodes, sibling nodes, and subordinates of the sibling nodes. For instance, assume that in the previous example we are interested only in those figures in sections entitled "bar". Then, we want to introduce a contextual condition: the figure must be directly or indirectly subordinate to a section node such that its title node contains the string "bar". It can be more formally stated as: a node A has a (possibly indirect) superior node D such that
1) D is a section, and
2) D has a subordinate node E such that
2-1) E is a title, and
2-2) E contains the string "bar".
In this example, 1) is a condition on a (possibly indirect) superior node and 2) is a pair of conditions on a subordinate of this superior node.
A tree is accepted by M if and only if it is an instance of the input schema.
The states of M are classified into marked states and unmarked states.
A node matches the pattern if and only if M associates a marked state with this node (in successful computation).
An operator specifies the types of operations such as delete, insert, reorder, copy, etc. For example, the delete operator specifies that the subtree located by the pattern and the contextual condition be removed from the input document.
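The way a pattern and a contextual condition together locate the subtrees handled by a delete operator can be sketched as follows. This is a minimal illustration under stated assumptions, not the mechanism of Scrimshaw, Omnimark, or any other system: the `Node` class and the functions `matches_pattern`, `satisfies_context`, and `delete_located` are hypothetical names, and the pattern and contextual condition are hard-coded to the "foo" and "bar" examples from the text.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass(eq=False)  # identity-based equality, so list.remove targets one node
class Node:
    type: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def __post_init__(self) -> None:
        for child in self.children:
            child.parent = self


def descendants(node: Node) -> Iterator[Node]:
    """Yield every directly-or-indirectly subordinate node."""
    for child in node.children:
        yield child
        yield from descendants(child)


def ancestors(node: Node) -> Iterator[Node]:
    """Yield every directly-or-indirectly superior node."""
    while node.parent is not None:
        node = node.parent
        yield node


def matches_pattern(a: Node) -> bool:
    """Pattern: A is a figure with a subordinate caption B containing "foo"."""
    return a.type == "figure" and any(
        b.type == "caption" and "foo" in b.text for b in descendants(a)
    )


def satisfies_context(a: Node) -> bool:
    """Context: A has a superior section D with a subordinate title E containing "bar"."""
    return any(
        d.type == "section"
        and any(e.type == "title" and "bar" in e.text for e in descendants(d))
        for d in ancestors(a)
    )


def delete_located(root: Node) -> int:
    """Delete operator: remove each located subtree; return how many were removed."""
    targets = [n for n in descendants(root)
               if matches_pattern(n) and satisfies_context(n)]
    for n in targets:
        n.parent.children.remove(n)
    return len(targets)
```

With this sketch, a figure captioned "foo" is removed only when it lies under a section whose title contains "bar"; the same figure under a section with a different title is left in place.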
A serious drawback of document programming languages is that DTD's (Document Type Definitions) for output documents are not generated. However, DTD's are mandatory for further utilizing the output documents.
To avoid this problem, the programmer writes a DTD for output documents in advance. He or she then writes programs carefully so that output documents conform to this DTD. Obviously, this process becomes difficult as programs become complicated. Moreover, conformity is never ensured: the programmer can test the conformity of some output documents, but the next output document might not conform.
Colby, "An algebra for list-oriented applications," Technical Report TR 347, Indiana University, Bloomington, Ind., December 1992, proposes a model for list-oriented database systems that can be used for documents. Like document programming languages, operators in this model have patterns. Furthermore, a schema can be generated for output documents, as illustrated in FIG. 6. That is, given a pattern, an input document, and an input schema, each operator generates not only an output document but also an output schema. It is ensured that each output document conforms to the output schema. A serious drawback of this model is that output schemas are sufficient but not minimal. In other words, aside from the documents that can be generated by the program (called a "query"), many other documents also conform to the output schema. Such loose schemas contradict the original intention of having schemas; that is, they do not help programmers.
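The difference between a sufficient output schema and a minimal one can be illustrated with a toy case. The sketch below is not Colby's model; the query, the schema strings, and the function names are all hypothetical, and a schema is reduced to a single regular expression over the child types of a `doc` node.

```python
import re

# Hypothetical input schema: a doc is any mix of paragraphs and figures.
INPUT_SCHEMA = r"(para |figure )*"

# Two candidate output schemas for the toy query below.
LOOSE_OUTPUT_SCHEMA = r"(para |figure )*"  # sufficient, but not minimal
MINIMAL_OUTPUT_SCHEMA = r"(para )*"        # accepts exactly the possible outputs


def delete_figures(children: list[str]) -> list[str]:
    """Toy query: delete every figure child of the doc node."""
    return [c for c in children if c != "figure"]


def conforms(children: list[str], schema: str) -> bool:
    """Check a sequence of child types against a schema expression."""
    encoded = "".join(c + " " for c in children)
    return re.fullmatch(schema, encoded) is not None
```

Every output of `delete_figures` conforms to both output schemas, so both are sufficient. The loose schema, however, also accepts `["figure"]`, which the query can never produce from any input, so a programmer reading it learns nothing about what the query actually guarantees.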
Another problem in the above reference is that only simple schemas are allowed. Though a schema is an extended context-free grammar, the right-hand side of each production rule is limited to a sequence of non-terminals, a non-terminal followed by the star operator, or a value (type). As a result, to store an SGML document conforming to a DTD in the proposed database system, the DTD has to be converted to a very different schema, and the document has to be significantly changed accordingly.
Furthermore, contextual conditions are not allowed and patterns are weak: they may concern direct subordinates but may not concern indirect subordinates. As a result, some operations, which would be straightforward if powerful patterns and contextual conditions were available, are hard to implement.
United States Patent Application Ser. No. 08/367,553, entitled "System for Data Format Modification and Instance Translator Construction," by An Feng (hereinafter, Feng), assigned to the same assignee as the present invention and currently pending, is also of interest. Although only the conversion of SGML documents is mentioned as a motivation, the invention claimed in that patent application applies to the transformation of documents in general, as illustrated by FIG. 7.
This reference shows that, given an input DTD and a sequence of pattern-parameterized operators, a system generates an output DTD and a conversion program (called an "instance translator"). This program in turn generates, from an input SGML document conforming to the input DTD, an output SGML document conforming to the output DTD. Unlike Colby's model, any extended context-free grammar is allowed as a schema. However, this invention is not free from the other problems of Colby's model. Output schemas are sufficient but not minimal. Contextual conditions are not allowed. Patterns are weak: conditions on direct superiors are allowed but conditions on indirect superiors are not.
The references described above are incorporated by reference for their teachings.