1. Field of the Invention
The present invention relates to a structured document processing technology of disassembling a document into chapters, sections, paragraphs and charts, or plural primitives such as captions, chapter titles, and summaries to represent and handle the document, using a structure such as a tree structure or graph structure with the primitives as nodes, and more particularly to a structured document processing technology of newly synthesizing a document from plural structured documents.
To be more specific, the present invention relates to a structured document processing technology of retrieving document portions (“document parts”) satisfying specific conditions from plural structured documents and inserting or substituting the document parts in other documents for document synthesis, and relates to a structured document processing technology of synthesizing documents without using a script that describes a procedure for extracting document parts from the structured documents, and inserting or substituting the document parts in a document as a template.
2. Description of the Prior Art
A document rarely consists of character strings alone; it generally includes segments such as chapters, sections and paragraphs, and inserted contents such as charts, or primitives such as captions, chapter titles, and summaries.
For this reason, document processing technologies have been developed which disassemble a single document into chapters, sections, paragraphs and charts, or plural primitives such as captions, chapter titles, and summaries to represent and handle the document, using a structure such as a tree structure or a graph structure with the primitives as nodes. Documents thus structured are generally called “structured documents” and can be processed in various ways using computing systems.
In a structured document, a parent-child relationship represented by nodes and links expresses a logical structure of the document. For example, for nodes having attributes such as “chapter title”, “diagram”, and “chapter”, layout processing for printing on a node basis, final copy creation processing, and automatic creation of an abstract collection and a table of contents from the node attributes can be performed.
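As a minimal sketch of such node-based processing, the following Python fragment builds a small document tree and derives a table of contents from the node attributes. The Node class, the attribute names, and the sample document are hypothetical and are introduced only for illustration; they are not taken from any of the technologies described here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One primitive of a structured document (hypothetical representation)."""
    node_type: str                      # e.g. "document", "chapter", "chapter title"
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def collect_by_type(node: Node, node_type: str) -> List[Node]:
    """Depth-first walk returning nodes of the given type in document order."""
    found = [node] if node.node_type == node_type else []
    for child in node.children:
        found.extend(collect_by_type(child, node_type))
    return found


def table_of_contents(root: Node) -> List[str]:
    """Number every "chapter title" node to form a table of contents."""
    titles = collect_by_type(root, "chapter title")
    return [f"{i}. {t.text}" for i, t in enumerate(titles, start=1)]


# Illustrative document: two chapters, each with a title and one other primitive.
document = Node("document", children=[
    Node("chapter", children=[
        Node("chapter title", "Introduction"),
        Node("paragraph", "Structured documents are trees."),
    ]),
    Node("chapter", children=[
        Node("chapter title", "Method"),
        Node("diagram", "FIG. 1"),
    ]),
])

toc = table_of_contents(document)
```

The same walk could collect nodes of type "diagram" or "summary" instead, which is how an abstract collection might be produced from the same tree.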
Presently, description languages such as SGML (Standard Generalized Markup Language), XML (eXtensible Markup Language), and HTML (Hyper Text Markup Language) are well known as formats for describing structured documents. For example, HTML has notations for specifying tables (TABLE) and item lists (UL).
One method of specifying a structure for a plain text file is a method called markup. Markup defines a structure by enclosing a portion of a document between a tag symbol indicating the start of a specific logical structure and a tag symbol indicating the end thereof, according to predetermined rules. For example, HTML describes an item list as follows by using <UL> as an item list start tag, </UL> as an item list end tag, <LI> as an item start tag, and </LI> as an item end tag.
<UL>
<LI> item 1 of item list </LI>
<LI> item 2 of item list </LI>
<LI> item N of item list </LI>
</UL>
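The following sketch shows how the start and end tags above delimit a structure that a program can recover. It uses the HTMLParser class of the Python standard library; the ItemListParser class is a hypothetical helper written only for this illustration.

```python
from html.parser import HTMLParser


class ItemListParser(HTMLParser):
    """Collects the text enclosed by each <LI> ... </LI> pair."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_item = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag == "li":
            self._in_item = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "li" and self._in_item:
            self.items.append("".join(self._buffer).strip())
            self._in_item = False

    def handle_data(self, data):
        if self._in_item:
            self._buffer.append(data)


markup = """<UL>
<LI> item 1 of item list </LI>
<LI> item 2 of item list </LI>
<LI> item N of item list </LI>
</UL>"""

parser = ItemListParser()
parser.feed(markup)
parser.close()
items = parser.items
```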
A method of synthesizing such structured documents has been proposed. According to the method, searching is performed based on document structures to retrieve document portions (hereinafter referred to as “document parts”) satisfying specific conditions from plural documents, and the retrieved document parts are inserted in other documents for document synthesis. For example, in the document processing method disclosed as “Document Processing Method and Document Processing Apparatus” in Japanese Published Unexamined Patent Application No. Hei 6-52161, already assigned to the applicant, document parts having given attributes are retrieved from a structured document represented in a tree or graph structure by select-type instructions that specify the types of document parts (referred to as components in the publication) represented as nodes, and the retrieved document parts are inserted into a second document. The publication shows an example that specifies, e.g., “Figure” and “Segment” as arguments of the select-type instruction to retrieve figures and sections, respectively. As for insertion of document parts, an example is described which specifies a specific node of the document into which they are to be inserted and inserts them in the last child node of the specified node. By judging parent-child relationships of nodes, a node having a specific structure pattern can also be retrieved.
“Akane” produced by Fuji Xerox Co., Ltd. is a document processing application software product based on a structured document editor that operates on a window system. Document processing command sets are provided as tools for Akane. In Chapter 3 “Application Examples” of “Akane Document Operation Command Set Programmers Guide” on pages 2–95 and 2–96, examples are described which retrieve nodes having structures satisfying specific conditions as document parts for synthesis into one document.
In this way, a program that specifies a retrieval expression in advance and a program that processes the document parts obtained as retrieval results are coupled by a pipeline, whereby document parts satisfying specific conditions expressed as path pattern expressions can be retrieved from an input original document to synthesize a new structured document and document parts.
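The retrieval by node type and the insertion described above can be sketched roughly as follows. This is not the instruction set of the cited publication: the functions select_type and insert_as_last_children and the sample documents are hypothetical, and the insertion is read here, as an assumption, as appending the retrieved parts as the last children of the specified node.

```python
import copy
import xml.etree.ElementTree as ET


def select_type(root, node_type):
    """Return every node of the given type, in document order (hypothetical)."""
    return list(root.iter(node_type))


def insert_as_last_children(target, parts):
    """Append copies of the parts after the existing children of the target node."""
    for part in parts:
        target.append(copy.deepcopy(part))


source = ET.fromstring(
    "<Document>"
    "<Segment><Para>Overview</Para><Figure>fig-a</Figure></Segment>"
    "<Segment><Figure>fig-b</Figure></Segment>"
    "</Document>"
)
second = ET.fromstring(
    "<Document><Appendix><Para>All figures</Para></Appendix></Document>"
)

figures = select_type(source, "Figure")      # retrieve by node type
appendix = second.find("Appendix")           # the specified node of the second document
insert_as_last_children(appendix, figures)

child_tags = [child.tag for child in appendix]
figure_texts = [f.text for f in appendix.findall("Figure")]
```

Copying the parts before insertion leaves the source document unchanged, so the same source could serve several syntheses.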
As already described, XML is a language capable of describing structured documents. “XML Development Examples” (Ascii Corp. ISBN-7561-3112-3) discloses XSL (eXtensible Stylesheet Language), a language for processing structured documents described in XML, and the handling of structured documents by its processor. The syntax of XSL has the following structure, for example.
<rule>
[pattern]
[action]
</rule>
[pattern] describes a retrieval expression for document parts to be processed. [action] describes processing for retrieved document parts. An example of a retrieval expression is shown below.
<rule>
<target-element type="section"/>
<element type="figure"/>
[action]
</rule>
<target-element type="section"/> indicates that the node type of the document part to be retrieved is “section”, and the following <element type="figure"/> is a retrieval expression that restricts the retrieved document parts to those containing a child node whose node type is “figure”.
The example shown below is an expression for retrieving document parts whose node type is “employee” and whose parent node type is “person”.
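The effect of a pattern of this kind, a “section” node restricted to those having a “figure” child, can be sketched with the limited XPath support of Python's xml.etree.ElementTree. Note that this swaps in XPath syntax in place of the XSL notation shown above; the sample document and its id attributes are hypothetical.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<document>"
    "<section id='s1'><title>Intro</title><figure>f1</figure></section>"
    "<section id='s2'><title>Text only</title></section>"
    "<chapter><section id='s3'><figure>f2</figure></section></chapter>"
    "</document>"
)

# ".//section" selects section nodes at any depth; the "[figure]" predicate
# keeps only those that have a child node of type "figure".
matches = doc.findall(".//section[figure]")
ids = [section.get("id") for section in matches]
```

Here the section with id "s2" is excluded because it has no “figure” child.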
<rule>
<element type="employee"/>
<target-element type="person"/>
[action]
</rule>
In this way, by interpreting and executing a script that describes in advance a retrieval expression and an action for processing the document parts obtained as retrieval results, document parts satisfying specific conditions can be retrieved from an input original document to synthesize a new structured document and document parts.
A document processing apparatus disclosed as “Structured Document Processing Apparatus” in Japanese Published Unexamined Patent Application No. Hei 7-56920, already assigned to the applicant, has a partial structure string extraction part that extracts plural document parts from structured documents and outputs a string of document parts, and a processing execution part that inputs and processes the string of document parts. According to the document processing apparatus thus configured, because an extraction specification part and a processing specification part are managed separately, when the structure of an original document changes, the resulting changes to document processing can be confined to the extraction specification part alone. This makes it easier to revise and maintain the system and documents than with the conventional technologies described above.
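A script of this pattern-and-action form and its interpretation can be sketched as follows. This is only an illustration under stated assumptions: the rule list RULES, the action title_as_item, and the function interpret are hypothetical, and the patterns are written in the XPath subset of xml.etree.ElementTree rather than in XSL.

```python
import xml.etree.ElementTree as ET


def title_as_item(section):
    """Action: turn a retrieved section into a list item holding its title."""
    item = ET.Element("item")
    item.text = section.findtext("title", default="")
    return item


# Each rule pairs a retrieval expression (pattern) with an action.
RULES = [(".//section[figure]", title_as_item)]


def interpret(rules, original):
    """Apply every rule to the original document and collect the results."""
    synthesized = ET.Element("list")
    for pattern, action in rules:
        for part in original.findall(pattern):
            synthesized.append(action(part))
    return synthesized


original = ET.fromstring(
    "<document>"
    "<section><title>Overview</title><figure>f1</figure></section>"
    "<section><title>Notes</title></section>"
    "<section><title>Results</title><figure>f2</figure></section>"
    "</document>"
)

output = interpret(RULES, original)
item_texts = [item.text for item in output]
```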
The above-described conventional technologies take a system configuration as shown in FIG. 26 or 27. For example, in a structured document processing system as shown in FIG. 26, first, an original document and a template are inputted to an extracting/synthesizing program. The extracting/synthesizing program performs a procedure such as extraction of document parts from the original document and the insertion and substitution of the document parts in the template according to an extraction/synthesis script described in script format, and generates a synthesized document.
In a structured document processing system as shown in FIG. 27, an extracting program extracts document parts from an original document according to a procedure described in an extraction script. The extracted document parts are inputted to a synthesizing program along with a template. The synthesizing program performs the insertion and substitution of the document parts in the template according to a procedure described in a synthesis script, and generates a synthesized document.
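The two-stage configuration described for FIG. 27 can be sketched as follows. The function names, the form of the two scripts (a list of retrieval expressions and a dictionary naming an insertion point), and the sample documents are all assumptions made for illustration and do not reproduce any of the cited systems.

```python
import copy
import xml.etree.ElementTree as ET


def extracting_program(original, extraction_script):
    """Return the document parts matched by each retrieval expression."""
    parts = []
    for expression in extraction_script:
        parts.extend(original.findall(expression))
    return parts


def synthesizing_program(template, parts, synthesis_script):
    """Insert copies of the parts into a copy of the template."""
    synthesized = copy.deepcopy(template)
    target = synthesized.find(synthesis_script["insert_into"])
    if target is None:
        raise ValueError("insertion point not found in template")
    for part in parts:
        target.append(copy.deepcopy(part))
    return synthesized


original = ET.fromstring(
    "<report>"
    "<section><summary>Q1 results</summary><para>Details</para></section>"
    "<section><summary>Q2 results</summary></section>"
    "</report>"
)
template = ET.fromstring("<digest><title>Summaries</title><body/></digest>")

extraction_script = [".//summary"]            # hypothetical extraction script
synthesis_script = {"insert_into": "body"}    # hypothetical synthesis script

parts = extracting_program(original, extraction_script)
synthesized = synthesizing_program(template, parts, synthesis_script)
summaries = [s.text for s in synthesized.find("body")]
```

Both scripts must be kept alongside the original document and the template, which is the management burden discussed in this section.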
Each of the above-described technologies uses a script (extraction script) describing a procedure for extracting document parts from an original document and a script (synthesis script) for inserting or substituting the document parts extracted from the original document into a template, which serves as the base of the document outputted as a result. In other words, these conventional technologies require management of these scripts in addition to the original documents, and therefore have the problems described below.
1. To retrieve document parts from an original document, a retrieval expression for locating the structures and patterns of the document parts must be described in a script. Therefore, changes of the structure of the original document involve corresponding modifications of the retrieval expression in the script.
2. To process a mixture of plural original documents that are structurally different, a different script must be prepared for each of different structures.
3. It is difficult to describe, as a procedure, a retrieval expression for locating the structures and patterns of document parts together with the steps for processing the retrieved document parts. Generally, a procedure of the form “retrieve document parts satisfying condition A and perform procedure B on the obtained results” must be described as a script. Where the number of document parts depends on the original document, the script must combine repetition instructions such as “for” and “repeat” statements, which count the document parts and iterate over them, with instructions (e.g., an insertion instruction) that actually perform the desired processing (e.g., insertion and substitution). Creating such a script requires about as much knowledge as programming does, which probably makes widespread use among general users difficult.
4. No reference is made to a mechanism for easily reusing intermediate results of document processing (e.g., document parts extracted by retrieval processing). For example, the above-described Akane requires a script that saves intermediate results in a file to be explicitly described.
The first problem is that, where the structure of an original document is changed, all scripts that process the original document must be located, and failure to modify any of them causes document processing to malfunction.
The second problem is that, for each input document with a different structure, developing and maintaining a dedicated script to process that input document requires much time and effort. Users must also select an appropriate script for use; if an appropriate script is not selected, document processing may operate improperly or otherwise malfunction.
With the third problem, it is extremely difficult to implement a document processing system with which users themselves can create structured document processing applications serving individual purposes.
The fourth problem is that the development of efficient applications requires much time.
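The first and third problems can be illustrated by a short sketch. The script below hard-codes a retrieval expression and an explicit repetition over the retrieved document parts; when the structure of the original document changes, the same script silently retrieves nothing. The expression, the element names, and the function run_script are hypothetical.

```python
import xml.etree.ElementTree as ET

# Retrieval expression tied to one particular document structure.
RETRIEVAL_EXPRESSION = "./chapter/section/figure"


def run_script(original, template_body):
    """Retrieve the parts, then insert one numbered entry per part."""
    parts = original.findall(RETRIEVAL_EXPRESSION)
    # Explicit repetition: the number of parts depends on the original document.
    for index, part in enumerate(parts, start=1):
        entry = ET.SubElement(template_body, "entry")
        entry.text = f"{index}: {part.text}"
    return len(parts)


# Original structure: figures are direct children of a section.
original_v1 = ET.fromstring(
    "<doc><chapter><section>"
    "<figure>A</figure><figure>B</figure>"
    "</section></chapter></doc>"
)
# Changed structure: the same figures are now wrapped in a "figures" node.
original_v2 = ET.fromstring(
    "<doc><chapter><section><figures>"
    "<figure>A</figure><figure>B</figure>"
    "</figures></section></chapter></doc>"
)

body_v1 = ET.Element("body")
body_v2 = ET.Element("body")
count_v1 = run_script(original_v1, body_v1)
count_v2 = run_script(original_v2, body_v2)
entries_v1 = [entry.text for entry in body_v1]
```

No error is raised for the changed document; the synthesized body is simply empty, so the need to modify the script can easily be overlooked.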