The present invention relates to a document generic logical information editing apparatus for editing document generic logical information generically defining the logical structure of a plurality of structured documents.
Recent advances in information processing equipment such as word processors, workstations and like devices are accompanied by the widespread use of electronic documents. As a result, it has become necessary to process large amounts of electronic documents automatically with relevant techniques. To carry out such processing requires that the target electronic documents, just as they are structured to let each of their text elements suitably express its meaning, should have a data structure allowing the multiple documents to share generically logical meanings therebetween.
The so-called structured document architecture stipulates that each of its component documents possess not only its own textual information but also document generic logical information expressing logical-unit meanings common to the plurality of documents making up the architecture. Document architectures of this kind include SGML (Standard Generalized Markup Language; ISO 8879) and ODA (Open Document Architecture; ISO 8613).
In such document architectures, the document generic logical information (i.e., logical structure) is expressed by a directed sequential graph structure having "types" and "constructors" as its nodes. The types are elements that give logical meanings to parts constituting a document. The constructors are connectors that represent the relations between the types. In the SGML, such document generic logical information is expressed illustratively as a DTD (document type definition). The types and constructors are expressed as elements and connectors respectively, represented by an occurrence indicator each. In the ODA, for example, document generic logical information is expressed in the form of a generic logical structure. The types and constructors are expressed as logical object classes and construction terms, respectively.
In the above-described document architecture SGML, the document generic logical information representing the logical-unit structure of the elements constituting a document "summary report" is described illustratively in the following DTD (document type definition):
______________________________________
<!DOCTYPE summary report [
<!ELEMENT summary report O (title, author, abstract, paragraph+)>
<!ELEMENT title O (#PCDATA)>
<!ELEMENT author O (#PCDATA)>
<!ELEMENT abstract O (#PCDATA)>
<!ELEMENT paragraph O (#PCDATA)>
]>
______________________________________
where the symbol "," is a "seq" (sequential) constructor (connector) designating that all elements must appear in the order given, and the symbol "+" is a "plus" constructor (occurrence indicator) indicating that the element in question must occur at least once and may occur repeatedly. Also available but not shown are an "and" constructor (connector) represented by the symbol "&," an "or" constructor (connector) denoted by the symbol "|," and an "opt" constructor (another occurrence indicator) represented by the symbol "?." The "and" constructor designates that all elements must occur but may do so in any order, whereas the "or" constructor stipulates that exactly one of the elements must occur. The "opt" constructor indicates that the element may occur once or not at all.
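The semantics of the five constructors can be illustrated with a minimal sketch. It is not part of the described apparatus: the tuple encoding of content models and the function names `ends` and `conforms` are assumptions made purely for illustration.

```python
from itertools import permutations

def ends(model, names, start):
    """Return every index at which a match of `model` beginning at `start` can end."""
    if isinstance(model, str):          # a type (element name)
        hit = start < len(names) and names[start] == model
        return {start + 1} if hit else set()
    kind, body = model
    if kind == "seq":                   # "," : all items, in the order given
        positions = {start}
        for item in body:
            positions = {e for p in positions for e in ends(item, names, p)}
        return positions
    if kind == "and":                   # "&" : all items, in any order
        return {e for order in permutations(body)
                for e in ends(("seq", list(order)), names, start)}
    if kind == "or":                    # "|" : exactly one of the items
        return {e for item in body for e in ends(item, names, start)}
    if kind == "opt":                   # "?" : zero or one occurrence
        return {start} | ends(body, names, start)
    if kind == "plus":                  # "+" : one or more occurrences
        reached, frontier = set(), ends(body, names, start)
        while frontier:
            reached |= frontier
            frontier = {e for p in frontier for e in ends(body, names, p)} - reached
        return reached
    raise ValueError(f"unknown constructor: {kind}")

def conforms(model, names):
    """True when the whole sequence of element names satisfies the content model."""
    return len(names) in ends(model, names, 0)

# Content model of "summary report": (title, author, abstract, paragraph+)
SUMMARY_REPORT = ("seq", ["title", "author", "abstract", ("plus", "paragraph")])
```

Under this encoding, `conforms(SUMMARY_REPORT, ["title", "author", "abstract", "paragraph", "paragraph"])` holds, while the same call without any "paragraph" fails because the "plus" constructor requires at least one occurrence.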
The symbol "<!" prefixed to each of the statements making up the document type definition is a markup declaration delimiter. The words "DOCTYPE" and "ELEMENT," which follow the delimiter with no intervening space, are declaration keywords. Specifically, "<!DOCTYPE" in the first statement is a reserved word indicating that the description that follows is a document type definition. The keyword "<!ELEMENT" prefixed to each subsequent statement is a reserved word indicating that the description that follows designates the content of a document element (called a lower structure). The names given next (summary report, title, author, abstract, paragraph, etc.) are the names of the logical units, i.e., the elements, of the target document.
The subsequent symbols ("- -," "- O," "O O," etc.), as will be discussed later, indicate whether or not the delimiter tags (a start tag followed by an end tag) enclosing the item in question are omissible, in the order indicated. The symbol "-" means that the corresponding tag is not omissible, whereas the symbol "O" indicates that the corresponding tag is omissible. For example, the symbols "- O" mean that the start tag is not omissible while the end tag is omissible.
The word "#PCDATA" in the lower structure of the document elements is one of the reserved words of the SGML. In structural terms, the word means that the content is character data. In the above example of defining the document type for the document "summary report," the word "#PCDATA" indicates that the content of the structure including the elements "title," "author," "abstract" and "paragraph" is character data.
In the document type definition (DTD) above, the generic identifier "summary report" and the identifiers "title," "author," "abstract" and "paragraph" with respect to the logical-unit elements of other documents may all be regarded as type names. In that case, the document generic logical information "summary report" takes on a directed sequential tree structure having types and constructors as its nodes, as shown in FIG. 6. Type names generally recur, so that the data structure composed thereof is rarely a directed sequential tree structure; it most often turns out to be a directed sequential graph structure.
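The distinction between a tree and a graph can be made concrete with a short sketch, offered only as an illustration. It assumes that each type name is a single shared node and that a DTD is held as a plain dictionary from type name to content-model text. The helper names `referenced_types` and `is_tree` are hypothetical.

```python
import re

# Content models keyed by type name, taken from the "summary report" DTD.
SUMMARY_REPORT = {
    "summary report": "(title, author, abstract, paragraph+)",
    "title": "(#PCDATA)",
    "author": "(#PCDATA)",
    "abstract": "(#PCDATA)",
    "paragraph": "(#PCDATA)",
}

def referenced_types(content_model):
    """Return the type names mentioned in a content model, constructors removed."""
    parts = (p.strip() for p in re.split(r"[(),&|+?*]", content_model))
    return [p for p in parts if p and p != "#PCDATA"]

def is_tree(dtd, root):
    """True when no type node is reached twice while walking down from `root`."""
    seen, stack = set(), [root]
    while stack:
        name = stack.pop()
        if name in seen:        # a recurring or recursive type name: a graph
            return False
        seen.add(name)
        stack.extend(referenced_types(dtd.get(name, "")))
    return True
```

With these definitions, `is_tree(SUMMARY_REPORT, "summary report")` is true, whereas a model in which a type name recurs, such as `{"history": "(date?, updater?, date+)"}`, is reported as a graph.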
Shown illustratively below is a typical SGML document defined by the DTD of the document generic logical information representing the logical structure of the above document "summary report":
______________________________________
<summary report>
<title> SGML study report </title>
<author> Taro Fuji </author>
<abstract> This report reports on . . . </abstract>
<paragraph> This is the first paragraph. </paragraph>
<paragraph> This is the last paragraph. </paragraph>
</summary report>
______________________________________
The parts above such as <title>, </title>, etc., each beginning with the symbol "<" and ending with the symbol ">," are tags. They are delimiters used to delimit the elements (also called entities) of the document. Illustratively, there are two kinds of tags: a start tag indicating the start of the description of a document element (e.g., <title>), and an end tag denoting the end of the description of that document element (e.g., </title>).
The part enclosed by the start tag <summary report> and the end tag </summary report> becomes what is known as an instance of the document "summary report" with respect to its document type definition (DTD) class. For each element of the document, the part enclosed by the start tag <title> and the end tag </title> is the element that becomes "title." Likewise the part enclosed by the start tag <abstract> and the end tag </abstract> is the element that becomes "abstract," and the part enclosed by the start tag <paragraph> and the end tag </paragraph> is the element that becomes "paragraph."
Conventionally, structured documents expressed as described above are edited by an editor (i.e., document editing apparatus) constituted so as to edit the target documents by utilizing document generic logical information which is embedded in the editor and which is common to these documents. In another arrangement, an analysis processor is activated to read the description of document generic logical information typically in text form before analyzing the document in question in accordance with the document generic logical information thus read. Such an analysis processor constitutes part of the editor for editing specific documents. These arrangements make it possible to create a specific set of documents based on the same document generic logical information. These sets of structured documents may be subjected to various kinds of automatic processing. For example, only titles may be extracted from a plurality of documents having as their document generic logical information the document type definition of the above document "summary report." The extracted titles may be used to prepare a summary. It is also possible to search through the documents using a character string contained in an abstract.
Each document is generated on the basis of the types and constructors constituting document generic logical information (a document type definition). The logical meaning (document structure) of a particular document thus generated is defined by the document generic logical information common to a plurality of documents. However, even if each document is generated on the basis of such logical information, it may not be possible to extract as a tree structure some of the logical units of a specific document that are defined by the document generic logical information.
For example, the "paragraph," an element of the document "summary report" based on the above-described document type definition, may be structurally defined in more detail. Specifically, the paragraph may be constituted by a paragraph heading and by repetitions of at least one paragraph content. An example of such definition is as follows:
______________________________________
<!DOCTYPE report [
<!ELEMENT report - O (title, author, history, abstract, paragraph+)>
<!ELEMENT title - O (#PCDATA)>
<!ELEMENT author - O (#PCDATA)>
<!ELEMENT history - O (date?, updater?, date+)>
<!ELEMENT abstract - O (#PCDATA)>
<!ELEMENT paragraph - O ((paragraph heading, paragraph content+)+)>
<!ELEMENT paragraph heading - O (#PCDATA)>
<!ELEMENT paragraph content - O (#PCDATA)>
]>
______________________________________
The provisions of the above document type definition illustratively allow a specific document of the following document structure to be created:
______________________________________
<report>
<title> SGML study report </title>
<author> Taro Fuji </author>
<abstract> This report reports on . . . </abstract>
<history>
<date> OCT/20/'93 </date>
<date> OCT/10/'93 </date>
</history>
<paragraph>
<paragraph heading> 1. Paragraph 1 </paragraph heading>
<paragraph content> This is the first passage of Paragraph 1. </paragraph content>
<paragraph content> This is the last passage of Paragraph 1. </paragraph content>
<paragraph heading> 2. Paragraph 2 </paragraph heading>
<paragraph content> This is the first passage of Paragraph 2. </paragraph content>
<paragraph content> This is the last passage of Paragraph 2. </paragraph content>
</paragraph>
</report>
______________________________________
In the structure of the above document, the document elements which contain the character string "Paragraph 2" in the "paragraph content" and which are logical units that may be extracted are "paragraph content," "paragraph" and "report." Logically, however, the portion ranging from the "paragraph heading" containing the character string "Paragraph 2" to the "paragraph content" elements that follow it also constitutes a logical unit. According to the description (document generic logical information) of the document type definition (DTD) of the SGML, there exists no node representing this grouping. Thus such a logical unit cannot be extracted as a tree structure.
Traditionally, the document generic logical information for defining the above document structure is used primarily for text styling purposes in editing documents. Thus under constraints of the document generic logical information, there have been few problems even if it is impossible to extract a logical unit as a tree structure from structured documents.
However, automatic document processing such as search, manipulation and composition cannot be performed efficiently unless the elements of documents can be handled in accordance with the document structure. If it is impossible to extract as a tree structure a given logical unit of a specific document during automatic processing, there are two options: either to extract the logical unit in whatever structure is currently available (i.e., other than the tree structure), or to extract a tree structure hierarchically higher than, and inclusive of, the logical unit that needs to be extracted.
In the first option, the target logical unit may be extracted illustratively as a string or a tuple of a plurality of tree structures. In that case, the user is required, upon such extraction, first to designate the applicable part using a string, a tuple, etc. more complicated than the tree structure and then to describe the necessary processing based on the particular structure of the string, tuple, etc.
In the second option above, the extracted tree structure contains unnecessary information other than the desired logical unit. This requires executing a script describing the processing by which to extract only the necessary structure. Either of these options involves extra chores and additional tasks such as the transfer and/or the removal of superfluous data.
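The two options can be sketched as follows for the "report" document whose "paragraph" holds headings and contents. This is only an illustration under stated assumptions: the document is transcribed as XML so that Python's standard `xml.etree.ElementTree` can parse it, element names use underscores in place of spaces, and the helper name `unit_as_tuple` is hypothetical.

```python
import xml.etree.ElementTree as ET

# XML transcription of the "paragraph" part of the example "report" document.
DOC = """<report>
<paragraph>
<paragraph_heading>1. Paragraph 1</paragraph_heading>
<paragraph_content>This is the first passage of Paragraph 1.</paragraph_content>
<paragraph_content>This is the last passage of Paragraph 1.</paragraph_content>
<paragraph_heading>2. Paragraph 2</paragraph_heading>
<paragraph_content>This is the first passage of Paragraph 2.</paragraph_content>
<paragraph_content>This is the last passage of Paragraph 2.</paragraph_content>
</paragraph>
</report>"""

def unit_as_tuple(paragraph, heading_text):
    """Option 1: return the heading and the contents after it as a tuple of sibling trees."""
    unit = []
    for child in paragraph:
        if child.tag == "paragraph_heading":
            if unit:                    # a later heading ends the unit already collected
                break
            if heading_text in (child.text or ""):
                unit.append(child)
        elif unit:                      # contents that follow the matching heading
            unit.append(child)
    return tuple(unit)

root = ET.fromstring(DOC)
paragraph = root.find("paragraph")

# Option 1: no single node covers the unit, so the caller receives a tuple of three trees.
unit = unit_as_tuple(paragraph, "Paragraph 2")

# Option 2: take the enclosing "paragraph" tree, then identify what must be discarded.
superfluous = [c for c in paragraph if all(c is not u for u in unit)]
```

In both cases the caller must write structure-specific logic: option 1 yields three sibling trees rather than one, and option 2 leaves three superfluous children of "paragraph" to be removed.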
In recent years, word processors have been required to let users submit their individual documents to automatic processing. However, conventional word processors have yet to deal sufficiently with the editing of specific documents in accordance with the logical structure stipulated by the document generic logical information embedded in the editor of the processors. It is thus necessary first to read the description of document generic logical information, next to edit the target document, and then to edit the document with respect to its logical structure. Another problem with conventional editors is that they have yet to support the preparation of document generic logical information allowing a logical unit to be extracted as a tree structure from a particular document.
There are cases where a specific document is prepared on the basis of the types and constructors constituting a document generic logical structure, the document being subsequently interpreted in terms of the document structure. In some of these cases, the document in question may not be interpreted uniquely in accordance with the document generic logical information. For example, if the "paragraph" in the document type definition of the above "summary report" is not divided into the "paragraph heading" and "paragraph content," then the document type definition will be given as follows:
______________________________________
<!DOCTYPE report [
<!ELEMENT report - O (title, author, history, abstract, paragraph+)>
<!ELEMENT title - O (#PCDATA)>
<!ELEMENT author - O (#PCDATA)>
<!ELEMENT history - O (date?, updater?, date+)>
<!ELEMENT abstract - O (#PCDATA)>
<!ELEMENT paragraph - O (#PCDATA)>
]>
______________________________________
The provisions of the above document type definition illustratively allow a particular document of the following document structure to be created:
______________________________________
<report>
<title> SGML study report </title>
<author> Taro Fuji </author>
<abstract> This report reports on . . . </abstract>
<history>
<date> OCT/20/'93 </date>
<date> OCT/10/'93 </date>
</history>
<paragraph> This is the first paragraph. </paragraph>
<paragraph> This is the last paragraph. </paragraph>
</report>
______________________________________
In the document of the above structure, each of the elements "date" in the document element "history" is interpreted by use of the logical structure provision <!ELEMENT history - O (date?, updater?, date+)>. In that case, there can be two interpretations. The first interpretation is that "<date> OCT/20/'93 </date>" corresponds to "date?" and "<date> OCT/10/'93 </date>" to "date+." The second interpretation considers that "<date> OCT/20/'93 </date>" and "<date> OCT/10/'93 </date>" both correspond to "date+."
Thus according to the first interpretation, "<date> OCT/20/'93 </date>" represents the last update date and the last updater is omitted. By the second interpretation, "<date> OCT/20/'93 </date>" is simply an update date other than the last update date, and the last update date and the last updater are omitted.
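The two interpretations can be enumerated mechanically. The sketch below is illustrative only: it handles a flat "seq" model whose items carry at most one occurrence indicator, and the function name `interpretations` and the list-of-pairs encoding are assumptions.

```python
def interpretations(model, names):
    """List every way to assign `names`, in order, to the items of a flat sequence model.

    `model` is a list of (type name, indicator) pairs, where the indicator is
    "" (exactly once), "?" (zero or one) or "+" (one or more). Each result gives,
    per model item, how many of the names that item consumed.
    """
    bounds = {"": (1, 1), "?": (0, 1), "+": (1, None)}
    results = []

    def assign(item, pos, taken):
        if item == len(model):
            if pos == len(names):       # every name has been accounted for
                results.append(taken)
            return
        name, indicator = model[item]
        low, high = bounds[indicator]
        remaining = len(names) - pos
        limit = remaining if high is None else min(high, remaining)
        for count in range(low, limit + 1):
            if all(n == name for n in names[pos:pos + count]):
                assign(item + 1, pos + count, taken + [count])

    assign(0, 0, [])
    return results

# (date?, updater?, date+) applied to the two "date" elements of "history"
HISTORY = [("date", "?"), ("updater", "?"), ("date", "+")]
found = interpretations(HISTORY, ["date", "date"])
```

Here `found` holds two assignments: `[1, 0, 1]`, in which the first "date" matches "date?", and `[0, 0, 2]`, in which both match "date+". A model that yields more than one assignment for a document is exactly the case in which the document is not interpreted uniquely.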
As outlined, the logical information on documents is used conventionally for text styling purposes and does not necessarily stipulate the unique way in which each of these documents is to be interpreted. However, unless a specific document is always interpreted uniquely for automatic processing such as search, manipulation and composition of document elements, the following problem will occur:
For automatic processing, logical information is utilized in accordance with the logical meaning of the document structure. This means that if a particular document is not interpreted uniquely with respect to a generic logical structure, there cannot be a unique meaning ascribed to the logical information on the document in question. Illustratively, an automatic search based on logical information requires carrying out complex inquiries in order to obtain the desired result. To perform an automatic editing process requires executing a complex script. Meeting these requirements involves the user performing extra chores, which means that automatic processing aimed at acquiring the desired results is impossible. For example, it is virtually impossible automatically to carry out a process of extracting, from a set of documents having the above-described generic logical information, a summary composed of the element titles and the last update dates of all documents each having its last update date designated.