1. Field of the Invention
The present invention relates to a document processing apparatus for processing a structured document, a document type determining method for determining document types, and a hierarchical regular expression determining method for determining hierarchical regular expressions. More particularly, the invention relates to a document processing apparatus for processing documents formed from a plurality of document types, a document type determining method for determining the inclusive or intersectional relationship of the document types, and a hierarchical regular expression determining method for determining the inclusive or intersectional relationship of languages received by hierarchical regular expressions.
2. Related Background Art
In a structured document, contents of the document are called a logical structure and are expressed by a tree structure consisting of a plurality of document constructing elements such as chapter, section, figure, and the like. FIG. 46 is a diagram showing an example of the logical structure. Such a logical structure 101 is not arbitrarily formed but is formed according to a syntax called a document type.
FIG. 47 is a diagram showing an example of the document type. In a document type 102, rectangular nodes define types of elements. The label of each rectangular node shows the name of the element type. Rectangular nodes having the same name denote the same element type. The element type having the name "paragraph" in FIG. 47 is recursively defined.
Nodes shown as ovals define the connection of the elements. An oval node is called a constructor. For example, in FIG. 3, a SEQ node shows that every node connected to the SEQ node is generated sequentially as set forth by the numbers 1 and 2. An REP node denotes that any node connected to the REP node is generated on the basis of the document type. An OPT node denotes that a node connected to the OPT node does not have to appear, i.e., the node is optional. A CHO node denotes that any of the nodes connected to the CHO node may be generated on the basis of the document type. The definition of the document type in FIG. 47 is described from top to bottom as follows. An "Article" comprises one or more "sections", and a "section" is constructed by a "title", by zero or more "paragraphs" or "figures", and by zero or more "sections". As mentioned above, the "section" can be nested. The logical structure 101 in FIG. 46 satisfies the construction rules of the document type 102 in FIG. 47.
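As an illustration only, the construction rules described above can be sketched in Python by treating each constructor as an operator that builds a regular expression over the names of child elements. The function names, the dictionary DOCUMENT_TYPE, the reading of REP as "zero or more" (or "one or more" when so flagged), and the treatment of "title", "paragraph", and "figure" as elements without child elements are assumptions made for this sketch and are not taken from FIG. 47. In particular, the recursive definition of "paragraph" is omitted.

```python
import re
from dataclasses import dataclass, field

# Hypothetical encoding: each constructor returns a regular expression
# over child element names, where every name is followed by one space.
def elem(name):
    return f"(?:{re.escape(name)} )"

def SEQ(*parts):                 # the parts appear in the given order
    return "(?:" + "".join(parts) + ")"

def CHO(*parts):                 # any one of the parts appears
    return "(?:" + "|".join(parts) + ")"

def REP(part, min_one=False):    # assumed: zero or more, or one or more
    return f"(?:{part})" + ("+" if min_one else "*")

def OPT(part):                   # the part may be absent
    return f"(?:{part})?"

# Assumed simplification of the document type described in the text.
DOCUMENT_TYPE = {
    "Article": REP(elem("section"), min_one=True),
    "section": SEQ(
        elem("title"),
        REP(CHO(elem("paragraph"), elem("figure"))),
        REP(elem("section")),
    ),
    "title": "",      # no child elements in this sketch
    "paragraph": "",  # recursion of "paragraph" is omitted here
    "figure": "",
}

@dataclass
class Node:
    """One element of a logical structure (a tree)."""
    name: str
    children: list = field(default_factory=list)

def conforms(node, document_type):
    """Return True if the tree rooted at node satisfies the document type."""
    model = document_type.get(node.name)
    if model is None:
        return False  # unknown element type
    child_names = "".join(child.name + " " for child in node.children)
    if re.fullmatch(model, child_names) is None:
        return False
    return all(conforms(child, document_type) for child in node.children)
```

Under these assumptions, a logical structure such as an "Article" containing a "section" with a "title", a "paragraph", a "figure", and a nested "section" would be accepted, while a "section" lacking a "title" would be rejected.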
Although an example of a simple document type is shown in FIG. 47, document types in practical use are quite large, and it is not unusual for the number of element types in a document type to reach into the hundreds.
A document type resembles a schema in a database. That is, a document type describes the meaning of each element and the relationships among the elements. Just as processing of a database is executed according to the schema, a structured document is processed on the basis of information about the document type. For example, a layout instruction is defined according to the document type. Based on the document type, a layout instruction is input, and a document layout is performed. In another example, necessary parts are extracted from existing documents and synthesized to form a new document. In this case, a new part can be inserted if necessary. In such a process, the information pertaining to the document type is used to specify necessary parts in a retrieving process step and to verify whether the newly constructed document is of a desired format in a verifying process step.
Some time after a document type is designed, the requirements for the document type usually change, and the document type definition is changed accordingly (this is called revision of the document type). In the case of SGML, by defining parameter entities in what is referred to as a document type declaration subset, parts of the document type definition can be customized (this is called customization of the document type).
When a document type is used for a long time or for many purposes, revision and customization give rise to a number of document types derived from the same document type, which are similar but differ in some respects. FIG. 48 is a diagram showing an example of a derivation of document types. The solid lines show a derivation of document types by revision and the broken lines show a derivation of a document type by customization. The diagram shows that an original document type S is customized, thereby forming new document types T and U, and that the document types S, T, and U are revised. Numbers beside S, T, and U show the number of revisions.
Usually, the user regards the document types derived from the same document type as the same, so that it is often necessary to simultaneously execute a process step such as retrieval of documents formed according to the document types.
However, since the process step for a structured document is described according to the document type, a process step used for the original document type may, depending on the revision or customization, not be applicable to a document type defined by a derivation. Since the number of document types is large, as mentioned above, and the rules of definition of the different document types are complicated, it is extremely difficult to grasp the relationship between the original document type and the document types defined by a derivation. Consequently, it is necessary to define and execute the process steps individually for each document type.
The following four techniques can be used to address the above problems: (1) architectural forms of HyTime (Hypermedia/Time-based Structuring Language; ISO/IEC 10744:1992); (2) the SDA (SGML Document Access) of ICADD (The International Committee of Accessible Document Design); (3) "Apparatus and method for managing document database" (hereinafter called "duplexing of logical structure") described in Japan Published Unexamined Patent Appln. No. Hei 8-190542; and (4) the technique defined in "Document database manager and document database system" (hereinafter called "semantic description") described in Japan Published Unexamined Patent Appln. No. Hei 7-319917.
The technique defined in the architectural form of HyTime is described in detail in Chapter 5.2 "Architectural forms" of "Making Hypermedia Work: A User's Guide to HyTime" (Steven J. DeRose and David G. Durand, Kluwer Academic Publishers, 1994). The SDA technique is described in detail in Chapter A.8 "Facilities for Braille, large print and computer voice" of ISO 12083:1994.
The above four conventional techniques can be divided into two kinds of methods: (1) normalization of the logical structure and (2) normalization of the element type. Architectural forms, SDA, and duplexing of logical structure belong to the former, and semantic description belongs to the latter.
FIG. 49 is a diagram showing processes using the method of normalization of the logical structure. The diagram shows an example where documents are retrieved and a new document is created by synthesizing the results. According to this method of normalizing the logical structure, before executing a common process for documents 111 to 113 of a plurality of document types T_1 to T_n, for example, a document 114 having a logical structure conforming to a specific document type S serving as a reference is formed from the logical structures of the documents to be processed, and the formed document 114 is processed.
FIG. 50 is a diagram showing a process using the method of normalization of elements. The diagram shows an example in which documents are retrieved and the results are synthesized, thereby creating a new document. In this method of normalizing the elements, attention is paid to the purpose (usage, meaning) of the element type. For example, the element types of documents 121 to 123 derived from a plurality of document types T_11 to T_1n which have the same purpose are regarded as the same. The purpose is either added to the document type definition or expressed so as to be related to the document type definition. When executing a retrieval step, a retrieval expression is converted by using information pertaining to the purpose. Documents 121a to 123a are obtained as the retrieval results having logical structures conforming to the original document types T_11 to T_1n. Consequently, prior to synthesis, it is necessary to preliminarily generate logical structures of the document type targeted for synthesis from the logical structures obtained by the retrieval step.
Features of the conventional techniques will be respectively described hereinbelow.
The architectural form is a kind of meta-document type for extending a document type definition (DTD) of SGML. Information showing the relationship between an architectural form and the elements or attributes is defined for a document type according to the architectural form. By using the architectural form, the same semantics can be given to a plurality of different elements. For example, in the case of the architectural form of HyTime, the semantics of hyperlinks which are generally required by hypermedia are expressed.
In order to execute a process using the architectural form, it is necessary to perform, for each SGML document, a legitimacy check of the SGML document and a legitimacy check against the meta-document type. The legitimacy check of an SGML document checks whether the document conforms to the rules defined by the document type. This check is performed by an SGML validating parser. In the legitimacy check against the meta-document type, element names and attribute names of the document are replaced by names of the architectural form, and whether the result can be parsed according to the meta-document type is checked. Once the replacement of the names is completed, this check can also be executed by the SGML validating parser. Only documents satisfying both legitimacy checks can be actual targets to be processed.
The document is processed by using not the document type itself but the architectural form as mentioned above, thereby enabling the document type to be freely designed in a predetermined range.
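The two checks described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not an SGML validating parser: the document type DOC_TYPE, the meta-document type META_TYPE, the name table FORM_NAMES, and all function names are hypothetical, content models are written directly as regular expressions over child element names, and attribute names are not handled.

```python
import re

def conforms(tree, content_models):
    """tree is (name, [children]); content_models maps an element name to a
    regular expression over its child names, each name followed by a space."""
    name, children = tree
    model = content_models.get(name)
    if model is None:
        return False
    tokens = "".join(child[0] + " " for child in children)
    if re.fullmatch(model, tokens) is None:
        return False
    return all(conforms(child, content_models) for child in children)

def rename(tree, form_names):
    """Replace element names by the names of the architectural form."""
    name, children = tree
    return (form_names.get(name, name),
            [rename(child, form_names) for child in children])

# Hypothetical document type of the document itself.
DOC_TYPE = {"memo": r"(?:heading )(?:para )+", "heading": "", "para": ""}

# Hypothetical meta-document type (architectural form).
META_TYPE = {"doc": r"(?:title )?(?:block )*", "title": "", "block": ""}

# Hypothetical relationship between element names and form names.
FORM_NAMES = {"memo": "doc", "heading": "title", "para": "block"}

def processable(tree):
    """Both checks must succeed before the document is processed."""
    first_check = conforms(tree, DOC_TYPE)
    second_check = conforms(rename(tree, FORM_NAMES), META_TYPE)
    return first_check and second_check
```

In this sketch the same `conforms` function serves both checks, which mirrors the statement that the second check can be executed by the same validating parser once the names have been replaced.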
SDA also extends the document type of SGML in a manner similar to the architectural form and is used to form a document of a preliminarily defined document type (canonical document type) from one SGML document. In a document type using SDA, the relationship between elements of the document type and elements of the canonical document type is described. With respect to the relationship, there are simple one-to-one correspondences and a correspondence according to the context.
When a document of the canonical document type is formed from a document of a document type using SDA, it is also necessary to check the legitimacy as an SGML document. If the document conforms, the document structure is rewritten according to the information added to the document type, thereby enabling the document of the canonical document type to be obtained.
Also, by processing the document by using the canonical document type in a manner similar to the case of the architectural form, the document type can be designed freely in a predetermined range.
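The rewriting of a document structure into the canonical document type can be sketched as follows. This is a minimal Python illustration, assuming that a one-to-one correspondence depends only on the element name and that a correspondence according to the context depends only on the name of the parent element. The tables SIMPLE and CONTEXTUAL, the element names in them, and the function to_canonical are hypothetical and do not reproduce the notation actually used by SDA.

```python
# Hypothetical one-to-one correspondences: source name -> canonical name.
SIMPLE = {"chapter": "division", "p": "para"}

# Hypothetical correspondences according to the context:
# (source name, name of the parent in the source document) -> canonical name.
CONTEXTUAL = {
    ("title", "chapter"): "heading",
    ("title", "figure"): "caption",
}

def to_canonical(tree, parent=None):
    """Rewrite a (name, [children]) tree into canonical element names.

    A contextual rule takes precedence over a one-to-one rule; an element
    with no rule keeps its name in this sketch.
    """
    name, children = tree
    new_name = CONTEXTUAL.get((name, parent), SIMPLE.get(name, name))
    return (new_name, [to_canonical(child, name) for child in children])
```

The context is taken from the source document rather than from the rewritten one, so that the rules can be stated entirely in terms of the document type to which they are attached.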
In the duplexing of a logical structure, when a document is stored into a database management apparatus, the logical structure in the database is formed from the original logical structure of the document. In the formation of the logical structure, a rule defined by a set comprising the document type of the original document and a document schema as a document type in the database is used. At the time of retrieval, by referring to only the logical structure in the database, a group of documents formed according to a plurality of document types can be uniformly retrieved.
According to the semantic description, when a document schema, equivalent to a document type, is defined in the database, a semantic description expressing the semantics is added to the elements of the document schema. The semantic description is defined in the database. A plurality of document types can be retrieved by forming, from a retrieval expression in which conditions regarding the document element are designated, a retrieval expression in which conditions regarding the semantic description are designated; by conversely forming, from a retrieval expression in which conditions regarding the semantic description are designated, a retrieval expression in which conditions regarding the document element are designated; or by a combination of these forming methods.
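The conversion of retrieval expressions described above can be sketched as follows. This is a minimal Python illustration, assuming that a retrieval expression is a pair of a name and a keyword and that each element has exactly one semantic description. The table SEMANTICS, the document type names "T1" and "T2", the element names, and the function names are all hypothetical.

```python
# Hypothetical semantic descriptions attached to the elements of two
# document schemas: schema name -> {element name: semantic description}.
SEMANTICS = {
    "T1": {"heading": "title", "para": "body-text"},
    "T2": {"caption": "title", "p": "body-text"},
}

def to_semantic_query(doc_type, element, keyword):
    """Element condition -> semantic-description condition."""
    return (SEMANTICS[doc_type][element], keyword)

def to_element_queries(semantic, keyword):
    """Semantic-description condition -> element conditions per document type."""
    return {
        doc_type: [(element, keyword)
                   for element, description in elements.items()
                   if description == semantic]
        for doc_type, elements in SEMANTICS.items()
    }

def across_types(doc_type, element, keyword):
    """Combination of both conversions: a condition written for one
    document type is rewritten for every document type."""
    semantic, keyword = to_semantic_query(doc_type, element, keyword)
    return to_element_queries(semantic, keyword)
```

In this sketch only the names of the elements are converted. The logical structures returned for each document type are left as they are, which is the limitation of this method that is discussed later in this section.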
The method of normalizing the logical structure (architectural form, SDA, duplexing of logical structure) has, however, the following problems.
First, it is difficult to design the document type serving as a reference. It is extremely difficult to define, before the document type is put into use, a reference document type that can withstand revision and customization over a long time. It is also difficult to define the document type serving as a reference after a plurality of document types have been derived by revision or customization.
Secondly, preparation costs are high. The burden on a designer of a document type is heavy when the document type is defined, since the designer has to acquire knowledge regarding the architectural form or SDA in order to fill the document type with information. According to the duplexing of logical structure, it is necessary to define a rule for forming a logical structure conforming to a document type from a logical structure of a document conforming to a certain document type. Furthermore, according to the duplexing of logical structure, in order to obtain a desired document type as a retrieval result, a rule for forming a logical structure of a desired document type from logical structures conforming to a document schema has to be defined. Since the definition of the rule is complicated, the burden on the end user (database manager) becomes heavier when the database is being operated.
Thirdly, execution costs are high. The normalization of the logical structure is a high cost process. Since logical structures of many documents have to be normalized, processing costs for preparation of a batch process are also high.
On the other hand, the method of normalizing the element (semantic description) has the following problems.
First, setting the semantic description is difficult. When the semantic description is too general, element types whose purposes are quite different are regarded as the same. Conversely, when the semantic description is too specific, element types are discriminated as having different purposes even if their purposes are the same in a sense. Consequently, it is very difficult to set a proper semantic description.
Secondly, it is necessary to normalize the structure for a document process. In the semantic description, attention is paid only to the purpose of element types, and their lower structure is ignored. Consequently, when a plurality of document types are retrieved by using the semantic description, the original logical structure of the document is obtained as a result. In order to execute a synthesis or the like of the retrieved result, it is necessary to form a logical structure conforming to the document type from the logical structure of the retrieved result. The execution cost of the process for forming the logical structure is high, and the processing time increases as the number of retrieval results increases.