1. Field of the Invention
This invention relates to a document database management system that retrieves a structured document from a document database that stores structured documents, and a document database retrieval method for retrieving a document from a document database that stores structured documents. More specifically, this invention relates to a document database management system which manages plural structured documents that are created according to a variety of document types, and a document database retrieval method for retrieving a structured document from plural structured documents that are created according to a variety of document types.
2. Description of the Related Art
A structured document is a document having a logical structure, which is represented as a tree structure comprising structural elements, such as a chapter, a section and a figure. Standardizing logical structures makes it easier to share documents and to process document structure.
FIG. 12 shows an example of the logical structure of a structured document. This figure includes an element 31 whose element type is "article" at the top, i.e., the root of the tree, and two elements 32 and 33 whose element type is "section" as children of the root. The element 32 has three children: an element 34 whose element type is "title", an element 35 whose element type is "paragraph" and an element 36 whose element type is "section". The element 33 has three children: an element 37 whose element type is "title" and two elements 38 and 39 whose element type is "paragraph". The element 36 has three children: an element 40 whose element type is "title" and two elements 41 and 42 whose element type is "paragraph".
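By way of illustration only, the logical structure described for FIG. 12 can be written down as a small tree data structure. The following is a minimal sketch; the names `Element`, `build_fig12` and `outline` are hypothetical and are introduced here solely for the example, and the order of children follows the order in which the text lists them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    number: int                      # reference numeral used in FIG. 12
    etype: str                       # element type, e.g. "section"
    children: List["Element"] = field(default_factory=list)


def build_fig12() -> Element:
    """Build the tree described for FIG. 12 (elements 31 through 42)."""
    sec36 = Element(36, "section", [
        Element(40, "title"),
        Element(41, "paragraph"),
        Element(42, "paragraph"),
    ])
    sec32 = Element(32, "section", [
        Element(34, "title"),
        Element(35, "paragraph"),
        sec36,
    ])
    sec33 = Element(33, "section", [
        Element(37, "title"),
        Element(38, "paragraph"),
        Element(39, "paragraph"),
    ])
    return Element(31, "article", [sec32, sec33])


def outline(element: Element, depth: int = 0) -> List[str]:
    """Return one indented line per element, in document order."""
    lines = ["  " * depth + f"{element.etype} ({element.number})"]
    for child in element.children:
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(outline(build_fig12())))
```

Printing the outline lists the root "article" first, followed by each "section" with its children indented beneath it.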
In typical structured document models, the logical structure is created according to a syntactic rule referred to as a document type. Structured documents have the advantage of being easy to process, since their structure is defined by predetermined structural rules. A structured document is simply referred to as a "document" hereinafter.
FIG. 13 shows a syntactic rule to which the logical structure of FIG. 12 conforms. In this figure, a square node defines a type of element (element type) and a label of the node indicates a name of the element type. Nodes having the same name are identical.
An oval node defines a relationship between elements. This node is referred to as a constructor. There are four constructors: "SEQ", "REP", "OPT" and "CHO". The constructor "SEQ" means that instances of the nodes appearing under it should be created in the order shown. An "instance" of an element type is an element in a document that is created according to the element type. The constructor "REP" means that an instance of the node following it is created repeatedly. The constructor "OPT" means that an instance of the node following it is not always required to be included in a document. The constructor "CHO" means that an instance of one of the nodes following it is created.
In this figure, the node 51 defining an element type "article" is a root node, the node 52 is a constructor "REP", the node 53 defines an element type "section", the node 54 is a constructor "SEQ", the node 55 defines an element type "title", both nodes 56 and 59 are constructors "OPT", both nodes 57 and 60 are constructors "REP", the node 58 defines an element type "section" and the node 61 defines an element type "paragraph". These nodes are arranged and connected in their numerical order.
FIG. 13 includes two nodes 53 and 58 whose element type is "section". This means that the element type "section" is defined recursively.
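The four constructors described above can be read as regular-expression operators over the sequence of child element types of an element. The following is a minimal sketch under stated assumptions: "REP" is taken to mean one or more repetitions, element type names are assumed not to contain the character ";", and the content model used in the example is hypothetical (including the element type "figure"); it is not a transcription of FIG. 13. The names `to_regex` and `conforms` are likewise introduced only for illustration.

```python
import re


def to_regex(model) -> str:
    """Translate a content model (nested tuples) into a regular expression.

    Each child element type is written as its name followed by ';'.
    """
    kind = model[0]
    if kind == "ELEM":                        # ("ELEM", "title")
        return "(?:" + re.escape(model[1]) + ";)"
    parts = [to_regex(sub) for sub in model[1:]]
    if kind == "SEQ":                         # all parts, in the given order
        return "(?:" + "".join(parts) + ")"
    if kind == "REP":                         # assumed: one or more times
        return "(?:" + parts[0] + ")+"
    if kind == "OPT":                         # zero or one time
        return "(?:" + parts[0] + ")?"
    if kind == "CHO":                         # exactly one of the alternatives
        return "(?:" + "|".join(parts) + ")"
    raise ValueError(f"unknown constructor: {kind}")


def conforms(model, child_types) -> bool:
    """Check whether a list of child element types satisfies the model."""
    text = "".join(name + ";" for name in child_types)
    return re.fullmatch(to_regex(model), text) is not None


# Hypothetical content model for a "section": a title, optionally followed
# by any mix of paragraphs and figures.
SECTION_MODEL = (
    "SEQ",
    ("ELEM", "title"),
    ("OPT", ("REP", ("CHO", ("ELEM", "paragraph"), ("ELEM", "figure")))),
)

if __name__ == "__main__":
    print(conforms(SECTION_MODEL, ["title", "paragraph", "figure"]))
    print(conforms(SECTION_MODEL, ["paragraph"]))
```

Combining "OPT" with "REP", as FIG. 13 does, expresses "zero or more" instances under the assumption that "REP" alone requires at least one.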
A document database management system managing structured documents stores many documents that are created according to a variety of document types. The document database management system provides a query language that allows users to describe queries to retrieve required documents from the document database. Typically, there are two types of query languages: one is described in a textual form, and the other is described graphically by using a graphical user interface.
FIG. 14 is an example of a query described graphically. A query for retrieving structured documents can include a condition regarding the structure of a document, comprising element types and relationships between them. In the query, each node corresponds to an element of a structured document, and a string in a node indicates the element type of that element. Nodes are connected to each other by arcs. An arc drawn as a solid line represents a parent-child relationship between the elements corresponding to the nodes at both ends of the arc. An arc drawn as a dashed line represents an ancestor-descendant relationship between the elements corresponding to the nodes at both ends of the arc. A "parent-child relationship" means that a node in a tree is immediately subordinate to another. An "ancestor-descendant relationship" means that a node in a tree is subordinate to another, whether immediately or through intermediate nodes. Thus, the "ancestor-descendant relationship" includes the "parent-child relationship". The query for retrieving structured documents can also include a condition regarding the contents of a document. A string under a node in FIG. 14 indicates that the string should be included in the text held by the element corresponding to the node.
Plural arcs coming from a node in a query mean that the retrieval result should satisfy all the conditions defined by the relationships between the nodes. Thus, the query is conjunctive. In this example, a node 71 whose element type is "section" is the root node. A node 72 whose element type is "title" is a child of the node 71, and the node 72 includes a string "document". A node 73 whose element type is "paragraph" is a descendant of the node 71, and the node 73 includes a string "database". In brief, this query designates instances of "section" each of which has both at least one title including a string "document" as its child and at least one paragraph including a string "database" as its descendant. Here, a "child" means a node immediately subordinate to another in a tree. A "descendant" means a node subordinate to another in a tree, whether immediately or through intermediate nodes. Thus, a child of a node is also a descendant of the node.
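The query semantics described above (solid arcs as parent-child conditions, dashed arcs as ancestor-descendant conditions, and all arcs from a node combined conjunctively) can be sketched as a recursive match. This is a minimal illustration, not the retrieval method of any particular system; the names `Element`, `QueryNode`, `matches` and `find_all`, the sample document and its text are all hypothetical, and the string condition is checked only against the text held directly by an element.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Element:
    etype: str
    text: str = ""
    children: List["Element"] = field(default_factory=list)


@dataclass
class QueryNode:
    etype: str
    contains: str = ""               # string that the element's text must include
    # Each arc is ("child", node) for a solid line or
    # ("descendant", node) for a dashed line.
    arcs: List[Tuple[str, "QueryNode"]] = field(default_factory=list)


def descendants(element: Element) -> Iterator[Element]:
    """Yield every element below the given one (children included)."""
    for child in element.children:
        yield child
        yield from descendants(child)


def matches(element: Element, query: QueryNode) -> bool:
    """True if the element satisfies the query node and all of its arcs."""
    if element.etype != query.etype or query.contains not in element.text:
        return False
    for relation, sub in query.arcs:
        pool = element.children if relation == "child" else descendants(element)
        if not any(matches(candidate, sub) for candidate in pool):
            return False             # conjunctive: every arc must be satisfied
    return True


def find_all(root: Element, query: QueryNode) -> List[Element]:
    return [e for e in [root, *descendants(root)] if matches(e, query)]


# The query of FIG. 14: a "section" with a child "title" including "document"
# and a descendant "paragraph" including "database".
FIG14_QUERY = QueryNode("section", arcs=[
    ("child", QueryNode("title", "document")),
    ("descendant", QueryNode("paragraph", "database")),
])

# Hypothetical sample document.
sec_a = Element("section", children=[
    Element("title", "structured document"),
    Element("section", children=[
        Element("title", "storage"),
        Element("paragraph", "stored in a database"),
    ]),
])
sec_b = Element("section", children=[
    Element("title", "types of document"),
    Element("paragraph", "syntactic rules"),
])
doc = Element("article", children=[sec_a, sec_b])
hits = find_all(doc, FIG14_QUERY)
```

In this sample, only the first top-level section is returned: its paragraph including "database" is a grandchild rather than a child, which the dashed-arc condition permits.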
Structured documents can be retrieved by using such a query. The query may also include an author of a document, a date of creation, a security level, or the like, as retrieval conditions.
There are two database management methods: one defines a schema for each document type (referred to as a first conventional system hereinafter) and another defines a unique schema representing an arbitrary logical structure (referred to as a second conventional system hereinafter).
The first conventional system is a common system in a general-purpose database management system such as a relational database management system or an object-oriented database management system. In a database management system using this method, the document type of documents to be retrieved is designated for retrieval, and documents that conform to the schema of the document type are searched.
The second conventional system is disclosed, for example, in Japanese Unexamined Patent Publication No. 7-44579. Since this method has only one schema, all documents stored in the database system will be prospective documents to be retrieved. Accordingly, even if there is more than one document type for the documents in the database, all stored documents are searched simultaneously.
Document types are very often altered or improved as time passes. If there is a long time lag between the time a document type is designed and the time documents are created according to the document type, the document type should be altered according to changes in the requirements for the document type. Document types that are occasionally altered in such a way are used for the same purpose, but the structural restrictions of the document types differ from each other.
For instance, when a document type is designed, the following steps are repeatedly executed. It is checked whether the logical structure that the document type specifies satisfies the requirements for the document type. If it does not satisfy the requirements, it is corrected to satisfy them. Thus, the definition of the document type is very often altered.
The design of a document type is strongly dependent on the requirements of the organization that uses the document type. Furthermore, each organization often designs its own document type to make documents appropriate for a particular purpose. However, these document types share structural similarities in many cases. Thus, documents created according to such document types are often exchanged between organizations and are stored in one database.
Here, an example is shown. Suppose that the document type shown in FIG. 13 is used in multiple departments. In one department, an element "cited references" is added to the document type, and the document type shown in FIG. 15 is created as a result.
FIG. 15 shows the document type, which is created by adding the element referred to as "cited references" to the document type as shown in FIG. 13. Nodes 81, 84, 86, and 88 through 95 in FIG. 15 correspond to nodes 51 through 61 in FIG. 13, respectively. In this example, a node 82 whose element type is "body text" is inserted between the node 81 whose element type is "article" and the node 84 representing constructor "REP". A node 83 whose element type is "cited references" is newly added as a child of the node 81. Further, a node 87 whose element type is "reference" is connected with the node 83 via the node 85 representing constructor "REP". An instance of "body text" is an element that has instances of "section", and an instance of "cited references" is an element that holds a list of instances of "reference".
The document type shown in FIG. 15 and the document type shown in FIG. 13 have different structural rules, but are used for the same purpose. If the department that creates the document type shown in FIG. 15 creates documents according to this document type, and other departments create documents according to the document type shown in FIG. 13, plural documents that are created according to a variety of document types, but are used for the same purpose, co-exist in one database.
As described above, a database might simultaneously hold plural document types that have different structural definitions but are used for the same purpose. In particular, when a large-scale database is used for a long time, such a situation is unavoidable. Since documents that are created according to such a variety of document types originally have the same purpose, such documents are required to be searched at once.
When document retrieval is conducted in existing database management systems storing plural documents created according to such a variety of document types, the following problems arise.
The first conventional system requires a document type to be designated in a query. Therefore, when documents to be retrieved are created according to a variety of document types, it is a burden for users to compose a query and execute a retrieval for each document type. For instance, documents satisfying the condition of the query shown in FIG. 14 can be created according to both document types shown in FIG. 13 and FIG. 15. Therefore, documents created according to both of the document types must be designated as prospective documents in the retrieval process, and each document type must be designated in a separate execution of the retrieval process. When the number of document types that have the same purpose is large, this is a heavy burden on the user.
In the second conventional system, all documents in the system are always designated as prospective documents regardless of the number of document types. Therefore, when the user intends to retrieve a document that is created according to a specific document type, all documents in the database, including documents created according to document types that the user does not desire, are searched as well. As a result, since precision (the ratio of the desired documents to the retrieved documents) decreases, the user must do extra work to pick out the desired documents from the retrieved results.
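The precision mentioned above can be made concrete with a short calculation. The following is a minimal sketch; the function name `precision` and the document identifiers are hypothetical and serve only to illustrate the ratio.

```python
def precision(retrieved, desired) -> float:
    """Ratio of desired documents among the retrieved documents."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0                   # convention chosen here for an empty result
    return len(retrieved & set(desired)) / len(retrieved)


# Hypothetical example: four documents are retrieved, but only two of them
# were created according to the document type the user had in mind.
retrieved_docs = {"d1", "d2", "d3", "d4"}
desired_docs = {"d1", "d2"}
example_precision = precision(retrieved_docs, desired_docs)  # 0.5
```

In this hypothetical case, half of the retrieved documents must be discarded by the user by hand.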
Further, in the second conventional system, all documents in the system are always designated as prospective documents, even if a query includes an element that is included only in documents created according to a specific document type.
For instance, consider a case in which a database that includes both documents created according to the document type shown in FIG. 13 and documents created according to the document type shown in FIG. 15 is searched by using a query including an element type "body text".
FIG. 16 shows an example of a query including "body text" as an element. In this query, a node 101 of "body text" is the root node, and a node 102 of "section" is connected to the node 101. The configurations of nodes 102 through 104 are the same as those of the nodes 71 through 73 shown in FIG. 14, respectively. Since this query designates a condition regarding the element "body text", which is not included in the document type shown in FIG. 13, no document that is created according to the document type shown in FIG. 13 can satisfy the query.
However, in the second conventional system, when the query shown in FIG. 16 is designated, all documents are designated as prospective documents, even though a document created according to the document type shown in FIG. 13 never satisfies the condition. Accordingly, the response time of the system increases because of unnecessary processing.
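The observation that a query naming an element type absent from a document type can never be satisfied by documents of that type can be checked mechanically. The following is a minimal sketch; it is not the method of either conventional system, and the names `can_satisfy`, `FIG13_TYPES`, `FIG15_TYPES`, `FIG14_QUERY_TYPES` and `FIG16_QUERY_TYPES` are hypothetical. The check is a necessary condition only: it ignores the relationships between elements and the string conditions.

```python
# Element types defined by each document type, as described in the text.
FIG13_TYPES = {"article", "section", "title", "paragraph"}
FIG15_TYPES = FIG13_TYPES | {"body text", "cited references", "reference"}

# Element types named by each query, as described in the text.
FIG14_QUERY_TYPES = ["section", "title", "paragraph"]
FIG16_QUERY_TYPES = ["body text", "section", "title", "paragraph"]


def can_satisfy(doc_type_elements, query_element_types) -> bool:
    """Necessary condition: every element type in the query must be
    defined by the document type."""
    return set(query_element_types) <= set(doc_type_elements)


if __name__ == "__main__":
    for name, types in (("FIG. 13", FIG13_TYPES), ("FIG. 15", FIG15_TYPES)):
        print(name, can_satisfy(types, FIG16_QUERY_TYPES))
```

For the query of FIG. 16, the check fails for the document type of FIG. 13 and passes for that of FIG. 15, whereas the query of FIG. 14 passes for both.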