1. Field of the Invention
The present invention generally relates to management of documents having a regular document format such as legal documents, and particularly to a method and apparatus for generating a structured document from a non-structured document. A "non-structured document" is a document, entered through character recognition, a word processor, or the like, which does not contain information explicitly showing its structure. A "structured document" is a document which contains information explicitly showing its structure.
2. Description of the Related Art
In a known method of generating a structured document, information explicitly showing the document structure is embedded in a text. Generally, a document generated by a user (hereinafter called a "document instance") often contains two portions: a portion designating a file which describes a document structure definition, and a text content portion. The document structure definition defines the document structure and the marks indicating its elements (such a mark is hereinafter called a "tag"). The document structure definition is often set so that a document to be structured can be used efficiently. The tags defined by the document structure definition are inserted into the text content portion in order to explicitly express the document structure and to uniquely determine the string which constitutes the element indicated by each tag.
In outputting a document instance structured in the above manner, an image to be output is generated by referring to a file which describes a layout definition defining what format is used for outputting each component (hereinafter called an "element") of the document structure. In this method, the document instance and the layout definition are independent so that any document instance can be used irrespective of the type of an apparatus or system to be used for the output.
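The independence of the document instance and the layout definition described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the element names, the representation of a document instance as (tag, string) pairs, and the two layout definitions are hypothetical and do not come from any particular system.

```python
# Hypothetical document instance: a list of (tag, string) pairs.
document_instance = [
    ("title", "Structured Document Generation"),
    ("author name", "A. Author"),
    ("paragraph", "Tags make the document structure explicit."),
]

# Two hypothetical layout definitions. Each maps an element to an output format.
plain_layout = {
    "title": "{text}\n",
    "author name": "by {text}\n",
    "paragraph": "    {text}\n",
}
markup_layout = {
    "title": "<h1>{text}</h1>\n",
    "author name": "<address>{text}</address>\n",
    "paragraph": "<p>{text}</p>\n",
}


def render(instance, layout):
    """Generate output by looking up the format of each element in the layout."""
    return "".join(layout[tag].format(text=text) for tag, text in instance)


plain_output = render(document_instance, plain_layout)
markup_output = render(document_instance, markup_layout)
```

The same document instance is passed unchanged to both calls; only the layout definition differs between the two outputs.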
The contents of a string of a structured document are explicitly expressed by inserting tags such as <author name> and <title>, each of which is in one-to-one correspondence with an element. Therefore, in combination with a tool such as a full text search system for structured documents, an aggregation of document instances themselves can be used as a database, and the document contents can be added or changed easily. Even if part of this database is lost by some failure, it is possible to know that this database has a lost portion, by comparing the original document structure definitions with the database of document instances.
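The detection of a lost portion by comparison with the document structure definition can be sketched as follows. This is only an illustration under stated assumptions: the document structure definition is reduced to a hypothetical list of required element names, and the tags are assumed to have SGML-like closing tags, which the description above does not specify.

```python
import re

# Hypothetical document structure definition: elements every instance must contain.
REQUIRED_ELEMENTS = ["title", "author name", "body"]

# Assumed tag syntax: <name>...</name>, where the name may contain spaces.
TAG_PATTERN = re.compile(r"<([^/<>]+)>(.*?)</\1>", re.DOTALL)


def parse_elements(text):
    """Return a mapping from element name to the string enclosed by its tags."""
    return {name: content for name, content in TAG_PATTERN.findall(text)}


def missing_elements(text, required=REQUIRED_ELEMENTS):
    """List the required elements that do not appear in the document instance."""
    found = parse_elements(text)
    return [name for name in required if name not in found]


intact = "<title>T</title><author name>A</author name><body>B</body>"
damaged = "<title>T</title><body>B</body>"
```

Here `missing_elements(intact)` returns an empty list, while `missing_elements(damaged)` reports that the "author name" element is absent.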
Because of these advantages, structured documents are widely used for document management in document processing systems which store and use a large number of documents. Along with this, several approaches have been proposed for converting non-structured documents, such as existing paper documents and documents entered by a word processor, into structured documents.
JP-A-62-249270 and "Method of Converting Document Image into ODA Structured Document" (Journal of Papers of The Institute of Electronics, Information and Communication Engineers, D-11 Vol. J76-D11 No. 11 pp. 2274-2284) propose the following method. First, the field of a document type of a document is restricted. Next, a structured document is generated by using a document structure common in the restricted field (hereinafter called a "common document structure") and a document structure analysis rule.
With this method, a document structure usable in common within each field of documents, such as "technical document" or "business document", is set. Then, the document structure analysis rule is manually generated in order to analyze a non-structured document and extract its document structure. By using the document structure analysis rule, the non-structured document is converted into a document instance matching the common document structure. If there is an element which is specific to the document structure of an individual document (hereinafter called an "individual document structure") and which cannot be expressed by the common document structure, the document instance matching the common document structure is converted into a document instance matching the individual document structure.
With this method, however, the document structure subjected to the document structure analysis and the document structure analysis rule are dependent upon the field of a non-structured document. Therefore, in order to process a document in a different field, the document structure analysis rule for this field is required to be newly generated manually. This work requires a large amount of labor.
This method uses a single document structure analysis rule considered to be highly common to a plurality of types of documents in a specific field. Therefore, this single document structure analysis rule is not always optimal for each document, and an element specific to an individual document structure cannot be analyzed directly. In this case, it becomes necessary, after the document structure analysis, to convert the document instance again into another document instance matching the individual document structure. Specifically, tags of the first generated document instance are added, changed, or deleted. This work generally requires complicated operations and hence a large amount of labor.
Further, this method does not provide support for generating a rule for extracting a keyword. Therefore, the elements to be used as keywords must be determined manually, and the conditions of layout and string necessary for extracting a keyword must also be set manually.
Still further, this method does not provide means for supporting the determination of which elements are to be keywords (such an element is hereinafter called a "keyword-corresponding element"). Elements which contain string data are not always extracted as keywords. Elements having no characteristic layout or string are not extracted as keywords; instead, they are treated as a string between keywords, i.e., a non-keyword.
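The distinction between keywords and non-keywords can be illustrated with a minimal sketch. It is not the method of the cited references: the two string conditions are hypothetical examples for a legal-style text, layout conditions are ignored entirely, and every span that matches no condition is simply treated as a non-keyword.

```python
import re

# Hypothetical string conditions for two keyword-corresponding elements.
KEYWORD_RULES = [
    ("article number", re.compile(r"Article \d+")),
    ("section heading", re.compile(r"Section [A-Z]")),
]


def segment(text):
    """Split text into (element, string) pairs; unmatched spans are non-keywords."""
    matches = []
    for name, pattern in KEYWORD_RULES:
        for m in pattern.finditer(text):
            matches.append((m.start(), m.end(), name))
    matches.sort()

    segments, pos = [], 0
    for start, end, name in matches:
        if start < pos:
            continue  # skip a match that overlaps an earlier one
        gap = text[pos:start].strip()
        if gap:
            segments.append(("non-keyword", gap))
        segments.append((name, text[start:end]))
        pos = end
    tail = text[pos:].strip()
    if tail:
        segments.append(("non-keyword", tail))
    return segments


result = segment("Article 1 Scope of this law. Article 2 Definitions.")
```

In this example the strings "Scope of this law." and "Definitions." have no characteristic string of their own, so they are recovered only as the text lying between the extracted keywords.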
The restriction condition that "non-keywords should not be contiguous in a document instance" is imposed when determining which elements are keyword-corresponding elements. This is because a non-keyword is a "string between keywords" and must therefore always be contiguous to a keyword. However, conventional methods have no means for automatically checking whether an aggregation of elements determined as keyword-corresponding elements satisfies the restriction condition. If the aggregation of these keyword-corresponding elements does not satisfy the restriction condition, defects or errors occur when the rule for document structure analysis is generated or when the document structure is analyzed. It is then necessary to determine the keyword-corresponding elements again. This cycle must be repeated until a proper aggregation of keyword-corresponding elements is set.
Lastly, this method does not support setting the conditions of layout and string necessary for the extraction of a keyword. It is therefore necessary to manually collect the information necessary for the extraction of a keyword from the non-structured document itself, or from rules or the like defining the format of the non-structured document. This requires a large amount of labor.
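The restriction condition itself is simple to state as a check, as the following minimal sketch suggests. It rests on a strong simplifying assumption: the document structure is flattened into one fixed, hypothetical order of elements, whereas a real document structure definition may allow optional or repeated elements. The element names are illustrative only.

```python
def contiguous_non_keywords(element_order, keyword_elements):
    """Return adjacent element pairs in which neither element is a keyword."""
    violations = []
    for left, right in zip(element_order, element_order[1:]):
        if left not in keyword_elements and right not in keyword_elements:
            violations.append((left, right))
    return violations


# Hypothetical flattened element order of a document structure.
order = ["title", "title text", "author label", "author text", "abstract text"]

# A candidate aggregation of keyword-corresponding elements that violates the condition.
bad_choice = {"title", "author label"}
# Adding one more keyword-corresponding element removes the violation.
good_choice = {"title", "author label", "abstract text"}

bad_result = contiguous_non_keywords(order, bad_choice)
good_result = contiguous_non_keywords(order, good_choice)
```

With `bad_choice`, the elements "author text" and "abstract text" are adjacent non-keywords, so no keyword marks the boundary between them; with `good_choice`, the returned list is empty.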
JP-A-6-290173 gives the following description. A document structure indicating each element of a labeled document is generated by referring to a "schema" describing restricting information of the document structure, and then a structured document is generated.
In JP-A-6-290173, however, although use of the schema describing restricting information of the document structure is described, how the schema is generated is not described.