1. Field of the Invention
The present invention relates to a technology for processing form documents in an electronic non-structured document format.
2. Description of the Related Art
Conventionally, techniques have been developed for automatically inputting data contained in paper form documents. For fixed-form documents, a layout definition is prepared in advance, and character recognition is performed at the defined reading positions on the paper document to input the data.
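As a minimal sketch of the fixed-form approach described above, the following Python fragment assumes that a character recognizer has already produced words with page coordinates, and that the layout definition is a mapping from field names to rectangular reading positions. The names `Word`, `LAYOUT`, and `read_fixed_form`, as well as all coordinates, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # horizontal centre of the recognized word
    y: float  # vertical centre of the recognized word


# Hypothetical layout definition: field name -> (left, top, right, bottom).
LAYOUT = {
    "date": (400, 20, 580, 50),
    "total": (400, 700, 580, 740),
}


def read_fixed_form(words, layout):
    """Collect recognized words that fall inside each predefined reading position."""
    result = {}
    for field, (left, top, right, bottom) in layout.items():
        inside = [w for w in words if left <= w.x <= right and top <= w.y <= bottom]
        # Join in left-to-right order so multi-word fields read naturally.
        result[field] = " ".join(w.text for w in sorted(inside, key=lambda w: w.x))
    return result


words = [
    Word("2005-10-01", 480, 35),
    Word("12,000", 500, 720),
    Word("Quotation", 100, 35),  # outside every reading position, so ignored
]
fields = read_fixed_form(words, LAYOUT)
```

The sketch also suggests why this approach does not carry over to unknown layouts: a form that places the total elsewhere on the page yields an empty field until a new layout definition is prepared for it.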
If the layout of a form to be processed is unknown, however, preparing layout definitions is prohibitively costly, so an operator generally searches manually for the data corresponding to each heading and inputs it. This incurs a high labor cost. In particular, for form documents sent from outside companies, such as statements of delivery and quotations, it is difficult to specify the layout in advance, which raises the cost of computerization.
The same problem occurs with form documents in non-structured document formats created with Microsoft Word or Microsoft Excel. Data must be input by manual copying and pasting.
As described above, it is difficult to recognize and extract desired data from a paper form document for which no layout information is provided, or from a form document in an electronic non-structured document format. Automatic recognition and extraction for such documents has therefore been in demand (for example, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 17, no. 5, pp. 432-445, 1995, titled “Layout Recognition of Multi-Kinds of Table Form Documents” by T. Watanabe et al.). Techniques for logical structure recognition of non-fixed form documents exist. For example, in a technique disclosed in Japanese Patent Application Laid-Open Publication No. 2005-275830, data corresponding to a heading is searched for based on cell information of a table, without using a headings dictionary.
However, cell configurations are diverse, and when a heading and its data are present within the same cell, the above technique cannot be applied. Furthermore, when the cell information is erroneous, a wrong relationship between the heading and the data is formed as a result of the error.
In view of these problems, a method of extracting data corresponding to headings that have been given beforehand has come into wide use in recent years. In this method, a character string matching an entry in the headings dictionary is first extracted, and then the data corresponding to that character string is extracted. In the method disclosed in Japanese Patent Application Laid-Open Publication No. 2005-275830, even when subheadings are present under headings and data is also present under such subheadings, the data can be recognized regardless of the order of the subheadings. In the method disclosed in Proc. ICDAR, pp. 458-462, 2005, titled “Universal Data Capture Technology from Semi-structured Forms”, by Diar Tuganbaev et al., a heading is extracted and then the data corresponding to the heading is extracted.
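The two-step, dictionary-based extraction described above can be sketched as follows. This is an illustration only and does not reproduce either cited method: the dictionary contents, the function name `extract_by_headings`, and in particular the rule that the data is the next non-empty cell to the right of the heading are assumptions made for this example.

```python
# Hypothetical headings dictionary: logical element -> heading strings that may appear.
HEADINGS = {
    "total": ["Total", "Total amount"],
    "delivery_date": ["Delivery date", "Date of delivery"],
}


def extract_by_headings(rows, headings):
    """Find dictionary headings in `rows` and return the data paired with each.

    rows: list of rows, each a list of cell strings in left-to-right order.
    Assumption for this sketch only: the data for a heading is the next
    non-empty cell to its right on the same row.
    """
    # Step 1 support: map every known heading string to its logical element.
    lookup = {s.lower(): element for element, strings in headings.items() for s in strings}
    found = {}
    for row in rows:
        for i, cell in enumerate(row):
            element = lookup.get(cell.strip().lower())
            if element is None:
                continue  # this cell is not a known heading
            # Step 2: take the first non-empty cell to the right as the data.
            data = next((c.strip() for c in row[i + 1:] if c.strip()), None)
            if data is not None:
                found.setdefault(element, data)  # keep the first match only
    return found


rows = [
    ["Quotation", "", ""],
    ["Date of delivery", "", "2005-10-01"],
    ["Total", "12,000", ""],
]
extracted = extract_by_headings(rows, HEADINGS)
```

Because the sketch keeps only the first match for each logical element, it has no way to choose between several headings that share the same character string.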
However, although the method disclosed in Japanese Patent Application Laid-Open Publication No. 2005-275830 offers very high flexibility with respect to layouts, it cannot be applied when subheadings are omitted and can be applied only to form documents with headings. In other words, the restrictions on character strings are severe. The applicable form documents are therefore limited, and versatility is low.
Moreover, in the method disclosed in Proc. ICDAR, pp. 458-462, 2005, subheadings are extracted starting from headings, and the corresponding data is recognized last. However, a number of similar headings may be present within a form document, and once a subheading is erroneously recognized, all recognition performed after that point is also erroneous.
As described above, the conventional systems for recognizing logical structure from non-fixed form documents are inconvenient because faint line information and cell information are not used in processing the information within a table, or because right justification within a cell cannot be handled. For this reason, these systems are not suitable as methods of searching for data corresponding to headings or for subheadings corresponding to headings. Furthermore, if cell information is used, processing must be provided for a wide variety of cell combinations, and the cell combinations that can be handled are limited by the positional arrangement of the headings.
Moreover, recognition of character string information assumes that all levels of the hierarchy are present.
Since recognition proceeds from the highest level of the hierarchy, which corresponds to a heading, in form documents with many hierarchy levels the accuracy degrades as processing reaches the lower levels, such as subheadings and data. Once a heading of low accuracy is erroneously recognized, all recognition processing performed thereafter inherits the error.
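The degradation described above can be illustrated with a simple calculation. It assumes, purely for illustration, that each hierarchy level is recognized correctly with the same probability and independently of the others, and that an error at any level invalidates everything beneath it. Neither the figure of 0.95 nor the function name `accuracy_at_depth` comes from the cited methods.

```python
def accuracy_at_depth(per_level_accuracy, depth):
    """Probability that every level from the top heading down to `depth` is correct.

    Assumes independent, equally accurate levels (an illustrative simplification).
    """
    return per_level_accuracy ** depth


# Heading -> subheading -> data corresponds to depths 1, 2 and 3.
for depth in (1, 2, 3):
    print(depth, round(accuracy_at_depth(0.95, depth), 4))
```

Under these assumptions, data at the third level is correct in roughly 86% of cases even though each individual recognition step is 95% accurate.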
Because of their dependence on data, these systems cannot handle cases in which the headings are the same but the corresponding elements of the logical structure differ. This problem becomes a particular concern when many logical elements are considered: as the number of headings to be recognized increases, or as the number of hierarchy levels of the headings increases, more headings share the same character string. Accordingly, it becomes important to distinguish the character string corresponding to a desired heading from among multiple headings with the same character string, and to perform consistency processing on the recognized results.
In the above conventional technologies, the overall consistency processing of form documents is insufficient.