1. Field of the Invention
The present invention relates to analysis of the logical structure (relationships) among character strings on forms, including both paper forms and electronic forms.
2. Description of the Related Art
Conventionally, structured forms, that is, forms having fixed layouts, have been used to extract data from paper forms. In a structured form, characters or character strings having a certain meaning exist in certain areas, and field definitions for the layout are generated that define such characters, character strings, and their positions. Data are extracted by analyzing the characters and/or character strings written in the relevant areas. However, the cost of creating field definitions for a layout becomes disproportionately high if the form is processed for only a few copies; hence, data are manually input for such a form, which instead incurs a high data entry cost.
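By way of illustration only, extraction based on field definitions as described above may be sketched as follows. The sketch assumes that character recognition has already produced character strings together with their positions on the page. The names FieldDefinition, RecognizedString, and extract_fields, as well as all coordinates and sample values, are hypothetical and are not taken from any particular system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldDefinition:
    """Hypothetical field definition: a named rectangular area of a fixed layout."""
    name: str
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # True when the point (x, y) lies inside the defined area.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass(frozen=True)
class RecognizedString:
    """Hypothetical recognition result: a character string and its center position."""
    text: str
    x: float
    y: float


def extract_fields(definitions, strings):
    """Assign each recognized string to the first field whose area contains it."""
    collected = {d.name: [] for d in definitions}
    for s in strings:
        for d in definitions:
            if d.contains(s.x, s.y):
                collected[d.name].append(s.text)
                break
    # Join the strings found in each area into a single value per field.
    return {name: " ".join(parts) for name, parts in collected.items()}


# Assumed layout: two fields at fixed positions.
definitions = [
    FieldDefinition("name", 100, 50, 300, 80),
    FieldDefinition("date", 100, 90, 300, 120),
]
strings = [
    RecognizedString("Taro", 120, 65),
    RecognizedString("Yamada", 180, 65),
    RecognizedString("2006-01-15", 150, 105),
]
fields = extract_fields(definitions, strings)
```

In such a scheme every area must be defined by hand for each layout, so the definitions have to be created anew for each different form, which corresponds to the cost noted above.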
More recently, it has become possible to identify data positions even if the order of headings in a table is changed. For example, Japanese Patent Application Laid-Open Publication No. 2005-275830 discloses a data extracting method for a case in which headings in the same column are reversed when a table is created.
However, conventional logical structure analysis for unstructured layouts is limited to either a mode of extracting data based on a hierarchical relationship between headings, such as the relationship between a main heading and a subheading, or on an equivalent relationship among the subheadings making up the hierarchy, or a logical structure analyzing mode applicable to a form having certain ruled lines and cells that define non-unique orders within heading groups.
Therefore, a problem arises in that the conventional process is not applicable to (1) a form in which data is identified by plural headings but for which it cannot be determined whether the data forms a table, (2) a form in which the same heading appears plural times, each occurrence corresponding to a respective piece of data, or (3) a form having a structure in which headings and corresponding data are not adjacent, such as a form arranged in the order of (heading 1), (heading 2), (data corresponding to heading 1), (data corresponding to heading 2).
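Cases (2) and (3) above may be illustrated by the following minimal sketch. It assumes that a form has already been reduced to a sequence of labeled character strings in reading order and pairs each heading with the string immediately following it. This adjacency rule is a deliberately simplistic stand-in used only for illustration, not the technique of any cited publication; the function name pair_adjacent and the sample strings are hypothetical.

```python
def pair_adjacent(tokens):
    """Pair each heading with the data string that immediately follows it.

    tokens is a list of (kind, text) tuples in reading order, where kind is
    "heading" or "data". The result is keyed by heading character string.
    """
    pairs = {}
    for (kind, text), (next_kind, next_text) in zip(tokens, tokens[1:]):
        if kind == "heading" and next_kind == "data":
            pairs[text] = next_text
    return pairs


# Case (3): headings and their data are not adjacent.
case3 = pair_adjacent([
    ("heading", "Name"),
    ("heading", "Date"),
    ("data", "Taro Yamada"),
    ("data", "2006-01-15"),
])
# "Name" receives no data and "Date" is paired with the wrong string.

# Case (2): the same heading appears twice, each with its own data.
case2 = pair_adjacent([
    ("heading", "Tel"),
    ("data", "111"),
    ("heading", "Tel"),
    ("data", "222"),
])
# Keying by heading string keeps only one of the two data items.
```

In the first call the result is {"Date": "Taro Yamada"}, and in the second it is {"Tel": "222"}, showing how a rule based only on adjacency and on the heading character string either mispairs or loses data in these layouts.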
For example, problems (1) and (3) above are not addressed by the technique disclosed in Japanese Patent Application Laid-Open Publication No. 2005-275830, since that technique relies on relationships between cells on the premise that a table is used. Further, if problem (2) is not addressed, relationships remain ambiguous when plural data items correspond to the same heading character string, resulting in decreased accuracy of the logical structure analysis.