Efforts to promote open data, which allows third parties to reuse information held by national governments or companies, have become a trend. Combining a variety of information, as with linked data, is expected to enable sophisticated search and analysis that have not been possible with existing techniques. Disclosed data sometimes has a format that is easily subjected to machine processing, such as resource description framework (RDF), and sometimes has a format that is not (a format in which the correspondence between a numerical value and an attribute is not strictly defined), such as Excel data or comma-separated values (CSV). How to efficiently convert such numerical tabular data into a format that is easily processed will become important in the future.
FIG. 1 is a diagram depicting an example of the numerical tabular data. The data includes numerical portion data, in which numerical values are set, and attribute portion data, in which character strings (text) are set. In this example, the attribute portion data is provided to the left of (in a left-hand direction from) and above (in an upper direction from) the numerical portion data. Depending on the numerical tabular data, the attribute portion data is sometimes present only to the left of, or only above, the numerical portion data.
FIG. 2 indicates that, in the numerical tabular data depicted in FIG. 1, a numerical value “37,825,636” surrounded with a thick frame is related to an attribute “cash benefits” in the left part and an attribute “FY 2005 (Heisei 17)” in the upper part.
FIG. 3 indicates that, in the numerical tabular data depicted in FIG. 1, the attributes surrounded with a thick frame include an implicit hierarchical structure. That is, within the thick frame, the character strings of attributes such as “retirement pensions” are indented, which implicitly indicates that these attributes are subordinate to “cash benefits”, but it is not possible to determine the hierarchical structure clearly from this alone. A person who understands the meaning of attributes such as “cash benefits” and “retirement pensions” is able to understand the hierarchical structure by viewing this numerical tabular data; however, it is difficult to process the data accurately by mechanical means.
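As an illustration only, a naive mechanical reading of such indentation can be sketched as follows. The function name `indent_hierarchy`, the use of leading spaces as the indent measure, and the indent depths given to the sample labels are all assumptions made for this sketch; they are not taken from FIG. 3. The sketch yields a sensible result only when the indentation is consistent, which is precisely what such tabular data does not guarantee.

```python
from typing import List, Tuple


def indent_hierarchy(labels: List[str]) -> List[str]:
    """Derive a hierarchical path for each label from its leading spaces.

    Assumption: a label indented deeper than the preceding one is its
    subordinate. Real tabular data often breaks this assumption.
    """
    stack: List[Tuple[int, str]] = []  # (indent width, label) of open ancestors
    paths: List[str] = []
    for raw in labels:
        indent = len(raw) - len(raw.lstrip(" "))
        name = raw.strip()
        # Close every ancestor that is at the same depth or deeper.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        stack.append((indent, name))
        paths.append("-".join(label for _, label in stack))
    return paths


# Hypothetical indentation for attribute names mentioned in the text.
SAMPLE_LABELS = [
    "I Elderly people",
    "  cash benefits",
    "    retirement pensions",
    "    other cash benefits",
]

for path in indent_hierarchy(SAMPLE_LABELS):
    print(path)
```

If one label in the sample were entered with a different number of leading spaces, the derived paths would change silently, which illustrates why indentation alone is an unreliable signal.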
FIG. 4 indicates that there is a sum input-output relation among the numerical values corresponding to the attributes in the thick frame depicted in FIG. 3. That is, the numerical value “37,188,028” corresponding to “cash benefits” is the sum of the numerical values “36,724,189” to “61,174” corresponding to “retirement pensions” to “other cash benefits”. Depending on the numerical tabular data, there is sometimes a product input-output relation instead of a sum input-output relation.
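The two kinds of input-output relation can be expressed as a simple check. The following is a minimal sketch; the function name `detect_relation` and the sample values are hypothetical and are not the values of FIG. 4, whose intermediate figures are not reproduced in the text.

```python
from math import prod
from typing import Sequence


def detect_relation(parent: int, children: Sequence[int]) -> str:
    """Classify how `parent` relates to `children`.

    Returns "sum" if parent equals the sum of the children, "product" if it
    equals their product, and "none" otherwise. The sum is tested first, so
    an ambiguous case such as 4 and [2, 2] is reported as "sum".
    """
    if parent == sum(children):
        return "sum"
    if parent == prod(children):
        return "product"
    return "none"


# Hypothetical values chosen only to exercise each branch.
print(detect_relation(100, [60, 30, 10]))  # sum relation
print(detect_relation(600, [20, 30]))      # product relation
print(detect_relation(7, [1, 2]))          # neither relation
```

Exact integer comparison is used here because the values in such tables are counts or amounts; data containing rounded decimals would need a tolerance instead.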
FIG. 5 is an example of attribute labeling performed by appropriately recognizing the hierarchical structure of the attributes “I Elderly people” to “other cash benefits” surrounded with a thick frame in the numerical tabular data depicted in FIG. 1. It is desired that such attribute labeling be performed automatically.
Hereinafter, an existing method of automatic attribute labeling will be described. The following description deals with a case in which the input regions (cells) of numerical values extend in a horizontal direction and the attributes are disposed above them, but the same applies to a case in which the input regions of numerical values extend in a vertical direction and the attributes are disposed to their left. Moreover, the following description deals with a case in which there is a sum input-output relation among the numerical values in the input regions, but the same applies to a case in which there is a product input-output relation among them.
FIG. 6 is a diagram depicting an example of the existing attribute labeling. In this example, the numerical tabular data which is input indicates the “total number” and the numbers of “deaths” and “injuries” for each of “traffic accidents” and “water accidents”.
In the past, a person who performs processing has set an attribute labeling pattern for such numerical tabular data in an information processing device and has made the information processing device perform labeling automatically. An example of such a pattern is: “A cell located immediately above a certain cell is treated as a master label. If the cell located immediately above the certain cell is blank, a non-blank cell which is located on the left-hand side of the cell located immediately above the certain cell and is closest thereto is treated as a master label. If there are a plurality of stages, processing is performed recursively from a lower stage for each row of an upper stage.” For example, for the label “total number” at the left end of the numerical tabular data, the label “traffic accidents” located immediately above it is treated as a master label, and the label “total number” is regarded as a label having the hierarchical structure “traffic accidents-total number”. For the label “deaths” next to this “total number”, the cell located immediately above is blank, so the label “traffic accidents”, which is the closest non-blank cell on the left-hand side of that blank cell, is treated as a master label, and the label “deaths” is regarded as a label having the hierarchical structure “traffic accidents-deaths”. The same applies to the other labels. In this example, the labeling accurately reflects the hierarchical structure.
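The quoted attribute labeling pattern can be sketched in code as follows. This is an illustrative reading of the rule, not an implementation taken from any existing device: the names `find_master` and `label_columns`, the representation of the header as a list of rows of strings, and the cell layout assumed for FIG. 6 (each category label placed above the first column of its group) are assumptions made for this sketch.

```python
from typing import List, Optional


def find_master(row: List[str], col: int) -> Optional[int]:
    """Return the column of the master label in `row` for column `col`.

    The cell directly above is tried first; if it is blank, the closest
    non-blank cell to its left is used. None means no master was found.
    """
    for c in range(col, -1, -1):
        if row[c].strip():
            return c
    return None


def label_columns(header: List[List[str]]) -> List[str]:
    """Build a hierarchical label for each column of the bottom header row."""
    bottom = len(header) - 1
    labels = []
    for col in range(len(header[bottom])):
        parts = [header[bottom][col]]
        c = col
        # Walk upward through the header rows, one stage at a time.
        for r in range(bottom - 1, -1, -1):
            master = find_master(header[r], c)
            if master is not None:
                parts.insert(0, header[r][master])
                c = master
        labels.append("-".join(parts))
    return labels


# Assumed layout of the header rows in FIG. 6 (blank cells are "").
FIG6_HEADER = [
    ["traffic accidents", "", "", "water accidents", "", ""],
    ["total number", "deaths", "injuries", "total number", "deaths", "injuries"],
]

for label in label_columns(FIG6_HEADER):
    print(label)
```

With this assumed layout, every column receives the label of its own category, for example “traffic accidents-deaths” for the second column and “water accidents-total number” for the fourth.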
FIG. 7 is a diagram depicting another example of the existing attribute labeling. In this example, the positions of “traffic accidents” and “water accidents” in the numerical tabular data which is input are shifted to the right by one cell as compared to their positions in FIG. 6. As material for humans to view, this style is not unnatural: “traffic accidents” and “water accidents” are each displayed in the middle of a group of “total number”, “deaths”, and “injuries”.
In this case, if the same attribute labeling pattern as described above is applied, no other label is present in the cell located immediately above the label “total number” at the left end of the numerical tabular data, in a cell on the left-hand side of that cell, or in a cell above that cell. As a result, this “total number”, which is supposed to be labeled “traffic accidents-total number”, is incorrectly labeled simply as “total number”. Moreover, for the “total number” belonging to “water accidents”, the label “traffic accidents”, which is the closest non-blank cell on the left-hand side of the blank cell located immediately above this “total number”, is treated as a master label. As a result, this “total number”, which is supposed to be labeled “water accidents-total number”, is incorrectly labeled “traffic accidents-total number”.
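The two mislabelings described above can be reproduced with a short sketch. The rule quoted earlier is written here as a single function so that the example is self-contained; the function name `label_columns` and the cell layout assumed for FIG. 7 (the FIG. 6 category labels moved one cell to the right) are assumptions made for this illustration.

```python
from typing import List


def label_columns(header: List[List[str]]) -> List[str]:
    """Apply the 'cell above, else nearest non-blank cell to the left' rule."""
    bottom = len(header) - 1
    labels = []
    for col in range(len(header[bottom])):
        parts = [header[bottom][col]]
        c = col
        for r in range(bottom - 1, -1, -1):
            # Closest non-blank cell at or to the left of column c, if any.
            master = next(
                (k for k in range(c, -1, -1) if header[r][k].strip()), None
            )
            if master is not None:
                parts.insert(0, header[r][master])
                c = master
        labels.append("-".join(parts))
    return labels


# Assumed layout of FIG. 7: category labels shifted right by one cell.
FIG7_HEADER = [
    ["", "traffic accidents", "", "", "water accidents", ""],
    ["total number", "deaths", "injuries", "total number", "deaths", "injuries"],
]

for label in label_columns(FIG7_HEADER):
    print(label)
```

Under this layout the first column is labeled only “total number”, because nothing lies above it or to the left of the cell above it, and the fourth column is labeled “traffic accidents-total number”, because the closest non-blank cell to the left of the blank cell above it belongs to the preceding group.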
FIG. 8 is a diagram depicting another example of the existing attribute labeling. In this example, the numerical tabular data which is input is the data depicted in FIG. 6 with additional columns on its left-hand side: “total number”, “deaths”, and “injuries” for each of “earthquakes” and “tsunamis”, which belong to “disasters”, and “total number” for “disasters” itself. This example is a case in which there are a plurality of structural relations having different depths.
In this case, if the same attribute labeling pattern as described above is applied, “disasters”, which is located in the row above “traffic accidents” and “water accidents”, is treated as a master label for “traffic accidents” and “water accidents”. As a result, a large number of incorrect labels having “disasters” attached as a master label are undesirably generated.
On the other hand, a method of determining a hierarchical structure based on information defining the hierarchical structure of attribute values of tabular data and a method of determining a hierarchical structure based on the format or meaning of character strings in cells are disclosed (for example, see Japanese Laid-open Patent Publication No. 2013-257852, Japanese Examined Patent Application Publication No. 7-43707, and so forth).
Moreover, a method of judging whether or not cells have a master-slave relation by using indents or fonts as features, and extracting a combination having a tree structure, is disclosed (for example, see Zen Chen and Michael Cafarella, “Automatic Web Spreadsheet Data Extraction”, VLDB 2013 and so forth).