Most organizations, companies being a typical example, hold a large volume of documents that describe personnel information, customer information, financial information, facility information, and the like, either individually or in combination. The recent enforcement of a personal information protection law and the enactment of the Japanese Sarbanes-Oxley (SOX) act have increased the need to classify, manage, and protect such in-organization documents more strictly than before. Compared with a document describing customer information for only one person or financial information for only one division, a document describing a plurality of pieces of customer information or financial information together generally causes greater damage when leaked or lost, and is therefore considered more important in most cases. When many pieces of information of a specific type, such as customer information or financial information, are described, the individual pieces of information are normally listed in a table format. Thus, a capability of correctly detecting customer information or financial information in documents that use the table format is important for information management.
However, the method of describing the table data in a document that uses the table format varies greatly depending on the file format of the document and on how the table is configured. For example, in a document created with Microsoft Excel, table data is described in a dedicated format called the Excel book format. In another document, which is written in the hypertext markup language (HTML) format so that it can be read by a web browser, table data is described by using HTML-specific tags. Thus, the table data in each document is described by using structure information dedicated to the respective file format, and the element configuration varies from one piece of table data to another.
A conventional method of detecting table data or records described in various formats from documents has been disclosed in, for example, Patent Document 1 (Japanese Patent Application Laid-open No. 2003-150624). Patent Document 1 discloses a method of analyzing the structure of table data based on a TABLE tag, a TR tag, or the like when the target is an HTML document, and of similarly extracting the table data by using a structure analysis method dedicated to software such as Excel when the target is an Excel document. Also available is a method directed to table data that has no clear dividing lines as a table and whose elements are simply listed by using a text editor. This method is disclosed in, for example, Patent Document 2 (Japanese Patent Application Laid-open No. Hei 9-282208). Patent Document 2 discloses a method of identifying individual records, and thereby extracting the elements of table data, by predefining patterns of text data that identify the heads and tails of the records constituting the table data.
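For illustration only, the tag-based structure analysis described for HTML documents might look like the following minimal sketch. It is not the implementation of Patent Document 1. It assumes Python's standard `html.parser` module, and the names `TableRowParser`, `extract_html_records`, and `SAMPLE_HTML` are hypothetical. Each TR element is treated as one record, and each TD or TH element as one element of that record.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Hypothetical helper: collects the cell text of each TR element as one record."""

    def __init__(self):
        super().__init__()
        self.records = []   # completed records, one list of cell strings per TR
        self._row = None    # cells of the TR currently open, or None
        self._cell = None   # text fragments of the TD/TH currently open, or None

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag == "tr":
            self._row = []
            self._cell = None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.records.append(self._row)
            self._row = None
            self._cell = None


def extract_html_records(html_text):
    """Return the table records found in html_text as lists of cell strings."""
    parser = TableRowParser()
    parser.feed(html_text)
    parser.close()
    return parser.records


# Assumed sample data, not taken from any cited document.
SAMPLE_HTML = (
    "<table>"
    "<tr><th>Name</th><th>Phone</th></tr>"
    "<tr><td>Taro Yamada</td><td>03-1234-5678</td></tr>"
    "<tr><td>Hanako Suzuki</td><td>06-8765-4321</td></tr>"
    "</table>"
)

if __name__ == "__main__":
    for record in extract_html_records(SAMPLE_HTML):
        print(record)
```

Because the sketch relies entirely on HTML tags, the same table saved in another file format, for example as tab-separated text, yields no records, and a separate analyzer would be needed for that format.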
However, the conventional methods described above have the following problems.
A first problem is that it is generally not easy to prepare an individual table structure analysis method for each of the various file formats, because detailed specifications of the file formats may not be available.
A second problem is that, when the software used to create documents, or the file formats themselves, differ in version even though the file extensions are similar, the method of describing the structure of table data may vary, and each new file format that appears in the future has to be dealt with individually.
A third problem is that the conventional method that detects records by using text data description patterns rather than the file format, although it does not depend on the file format, requires the user to know all the record description patterns of the individual table data beforehand. It is therefore difficult to apply this conventional method to documents containing various types of table data described by many people or systems.
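The third problem can be illustrated with the following minimal sketch of pattern-based record detection. It shows the general idea only and is not the method of Patent Document 2. The function `extract_records`, the head and tail patterns, and both sample texts are assumptions made for this illustration. Records are detected only where the predefined patterns match, so table data written in a layout the user did not anticipate produces no records at all.

```python
import re


def extract_records(text, head_pattern, tail_pattern):
    """Return each span that starts at a head match and ends at the next tail match."""
    head = re.compile(head_pattern, re.MULTILINE)
    tail = re.compile(tail_pattern, re.MULTILINE)
    records = []
    pos = 0
    while pos <= len(text):
        h = head.search(text, pos)
        if h is None:
            break
        t = tail.search(text, h.end())
        if t is None:
            break
        records.append(text[h.start():t.end()])
        # Always move forward, even if a pattern matches the empty string.
        pos = t.end() if t.end() > pos else pos + 1
    return records


# Hypothetical patterns that the user must define in advance.
HEAD = r"^ID: \d+$"
TAIL = r"^END$"

# Assumed sample whose layout matches the predefined patterns.
KNOWN_LAYOUT = (
    "ID: 001\nName: Taro Yamada\nEND\n"
    "ID: 002\nName: Hanako Suzuki\nEND\n"
)

# Assumed sample written by someone else in a different layout.
UNKNOWN_LAYOUT = (
    "#1 | Taro Yamada | 03-1234-5678\n"
    "#2 | Hanako Suzuki | 06-8765-4321\n"
)

if __name__ == "__main__":
    print(len(extract_records(KNOWN_LAYOUT, HEAD, TAIL)))    # records found
    print(len(extract_records(UNKNOWN_LAYOUT, HEAD, TAIL)))  # nothing matches
```

Supporting the second layout would require the user to define another pair of patterns, which is the burden the third problem describes.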
An exemplary object of this invention is to provide an information classification device, an information classification method, and an information classification program each for accurately estimating individual records constituting table data even when there is no prior knowledge of file formats of the data or identification patterns of the records constituting the table data.