In the related art, companies and the like perform posterior investigation for specifying the original of leaked information (the leaked original) as one measure against information leakage. For such posterior investigation, features (fingerprints; hereinafter referred to as feature data) of a manipulated document are recorded together with a manipulation log at the time of a file manipulation such as browsing or saving the document. In the investigation of the information leakage, the leaked original is then specified by referring to the recorded feature data and finding the manipulation log of a document similar to the leaked information.
Patent Document 1: Japanese Laid-open Patent Publication No. 2005-4560
However, in the related art described above, when table data describing the values of various attributes (columns) for each record (row) are investigated, the feature data become huge in comparison with those of typical documents such as text, so that there is a problem in that the investigation takes a long time. For example, in the case of performing posterior investigation for leakage of table data, the feature data of the table data recorded at the time of manipulation are huge, so that the investigation takes correspondingly long.
FIG. 20 is an explanation diagram illustrating investigation of information leakage. As illustrated in FIG. 20, with respect to company confidential information 501, a manipulation log of a document 502 that a user 511 manipulates, for example by referring to or updating it, and feature data of the document 502 are recorded in an intra-company browsing file log 521. For example, together with a file name such as “kimitsu.doc” of the document 502 manipulated by the user 511, data representing features (lists of keywords) of the text “for the middle-term business plan . . . ” in the document 502 are extracted and recorded in the intra-company browsing file log 521.
Herein, it is assumed that a printout of the document 502 is leaked to the outside of the company as a document 503 after being edited and processed. The document 503 is assumed to have a file name such as “maruhi.pdf”, which differs from the file name of the document 502. The contents of the document 503 are also changed so that a portion differs from the contents of the document 502, for example, to “for the middle-term plan of the business department . . . ”. Therefore, in the posterior investigation of the information leakage, the document 502 is specified as the leaked original by searching the features recorded in the intra-company browsing file log 521 for features similar to those extracted from the document 503.
The feature data representing the features of a document are difficult to handle in keyword form, and thus, data obtained by hashing the keywords are used. For example, each keyword is hashed, and a range-reduced hash value is obtained by performing a modulo operation (mod) with a constant n (hereinafter, the value obtained after the mod operation with the constant n is referred to as the hash value, and the value before the mod operation is referred to as the intermediate hash value). Since the keywords included in the document are represented by hash values in this manner, feature data in which the features of the document are represented by an n×n effective graph are obtained.
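The hashing step described above can be sketched as follows. This is a minimal illustration rather than the method of the related art: the use of SHA-256 for the intermediate hash value, the rule of pairing consecutive keywords, and the names `intermediate_hash`, `range_reduced_hash`, and `extract_features` are all assumptions made for the example. The n×n graph is kept as a sparse set of flagged (row, column) cells instead of a dense matrix.

```python
import hashlib

N = 10000  # constant n; the text mentions a value of about 10000


def intermediate_hash(keyword: str) -> int:
    """Intermediate hash value (before the mod operation).

    SHA-256 is an assumption; the source does not name a hash function.
    """
    digest = hashlib.sha256(keyword.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def range_reduced_hash(keyword: str, n: int = N) -> int:
    """Hash value after the mod operation, in the range 0 .. n-1."""
    return intermediate_hash(keyword) % n


def extract_features(keywords: list[str], n: int = N) -> set[tuple[int, int]]:
    """Return the flagged cells of the n x n graph as (row, column) pairs.

    Pairing each keyword with the next one is a hypothetical rule chosen
    for illustration; the source only says that features are combinations
    of keywords.
    """
    hashes = [range_reduced_hash(k, n) for k in keywords]
    return set(zip(hashes, hashes[1:]))


features = extract_features(["middle-term", "business", "plan"])
```

Because only flagged cells are stored, the memory used grows with the number of keyword pairs and not with n×n.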
In addition, in the case where the value of n is set to about 10000 and the keywords are then hashed, different keywords may in some cases yield the same hash value. However, in the n×n effective graph, the features are represented by combinations of keywords. Therefore, documents whose contents are different are unlikely to be converted into the same set of hash values, and the effective graph has the property that different contents rarely produce identical feature data.
In the posterior investigation of the information leakage, a similar document (document 502 as the leaked original) is obtained by comparing the feature data recorded in the intra-company browsing file log 521 with the feature data extracted from the document 503.
FIG. 21 is an explanation diagram illustrating comparison of the feature data. As illustrated in FIG. 21, in the posterior investigation of the information leakage, a comparison result 504 is obtained by comparing feature data 502a recorded in the intra-company browsing file log 521 with feature data 503a extracted from the document 503. Based on the comparison result 504, a document similar to the document 503 among the documents whose features are recorded in the intra-company browsing file log 521 is specified as the leaked original.
More specifically, when the feature data 502a and 503a are each represented by an n×n effective graph, the combinations of features of a document can be represented in an n×n space by setting a flag (setting the value to 1) at each location where a pair exists. Therefore, in the comparison of the feature data 502a and the feature data 503a, the comparison result 504 is obtained by applying “and(&)” to the two n×n spaces. Next, the similarity between the documents corresponding to the feature data 502a and 503a is determined based on the number of 1s (true values) in the “and” of the n×n spaces, that is, in the obtained comparison result 504.
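The comparison described above can be sketched as follows, assuming each n×n space is held as a sparse set of flagged (row, column) cells. Under that assumption, set intersection gives the same result as an element-wise “and(&)” over the two spaces. The function name `and_compare` and the cell values are hypothetical, and the source does not state how the count of 1s is turned into a similarity decision, so the sketch stops at the count.

```python
Cell = tuple[int, int]  # (row, column) of a flag set to 1


def and_compare(features_a: set[Cell], features_b: set[Cell]) -> set[Cell]:
    """Cells flagged in both spaces, equivalent to an element-wise AND."""
    return features_a & features_b


# Hypothetical feature data standing in for 502a (logged) and 503a (leaked).
logged = {(12, 873), (873, 4410), (4410, 95)}
leaked = {(12, 873), (873, 7), (4410, 95)}

comparison_result = and_compare(logged, leaked)
common_flags = len(comparison_result)  # number of 1s in the AND
```

In this example two of the three logged cells also appear in the leaked feature data, which corresponds to a partially edited document.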
Herein, in the case where the documents 502 and 503 or the like are table data, each of the above-described keywords is replaced with one attribute value included in the table data. However, table data are obtained by extraction from original data stored in a database by using SQL statements or the like. Accordingly, the attributes included in the table data can easily be rearranged, and records can easily be rearranged, added, or removed. Therefore, in the case of extracting the features from the table data, features each formed as a combination of two attribute values are produced comprehensively so as to cope with rearrangement of the attributes and with rearrangement, addition, removal, or the like of records.
FIG. 22 is an explanation diagram illustrating extraction of the feature data from table data 505. As illustrated in FIG. 22, in the case of extracting the feature data from the table data 505, combinations such as “baseball” as keyword 1 and “43294” as keyword 2 are produced comprehensively. Therefore, the data amount (feature amount) of the feature data is the number of combinations of attributes such as “ID”, “name”, “age”, and “hobby” multiplied by the number of rows (number of records), which becomes huge. Consequently, in the case of performing the posterior investigation of table data leakage, the time (investigation time) taken for the processes of extracting the feature data and determining the similarity is increased in comparison with that for a typical document.
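The growth in feature amount described above can be illustrated with a short sketch. It assumes that one feature is produced for every unordered pair of attribute values within a record, which is one reading of "combinations of attributes"; the function name `table_pair_features` and all row values other than “43294” and “baseball” are hypothetical.

```python
from itertools import combinations
from math import comb


def table_pair_features(rows: list[tuple[str, ...]]) -> list[tuple[str, str]]:
    """Produce every pair of attribute values within each record."""
    pairs = []
    for row in rows:
        pairs.extend(combinations(row, 2))
    return pairs


columns = ["ID", "name", "age", "hobby"]
rows = [
    ("43294", "Sato", "34", "baseball"),   # name and age are made-up values
    ("43295", "Suzuki", "28", "tennis"),   # entirely made-up record
]

pairs = table_pair_features(rows)

# Feature amount = (number of attribute pairs) x (number of records).
expected_amount = comb(len(columns), 2) * len(rows)
```

With four attributes there are six pairs per record, so the feature amount grows linearly with the number of records and quadratically with the number of attributes.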