At present, enterprises collect various logs as a countermeasure against information leakage and use them to investigate the causes of leaks. For example, one approach is to select a file similar to the information that has been leaked and to investigate the cause of the leak from that file. To support this investigation, the log recorded upon a file operation, such as document browsing or storage, holds not the original text of the file but features of the file, obtained as a fingerprint representing features of the original text. Hereinafter, a fingerprint or fingerprinting will be denoted as “FP”.
For example, if a leaked file including confidential company information is found, comparing the FP of that file with the FPs registered in a browsing log in the company makes it possible to retrieve a file similar to the leaked file from the log. Further, by following the operation history of the file in the log that is similar to the leaked information, the cause of the information leakage can also be identified.
FP will be described specifically. FP is a technique for extracting features of a file. FIG. 27 is a diagram illustrating FP. For example, keywords and their arrangements are extracted from a text in a file, and ordered pairs of keywords within a specific range, each pair having a direction from the earlier keyword to the later keyword, are obtained as features. For example, if there is a first text, “Keyword 1 is a keyword 2, a keyword 3, and a keyword 4.”, the features of that first text will be six pairs of keywords, as illustrated by features 10a in FIG. 27.
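The pair extraction described above can be sketched as follows. This is a minimal illustration, not the method of the cited publications: it assumes the keywords have already been extracted in order of appearance, and the function name `keyword_pairs` and its `window` parameter (standing in for the "specific range") are hypothetical.

```python
from itertools import combinations

def keyword_pairs(keywords, window=None):
    """Return directed (earlier, later) keyword pairs.

    `keywords` is the list of keywords in order of appearance.
    `window` is a hypothetical limit on how far apart two keywords
    may be (in keyword positions); None means no limit.
    """
    pairs = set()
    for i, j in combinations(range(len(keywords)), 2):
        # combinations() yields i < j, so the direction is earlier -> later.
        if window is None or j - i <= window:
            pairs.add((keywords[i], keywords[j]))
    return pairs

# Keywords of the first text, in order of appearance.
first_text = ["keyword 1", "keyword 2", "keyword 3", "keyword 4"]
features_10a = keyword_pairs(first_text)
print(len(features_10a))  # 6
```

With four keywords and no range limit, every earlier-to-later combination is kept, which gives the six pairs mentioned for the first text.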
In FP, a similarity between texts is determined based on the number of matches between their features. For example, it is assumed that features of a second text are features 10b in FIG. 27. When the features 10a of the first text are compared with the features 10b of the second text, of the five pairs of keywords included in the features 10b, four pairs of keywords match the pairs of keywords of the features 10a. Specifically, “keyword 1→keyword 2”, “keyword 1→keyword 3”, “keyword 1→keyword 4”, and “keyword 3→keyword 4” match. It can be said that the greater the number of these matches is, the more similar the texts are to each other.
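The match count can be sketched as a set intersection. In this illustration the four matching pairs are taken from the description above, while the fifth pair of the second text is a hypothetical placeholder, since the text does not name it. The function name `similarity` is also an assumption.

```python
def similarity(features_a, features_b):
    """Number of keyword pairs that appear in both feature sets."""
    return len(features_a & features_b)

# Features of the first text: all six directed pairs of four keywords.
features_10a = {
    ("keyword 1", "keyword 2"), ("keyword 1", "keyword 3"),
    ("keyword 1", "keyword 4"), ("keyword 2", "keyword 3"),
    ("keyword 2", "keyword 4"), ("keyword 3", "keyword 4"),
}

# Features of the second text: the four matching pairs named in the
# description plus one non-matching pair (hypothetical placeholder).
features_10b = {
    ("keyword 1", "keyword 2"), ("keyword 1", "keyword 3"),
    ("keyword 1", "keyword 4"), ("keyword 3", "keyword 4"),
    ("keyword 4", "keyword 5"),  # hypothetical, not taken from FIG. 27
}

print(similarity(features_10a, features_10b))  # 4
```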
When the features are treated as data, it is difficult to handle the keywords as they are. Therefore, by making the keywords into hashes and applying a remainder operation (mod) with a constant n to obtain hash values with a narrowed range, the features of the text are represented by a validity graph of n×n. Hereinafter, a hash value will be defined as a value that has been subjected to mod with the constant n. The hash value before being subjected to mod will be defined as an intermediate hash value.
For example, if keywords are made into hashes with the value of n being set to about 10000, the same hash values may be obtained for different keywords and the accuracy may be reduced. However, since the features are in pairs of keywords, even if the same hash values are obtained for different keywords to some extent, the probability that both values of the pairs of keywords included in the features of different texts will be converted to the same hash values is low.
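The two-stage hashing can be sketched as below. The description does not specify a hash function, so the use of SHA-256 truncated to 64 bits as the intermediate hash is purely an assumption for illustration, as are the function names.

```python
import hashlib

N = 10000  # constant n; the description suggests a value of about 10000

def intermediate_hash(keyword: str) -> int:
    """Intermediate hash value: the hash before mod is applied.

    SHA-256 truncated to 8 bytes is an assumption; any stable hash
    function would serve the same purpose in this sketch.
    """
    digest = hashlib.sha256(keyword.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def hash_value(keyword: str, n: int = N) -> int:
    """Hash value as defined above: intermediate hash value mod n."""
    return intermediate_hash(keyword) % n

def hashed_feature(pair, n: int = N):
    """Convert a directed keyword pair into a pair of hash values."""
    first, second = pair
    return (hash_value(first, n), hash_value(second, n))

feature = hashed_feature(("keyword 1", "keyword 2"))
print(feature)  # a (row, column) position in the n x n representation
```

Under the simplifying assumption of a uniform hash, two distinct keywords share a hash value with probability of about 1/n, whereas both members of two distinct pairs collide with probability of about 1/n², which is why pairing keeps accidental matches rare even for a modest n.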
FIG. 28 is a diagram illustrating an example of a process of determining a similarity with validity graphs of n×n. An FP 11a in FIG. 28 represents an FP of a text A in an n×n validity graph. An FP 11b in FIG. 28 represents an FP of a text B in an n×n validity graph. For example, it is assumed that the text A includes a pair of keywords, “keyword 1→keyword 2”, a hash value of the keyword 1 is “0”, and a hash value of the keyword 2 is “2”. In this case, for the FP 11a, a value of a portion at which the row of “0” and the column of “2” intersect each other is set to “1”.
By taking an AND between the FP 11a and the FP 11b, a comparison result 11c is obtained. The number of “1”s included in the comparison result 11c will be a value indicating a similarity between the text A and the text B. In the example illustrated in FIG. 28, the similarity between the text A and the text B is “4”. These related-art examples are described, for example, in Japanese Laid-open Patent Publication No. 2010-231766, Japanese Laid-open Patent Publication No. 2014-115719, and International Publication Pamphlet No. WO 2006/048998.
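The one-to-one comparison can be sketched with small matrices. Only the entry at row 0, column 2 of text A comes from the description; the value n = 4 and all other entries are hypothetical and do not reproduce FIG. 28. The function names are likewise assumptions.

```python
def fp_matrix(hashed_pairs, n):
    """Build an n x n 0/1 matrix with a 1 at (row, column) for each pair."""
    matrix = [[0] * n for _ in range(n)]
    for row, col in hashed_pairs:
        matrix[row][col] = 1
    return matrix

def and_matrices(fp_a, fp_b):
    """Element-wise AND of two FP matrices (the comparison result)."""
    return [[a & b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(fp_a, fp_b)]

def count_ones(matrix):
    """Number of 1s in a matrix, used as the similarity."""
    return sum(sum(row) for row in matrix)

n = 4  # tiny hypothetical n so the matrices stay readable
# (0, 2) is the pair "keyword 1 -> keyword 2" from the description;
# the remaining entries are hypothetical.
fp_a = fp_matrix({(0, 2), (1, 3), (2, 0), (3, 1), (0, 1)}, n)
fp_b = fp_matrix({(0, 2), (1, 3), (2, 0), (3, 1), (2, 2)}, n)

result = and_matrices(fp_a, fp_b)
print(count_ones(result))  # 4
```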
With the above-described conventional technique, for one-to-one comparison between texts, for example, as described with respect to FIG. 28, a similarity between the texts can be determined by taking an AND between their FPs. In contrast, when a text similar to leaked information is to be retrieved from plural files in a log, one-to-many comparison among texts is performed. In this case, instead of repeating one-to-one comparison, the comparison of the respective texts is generally performed by use of a transposition index.
FIG. 29 is a diagram illustrating comparison by use of a transposition index. In FIG. 29, an FP 12 represents an FP of a retrieval text. Each feature included in the FP 12 is a hash value calculated from a pair of keywords included in the retrieval text. A transposition index 13 is a transposition index of plural texts included in a log, and associates their features with document identifiers. The features of the transposition index 13 are the hash values calculated from the pairs of keywords included in the texts. The document identifiers are information uniquely identifying the texts. For example, the first line of the transposition index 13 indicates that each of files identified by the document identifiers, “001, 003, 007, . . . ”, has the feature, “484893”.
When the FP 12 and the transposition index 13 are compared with each other, a comparison result 14 is obtained. For example, the comparison result 14 associates the document identifiers with amounts of features. Here, the amount of features of a text is the number of features, among the features included in that text, that match features in the FP 12 of the retrieval text. The greater the amount of features is, the higher the similarity is.
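The one-to-many comparison can be sketched as a lookup in a feature-to-documents mapping, a structure commonly called an inverted index. The feature “484893” and the document identifiers “001”, “003”, and “007” come from the description; every other value, and the function names `build_index` and `compare`, are hypothetical.

```python
from collections import Counter, defaultdict

def build_index(doc_features):
    """Map each feature to the set of document identifiers that have it."""
    index = defaultdict(set)
    for doc_id, features in doc_features.items():
        for feature in features:
            index[feature].add(doc_id)
    return index

def compare(query_fp, index):
    """Count, per document, how many of its features match the query FP."""
    amounts = Counter()
    for feature in set(query_fp):
        for doc_id in index.get(feature, ()):
            amounts[doc_id] += 1
    return amounts

# Features per logged text; all values except 484893 are hypothetical.
doc_features = {
    "001": {484893, 111111, 222222},
    "003": {484893, 333333},
    "007": {484893},
    "002": {555555},
}
index = build_index(doc_features)

fp_12 = {484893, 111111, 333333}  # hypothetical FP of the retrieval text
result_14 = compare(fp_12, index)
print(sorted(result_14.items()))
# [('001', 2), ('003', 2), ('007', 1)]
```

Each feature of the retrieval text is looked up once, so documents that share no feature with it, such as "002" here, are never touched.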
If the amount of data handled with the transposition index exceeds the capacity of the main storage, the retrieval cost increases as the amount of data increases. If data in the transposition index are simply deleted, features of the texts may be lost, reducing the retrieval accuracy. Therefore, there is a demand for reducing the amount of data without reducing the determination accuracy.