Conventionally, to prevent leakage of confidential or private information through the transmission of electronic mail or the like, it has become important that, for example, a mail server in a company determine the similarity between a transmission document and an archived confidential document before the transmission document is sent outside the company, and refrain from transmitting the transmission document outside the company if the two documents are similar.
As a method for determining similarity between documents, there is disclosed a technology that computes hash values for the sentences constituting each of a transmission document and a confidential document, and determines similarity by checking whether the transmission document contains a sentence whose hash value is identical to that of a sentence constituting the confidential document.
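The sentence-level comparison described above can be illustrated with a minimal sketch. The function names, the choice of SHA-256, and the punctuation-based sentence splitter are assumptions made for illustration only; the cited technology does not specify them here.

```python
import hashlib
import re


def sentence_hashes(text):
    """Return the set of hash values of the sentences in text."""
    # Naive splitter (assumption): a sentence ends at '.', '!' or '?'.
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    return {hashlib.sha256(s.encode("utf-8")).hexdigest() for s in sentences}


def shares_sentence(transmission, confidential):
    """True if some sentence of the transmission document has the same
    hash value as a sentence of the confidential document."""
    return not sentence_hashes(transmission).isdisjoint(
        sentence_hashes(confidential)
    )


confidential_doc = "The new cell uses a lithium anode. Output is 3 volts."
leaking_mail = "Please see below. The new cell uses a lithium anode."
harmless_mail = "Lunch is at noon. See you there."

print(shares_sentence(leaking_mail, confidential_doc))
print(shares_sentence(harmless_mail, confidential_doc))
```

In this sketch a single identical sentence is enough to flag the transmission document; a practical system would presumably apply a threshold, which is not modeled here.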
Moreover, as a method for determining similarity between documents, there is disclosed a technology that determines similarity by checking whether a keyword included in a transmission document is included in a confidential document. For example, when concatenated keywords such as “alkaline battery”, “potassium battery”, or “fuel battery” are retrieved for a keyword “battery”, whether a concatenated keyword included in a transmission document is in a confidential document can be determined at high speed by using, as an index, the hash value of the keyword “battery” together with the character (“e”, “m”, or “l”) immediately before or after the keyword.
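The indexing idea above can be sketched as follows, under stated assumptions: the index key is taken to be the pair of a keyword hash and the last character of the word preceding the keyword, and the names `keyword_hash`, `index_keys`, and `concatenated_keyword_match` are hypothetical. Only the preceding character is modeled; the following character could be handled in the same way.

```python
import hashlib
import re


def keyword_hash(keyword):
    """Short hash value of the keyword (SHA-256 is an assumption)."""
    return hashlib.sha256(keyword.encode("utf-8")).hexdigest()[:8]


def index_keys(text, keyword):
    """Return the set of (keyword hash, preceding character) keys
    for every occurrence of keyword in text."""
    pattern = r"(\w)\s*" + re.escape(keyword)
    return {(keyword_hash(keyword), m.group(1))
            for m in re.finditer(pattern, text)}


def concatenated_keyword_match(transmission, confidential, keyword):
    """True if both documents contain the keyword with the same
    preceding character, i.e. share at least one index key."""
    return bool(index_keys(transmission, keyword)
                & index_keys(confidential, keyword))


confidential_doc = "The alkaline battery and the fuel battery were tested."
mail_a = "We will ship the fuel battery next week."
mail_b = "We sell a potassium battery."

print(concatenated_keyword_match(mail_a, confidential_doc, "battery"))
print(concatenated_keyword_match(mail_b, confidential_doc, "battery"))
```

Because the key holds only one adjacent character, a hit in this sketch is a candidate rather than a confirmed match: any word ending in “l” followed by “battery” would produce the same key as “fuel battery”.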
Furthermore, a signature, which indicates the feature elements of a document, can be used to determine whether or not a keyword is in the document. A signature is, for example, data obtained by performing a logical sum (OR) on the bit streams obtained from a plurality of keywords in a confidential document. When the result of performing a logical product (AND) on the signature and the bit stream of a keyword included in the transmission document is not a zero vector, it is determined that the keyword may be included in the confidential document. In addition, the signature can be computed from a character string obtained by adding a peripheral character string of a keyword to the keyword.
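A minimal sketch of such a signature follows. The 64-bit width, the three bits set per keyword, and the use of SHA-256 to derive the bit stream are assumptions for illustration; the membership test follows the description above, in which any non-zero logical product counts as a possible match.

```python
import hashlib

SIGNATURE_BITS = 64    # assumed signature width
BITS_PER_KEYWORD = 3   # assumed number of bits derived from each keyword


def keyword_bits(keyword):
    """Derive the bit stream of a keyword as an integer bit vector."""
    digest = hashlib.sha256(keyword.encode("utf-8")).digest()
    bits = 0
    for i in range(BITS_PER_KEYWORD):
        bits |= 1 << (digest[i] % SIGNATURE_BITS)
    return bits


def make_signature(keywords):
    """Logical sum (OR) of the bit streams of all keywords."""
    signature = 0
    for keyword in keywords:
        signature |= keyword_bits(keyword)
    return signature


def may_contain(signature, keyword):
    """Logical product (AND) test: a non-zero result means the keyword
    may be present; a zero result means it is certainly absent."""
    return (signature & keyword_bits(keyword)) != 0


signature = make_signature(["lithium", "anode", "battery"])
print(may_contain(signature, "battery"))
```

A keyword that was used to build the signature always passes the test. A keyword that was not used may also pass whenever one of its bits happens to collide with a bit already set, which is why the result is only "may be included".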
Such conventional similarity determination methods have been known as disclosed in, for example, Japanese Laid-open Patent Publication No. 2006-065837, Japanese Laid-open Patent Publication No. 2005-234930, and U.S. Patent Application Publication 2006/0253438.
When a transmission document is a document that is obtained by revising the structure of sentences inside a confidential document or by modifying particles inside the confidential document, there is a problem in that a similarity between the transmission document and the confidential document cannot be determined with high accuracy.
For example, when the similarity between a transmission document and a confidential document is determined sentence by sentence, the hash values of the sentences change if the transmission document is obtained by dividing sentences of the confidential document, by merging a plurality of sentences of the confidential document, or by modifying particles inside the confidential document. In such cases, the similarity between the transmission document and the confidential document cannot be determined.
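This failure mode can be reproduced with a small sketch. The splitter and hash function are the same kind of illustrative assumptions as before, and the example sentences are invented for the demonstration.

```python
import hashlib
import re


def sentence_hashes(text):
    """Return the set of SHA-256 hash values of the sentences in text."""
    # Naive splitter (assumption): a sentence ends at '.', '!' or '?'.
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    return {hashlib.sha256(s.encode("utf-8")).hexdigest() for s in sentences}


confidential_doc = "The cell uses a lithium anode and outputs 3 volts."
# Same content, but the single sentence has been divided into two.
divided_doc = "The cell uses a lithium anode. It outputs 3 volts."

common = sentence_hashes(confidential_doc) & sentence_hashes(divided_doc)
print(len(common))
```

Although the two texts convey the same information, no sentence string is shared between them, so no hash value is shared and the sentence-level comparison finds nothing in common.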
Moreover, when the similarity between a transmission document and a confidential document is determined by retrieving a keyword, it can be determined whether a keyword included in the transmission document is perfectly identical to a keyword inside the confidential document. However, the presence of identical keywords alone does not allow the similarity between the documents to be determined with high accuracy.
Furthermore, when the similarity between a transmission document and a confidential document is determined by using a signature, it can be determined whether keywords inside the transmission document may be in the confidential document. However, because the determination depends only on the presence or absence of any of the bits constituting the keywords, the two documents can be mistakenly determined to be similar even if the keywords included in the transmission document appear only scattered through the confidential document in a completely different context.
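The context-blindness of a signature can be seen in the following sketch. It assumes, for illustration only, that every word of the document contributes one bit to a 64-bit signature; the function names and example texts are hypothetical.

```python
import hashlib


def word_bits(word, width=64):
    """Map a word to a single bit of a width-bit vector (assumption)."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    return 1 << (digest[0] % width)


def make_signature(text):
    """Logical sum (OR) of the bits of every word in text."""
    signature = 0
    for word in text.split():
        signature |= word_bits(word.strip(".,"))
    return signature


confidential_doc = "The prototype battery reached full charge in ten minutes."
# Keywords taken from an unrelated transmission document, for example
# "The cavalry charge reached the hill within minutes."
transmission_keywords = ["charge", "reached", "minutes"]

signature = make_signature(confidential_doc)
flagged = all(signature & word_bits(k) for k in transmission_keywords)
print(flagged)
```

Every keyword passes the test because each word occurs somewhere in the confidential document, yet the signature records nothing about word order or surrounding text, so the unrelated document is treated the same as a genuine copy.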