Methods for evaluating the similarity between different documents are widely used, for example, to evaluate the similarity between scientific papers or to detect similarity between corporate documents. Patent documents 1 to 3 disclose document similarity determination systems.
In the document similarity determination systems disclosed in patent documents 1 and 2, an entire document is first separated page by page or split at each position at which a particular character string appears (hereinafter, one separated (or split) unit is referred to as a “segment”), and a characteristic value is calculated for each segment. The characteristic values of the segments are then compared in order from the first segment to the last segment of each document, and the similarity between the documents is determined based on the number of segments whose characteristic values are equal to each other. When that number is large, the similarity between the documents is determined to be high; conversely, when that number is small, the similarity is determined to be low.
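The segment-based comparison described above can be illustrated with a minimal Python sketch. It is not taken from patent documents 1 and 2 and rests on several assumptions: pages are taken to be delimited by form-feed characters, the characteristic value is taken to be a SHA-256 hash of the whitespace-normalized segment text, and the names split_segments, characteristic_value and count_matching_segments are hypothetical.

```python
import hashlib


def split_segments(text, marker=None):
    """Split a document into segments.

    Assumption: with no marker, pages are delimited by form feeds ("\f");
    otherwise the text is split at each occurrence of `marker`.
    Segments that contain only whitespace are dropped.
    """
    delimiter = "\f" if marker is None else marker
    return [part for part in text.split(delimiter) if part.strip()]


def characteristic_value(segment):
    """Return a characteristic value for one segment.

    Assumption: a SHA-256 digest of the whitespace-normalized text;
    the source does not specify how the value is computed.
    """
    normalized = " ".join(segment.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def count_matching_segments(doc_a, doc_b, marker=None):
    """Count segments whose characteristic values are equal.

    Segments are compared in order, first to last; comparison stops
    at the end of the shorter document.
    """
    values_a = [characteristic_value(s) for s in split_segments(doc_a, marker)]
    values_b = [characteristic_value(s) for s in split_segments(doc_b, marker)]
    return sum(1 for a, b in zip(values_a, values_b) if a == b)


if __name__ == "__main__":
    first = "Introduction\fMethod\fResults"
    second = "Introduction\fMethod (revised)\fResults"
    print(count_matching_segments(first, second))
```

In this sketch a larger count corresponds to a higher similarity. Because the comparison is positional, a single inserted page shifts every following segment, and the count then drops even though most of the content is unchanged.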
In the document similarity determination system disclosed in patent document 3, figures and equations that exist in a document are separated from the text, a degree of congestion is defined with respect to the layout of the separated figures and equations, and the degree of congestion is used as an index for determining the similarity.
[Patent Documents]
[patent document 1] Japanese Patent Application Laid-Open No. 2008-257444
[patent document 2] Japanese Patent Application Laid-Open No. 2010-256951
[patent document 3] International Publication No. WO 2009/048149