Numerous techniques have been proposed for deciding the identicalness or similarity of information such as documents or images. Among these, same document matching, a technique for deciding the identicalness of documents, is well known. Same document matching refers to a technique of grouping documents that are substantially the same. The term “substantially the same” refers to a condition in which two notationally different materials are judged to be identical when viewed by a human.
Same document matching is required in the following situations, for example:
(1) Redundant Record Matching for a Database (sometimes abbreviated as DB hereinbelow)
Redundant record matching for a DB refers to the grouping of substantially the same records in a DB. It is required, for example, in data cleaning when combining customer DBs that are managed by different people, in different places, or according to different methods and that therefore contain notational variations, and in the deletion of redundant inquiry cases in a contact center. When one document is regarded as one record, this can be considered a problem of same document matching.
(2) Topic Analysis
Topic analysis refers to the grouping of posted data such as blog posts, and is required in order to know when and where the same subject becomes a topic in blogs.
A same document matching system receives as input a set of documents of interest and a similarity threshold, which serves as the condition under which documents are regarded as substantially the same, and outputs same document groups. For example, a case as shown in FIG. 1(a) will be described, where five documents and a similarity threshold of 90% are input. In this case, every document is composed of ten alphabetical characters, and a similarity of 90% between document x and document y means that nine of the ten characters in x are in common with those in y. The system then outputs pairs of two different documents having a similarity equal to or greater than 90% as same document groups, as shown in FIG. 1(b). Moreover, document pairs that contain a common document may be combined to form a same document group, as shown in FIG. 1(c).
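The input/output behavior described above can be illustrated with a minimal Python sketch. The documents below are hypothetical and are not those of FIG. 1, and the similarity measure (the fraction of positions at which two equal-length strings carry the same character) is only one assumed reading of “characters present in common.” The function names `similarity`, `same_pairs`, and `merge_pairs` are likewise illustrative.

```python
from itertools import combinations

def similarity(x: str, y: str) -> float:
    """Assumed measure: fraction of positions where x and y agree."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes equal-length documents")
    return sum(a == b for a, b in zip(x, y)) / len(x)

def same_pairs(docs, threshold):
    """Index pairs (i, j), i < j, whose similarity meets the threshold."""
    return [(i, j) for i, j in combinations(range(len(docs)), 2)
            if similarity(docs[i], docs[j]) >= threshold]

def merge_pairs(n_docs, pairs):
    """Combine pairs that share a document into groups (union-find)."""
    parent = list(range(n_docs))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in pairs:
        parent[find(i)] = find(j)
    groups = {}
    for i in range(n_docs):
        groups.setdefault(find(i), []).append(i)
    # Documents that match nothing are not reported as a group.
    return [g for g in groups.values() if len(g) > 1]

# Hypothetical ten-character documents (not the ones in FIG. 1).
docs = ["abcdefghij", "abcdefghiX", "abcdefghYj", "qrstuvwxyz", "qrstuvwxyA"]
pairs = same_pairs(docs, 0.9)            # pair-wise output, in the manner of FIG. 1(b)
groups = merge_pairs(len(docs), pairs)   # combined output, in the manner of FIG. 1(c)
print(pairs)   # [(0, 1), (0, 2), (3, 4)]
print(groups)  # [[0, 1, 2], [3, 4]]
```

In this sketch, documents 1 and 2 are only 80% similar to each other, yet they fall into the same group because each is 90% similar to document 0.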
One conventional technique for implementing a same document matching system employs hierarchical clustering (see Paragraph 2.5.7 in Non-patent Document 1). The method calculates a similarity for all document pairs beforehand (Step 1). Next, the document pairs are sequentially combined starting from a pair having the highest similarity to thereby hierarchically group the documents (Step 2). The same document matching system can provide same document groups by calculating similarities of all pairs of two different documents as in Step 1, and thereafter, leaving only document pairs having a similarity equal to or greater than a similarity threshold.
In the example of FIG. 1, the number of all pairs of two different documents is 5*(5−1)/2=10, and hence, the similarity is calculated ten times to output the result shown in FIG. 1(b) or (c).
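The count above follows from the general expression n(n−1)/2 for the number of pairs of two different documents among n documents. A short Python sketch, in which the larger collection sizes are hypothetical values chosen only for illustration, checks the expression against an explicit enumeration and shows how the number of similarity calculations grows:

```python
from itertools import combinations

def pair_count(n: int) -> int:
    """Number of pairs of two different documents among n documents."""
    return n * (n - 1) // 2

# Cross-check the closed form against explicit enumeration for n = 5.
enumerated = sum(1 for _ in combinations(range(5), 2))
print(pair_count(5), enumerated)  # 10 10

# Hypothetical larger collections: the count grows quadratically.
for n in (1000, 1_000_000):
    print(n, pair_count(n))
# 1000 499500
# 1000000 499999500000
```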
Another conventional technique for implementing a same document matching system employs redundant record matching for DBs (see Non-patent Document 2). The method involves first sorting the records in the DBs, next performing similarity calculation on the record pairs formed by each sorted record and each of the ‘n’ records preceding it, and defining each record pair having a similarity equal to or greater than a threshold as redundant records.
A similar technique can be applied to the same document matching system by substituting documents for records. For example, sorting the documents in FIG. 1(a) by character string results in FIG. 2(a). Next, similarity calculation is applied to the document pair formed by each document and the one document preceding it, so that the similarity is calculated four times, and a result shown in FIG. 2(b) or (c) is output.
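The sort-and-compare procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the documents are hypothetical (not those of FIG. 1 or FIG. 2), the similarity measure is an assumed position-wise comparison of equal-length strings, and `sorted_neighborhood` with its window parameter `n` is an illustrative name, not an interface taken from Non-patent Document 2.

```python
def similarity(x: str, y: str) -> float:
    """Assumed measure: fraction of positions where x and y agree."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes equal-length documents")
    return sum(a == b for a, b in zip(x, y)) / len(x)

def sorted_neighborhood(docs, threshold, n=1):
    """Compare each sorted document with the n documents preceding it."""
    order = sorted(range(len(docs)), key=lambda i: docs[i])
    matches, comparisons = [], 0
    for pos, i in enumerate(order):
        for j in order[max(0, pos - n):pos]:
            comparisons += 1
            if similarity(docs[i], docs[j]) >= threshold:
                matches.append(tuple(sorted((i, j))))
    return matches, comparisons

# Hypothetical ten-character documents, indexed 0 to 4.
docs = ["abcdefghij", "abcdefghiX", "abcdefghYj", "qrstuvwxyz", "qrstuvwxyA"]

matches, comparisons = sorted_neighborhood(docs, 0.9, n=1)
print(comparisons)  # 4
print(matches)      # [(0, 1), (3, 4)]
```

With five documents and n = 1, the similarity is calculated four times, matching the count given above. In this hypothetical data, documents 0 and 2 are 90% similar but are not adjacent after sorting, so that pair is not compared when n = 1; calling the function with n = 2 compares it at the cost of seven calculations.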
Moreover, still another conventional technique for implementing a same document matching system employs K-means (see Paragraph 5.2 in Non-patent Document 3). The method is premised on dividing a set of documents into K groups. Based on that premise, K randomly selected documents are assumed to serve as the centers of the respective groups, and each of the remaining documents is classified into the group whose center document it is most similar to.
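The assignment step described above can be sketched in Python as follows. This is a minimal illustration, not the full procedure of Non-patent Document 3: it performs only the single assignment pass described here, the documents and the position-wise similarity measure are assumptions, and the function name `assign_to_centers` and its parameters are hypothetical. The optional `threshold` argument leaves a document unassigned when even its most similar center falls below the given similarity, and the optional `centers` argument fixes the centers for reproducibility instead of drawing them at random.

```python
import random

def similarity(x: str, y: str) -> float:
    """Assumed measure: fraction of positions where x and y agree."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes equal-length documents")
    return sum(a == b for a, b in zip(x, y)) / len(x)

def assign_to_centers(docs, k, threshold=None, centers=None, seed=0):
    """Assign each non-center document to its most similar center document."""
    if centers is None:
        # K randomly selected documents serve as group centers.
        centers = random.Random(seed).sample(range(len(docs)), k)
    groups = {c: [c] for c in centers}
    unassigned = []
    for i, doc in enumerate(docs):
        if i in groups:
            continue  # center documents already belong to their own group
        best = max(centers, key=lambda c: similarity(doc, docs[c]))
        if threshold is not None and similarity(doc, docs[best]) < threshold:
            unassigned.append(i)
        else:
            groups[best].append(i)
    return groups, unassigned

# Hypothetical ten-character documents, indexed 0 to 4.
docs = ["abcdefghij", "abcdefghiX", "abcdefghYj", "qrstuvwxyz", "qrstuvwxyA"]

print(assign_to_centers(docs, 2, centers=[0, 3]))
# ({0: [0, 1, 2], 3: [3, 4]}, [])
print(assign_to_centers(docs, 2, threshold=0.9, centers=[1, 3]))
# ({1: [1, 0], 3: [3, 4]}, [2])
```

The second call illustrates that the outcome depends on which documents happen to be chosen as centers: with document 1 as a center, document 2 is only 80% similar to it and is left unassigned under a 90% threshold.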
A similar technique can be applied to the same document matching system by imposing a similarity threshold restriction on K-means. Specifically, assuming that K randomly selected documents serve as the centers of the respective groups, each of the remaining documents may be classified into the group whose center document it is most similar to, provided that the similarity is equal to or greater than the threshold.
Non-patent Document 1: Takenobu TOKUNAGA, “Languages and Computations, Vol. 5, Information Retrieval and Text Processing,” University of Tokyo Press.
Non-patent Document 2: Mauricio A. Hernandez and Salvatore J. Stolfo, “The Merge/Purge Problem for Large Databases,” Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, pp. 127-138, 1995.
Non-patent Document 3: Jain, A. K., Murty M. N., and Flynn P. J., “Data Clustering: A Review,” ACM Computing Surveys, Vol. 31, No. 3, pp. 264-323, 1999.