Hereinafter, in a method and apparatus that classify a plurality of data items according to some criterion, the process of grouping substantially the same data is referred to as “duplicate record verification”. Here, “substantially the same” refers to a relationship such as that between “NEC” and “enu-i-shi”, or between “1-chome 2-banchi” and “1-2”: two data items that do not coincide exactly as digital data because of Japanese orthographic variation, but that a human can determine to be the same. When records are checked for duplication within a single database or across a plurality of databases, checking every combination causes the number of combinations to grow explosively as the number of records increases, so that the duplication check requires an enormous amount of processing time.
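By way of illustration only, the growth in the number of combinations can be made concrete. The following minimal sketch assumes that "all combinations" means every unordered pair of records, so that n records give n(n - 1)/2 pairs; the function name `pair_count` is hypothetical and is not taken from the cited documents.

```python
def pair_count(n: int) -> int:
    """Number of unordered record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    # The pair count grows quadratically with the number of records.
    for n in (1_000, 10_000, 100_000):
        print(f"{n:>7} records -> {pair_count(n):>13,} pairs")
```

Under this assumption, 100,000 records already give about five billion pairs, which is why an exhaustive check becomes impractical.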
In a conventional approach, duplicate record verification is performed by combining two methods: a rough narrowing-down method, which has lower accuracy but a light processing load, and a detailed narrowing-down method, which has high accuracy but a heavy processing load.
For example, when a duplication check is carried out on 100,000 records, the rough narrowing-down method is applied first to divide the 100,000 records into a large number of blocks, each containing several thousand to several tens of thousands of records, and the detailed narrowing-down method is then applied to each block.
Known techniques for the rough narrowing-down method include the Sorted Neighborhood Method (disclosed in Non-Patent Document 1).
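By way of illustration only, the general idea of the Sorted Neighborhood Method can be sketched as follows: a sorting key is derived from each record, the records are sorted by that key, and only records that fall within a sliding window of fixed size are paired for later comparison. The function name, the key function, and the window size below are assumptions chosen for the example, not details taken from Non-Patent Document 1.

```python
from typing import Callable, Iterator, Sequence, Tuple


def sorted_neighborhood_pairs(
    records: Sequence[str],
    key: Callable[[str], str],
    window: int,
) -> Iterator[Tuple[int, int]]:
    """Yield index pairs of records lying within `window` positions of each
    other after the records have been sorted by `key`."""
    if window < 2:
        raise ValueError("window must be at least 2")
    # Sort record indices by the derived key (the sort is stable).
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    for pos, i in enumerate(order):
        # Pair each record with the next (window - 1) records in sorted order.
        for j in order[pos + 1 : pos + window]:
            yield (min(i, j), max(i, j))


if __name__ == "__main__":
    names = ["NEC Corp", "Sony", "NEC Corporation", "Sonny", "Hitachi"]
    # Hypothetical key: the first three characters, upper-cased.
    candidates = list(sorted_neighborhood_pairs(names, lambda s: s[:3].upper(), 2))
    for i, j in candidates:
        print(names[i], "<->", names[j])
```

With a window of size w, the number of candidate pairs grows roughly in proportion to n × w rather than n², at the cost of missing duplicates whose keys sort far apart.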
Known techniques for the detailed narrowing-down method include a method using an edit distance (disclosed in Non-Patent Document 2). The edit distance measures the similarity between character strings. Similar known measures include a phonetic distance and an input (typewriter) distance.
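By way of illustration only, one common form of edit distance is the Levenshtein distance, which counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. The sketch below assumes unit cost for each operation; Non-Patent Document 2 may define or weight the distance differently.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, with unit cost for each
    insertion, deletion, and substitution (dynamic programming, two rows)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute ca -> cb (or match)
                )
            )
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(edit_distance("NEC", "NEC Corp"))
```

The cost of one comparison is proportional to the product of the two string lengths, which is one reason such measures are reserved for the detailed stage.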
In the detailed narrowing-down method, the similarity or distance between records is calculated and compared with a supplied threshold value, whereby the records are combined into groups. The grouped records may be presented to a human as duplicate record candidates for confirmation, or may be determined directly to be duplicate records. By combining the rough narrowing-down method and the detailed narrowing-down method in this way, the efficiency of duplicate record verification processing has been improved.
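By way of illustration only, grouping by a threshold can be sketched as follows. Two points are assumptions made for the example and are not stated in the cited documents: the similarity measure is the ratio computed by Python's standard `difflib.SequenceMatcher`, used here as a stand-in for an edit-distance-based measure, and groups are formed transitively, so that any two records linked by a chain of above-threshold pairs fall into the same group. The function names and the threshold value are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, List, Sequence


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


def group_by_threshold(records: Sequence[str], threshold: float) -> List[List[str]]:
    """Link every pair whose similarity is >= threshold and return the
    resulting connected groups (union-find over record indices)."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Within one block, every pair is compared (the detailed stage).
    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            parent[find(j)] = find(i)

    groups: Dict[int, List[str]] = {}
    for i, record in enumerate(records):
        groups.setdefault(find(i), []).append(record)
    return list(groups.values())


if __name__ == "__main__":
    block = ["1-chome 2-banchi", "1-chome 2-banch", "Tokyo"]
    for group in group_by_threshold(block, 0.8):
        print(group)
```

Each resulting group can then either be shown to a human as a set of duplicate record candidates or be treated directly as duplicates, depending on how strict the threshold is.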
Non-Patent Document 1: M. A. Hernandez and S. J. Stolfo, Real-world Data is Dirty: Data Cleansing and the Merge/Purge Problem, Journal of Data Mining and Knowledge Discovery, Vol. 1, 1998
Non-Patent Document 2: M. A. Hernandez and S. J. Stolfo, The Merge/Purge Problem for Large Databases, Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data (SIGMOD 1995), 1995, pages 127 to 138