A typical business requirement for a data management system is to identify potential duplicate entries in a database and, for each pair of potential duplicates, to provide a score indicating the likelihood that the two records do in fact represent the same entity. Known algorithms for identifying potential duplicate database entries include character-based difference-scoring algorithms (e.g., Jaro-Winkler and Levenshtein) and knowledge-based semantic comparisons. Regardless of which algorithm is selected, however, it is typically not feasible, for performance reasons, to directly compare every record in a database to every other record in the same database. Such an exhaustive comparison would require on the order of n² non-trivial operations, where n is the number of records in the database. For example, for a database of 100 million records, a system would need to perform on the order of 10 quadrillion operations to compare every record to every other record. A typical database management system cannot complete this number of operations in a practical amount of time, such as during an overnight run. Many systems therefore use an optimized method to identify small clusters of potential duplicate records, and then compare records only within those clusters. Known clustering methods include alphabetic grouping, hashing, and matchcode clustering.
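The cluster-then-compare approach described above can be illustrated with a minimal Python sketch. It is not drawn from any particular system: the first-character `cluster_key` is a deliberately crude, hypothetical stand-in for alphabetic grouping, the normalized Levenshtein `similarity` is only one possible scoring function, and the 0.8 threshold and the sample names are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Score in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def cluster_key(name: str) -> str:
    # Hypothetical alphabetic grouping: case-folded first character.
    return name[:1].casefold()


def candidate_pairs(names):
    """Yield index pairs (i, j) only for records sharing a cluster key."""
    clusters = defaultdict(list)
    for idx, name in enumerate(names):
        clusters[cluster_key(name)].append(idx)
    for members in clusters.values():
        yield from combinations(members, 2)


def score_duplicates(names, threshold=0.8):
    """Return (i, j, score) for within-cluster pairs at or above threshold."""
    results = []
    for i, j in candidate_pairs(names):
        score = similarity(names[i].casefold(), names[j].casefold())
        if score >= threshold:
            results.append((i, j, score))
    return results


names = ["Jon Smith", "John Smith", "Jane Doe", "Mary Major", "Jane Do"]
for i, j, score in score_duplicates(names):
    print(f"{names[i]!r} ~ {names[j]!r}: {score:.3f}")
```

With these five sample names, the sketch scores 6 within-cluster pairs rather than all 10 distinct pairs. The saving grows as the clusters become smaller relative to the database, but any duplicates that fall into different clusters (for example, names that differ in their first character) are never compared, which is the trade-off inherent in this kind of clustering.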