The following description includes information that may be useful in understanding the present invention. It is not an admission that any of the information provided herein is prior art or relevant to the presently claimed invention, or that any publication specifically or implicitly referenced is prior art.
It is a common goal for data processors to remove duplicate records from a database of records (e.g., customers' contact information), as duplicate records provide inaccurate information, and can result in wasted mailing costs and customer dissatisfaction.
In the past, duplicate records were uncovered using a “brute force” algorithm, where each record is compared to every other record in a database. For example, a database having ten records would require 45 comparisons. Adding an additional record to the database would require ten additional comparisons, and adding another record would require eleven additional comparisons, and so forth. This can be approximated in big O notation as:
$$O\left(\frac{n^2}{2}\right)$$
That is, for an input of n records, the time required for processing is proportional to $n^2$. Although comparisons can be done very quickly with today's computers, the sheer number of comparisons required even for small databases (e.g., one million records) can easily exceed practical time spans.
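By way of illustration only, the comparison count described above can be sketched in Python. The function names, the record layout, and the match rule (`same_name_and_zip`) are hypothetical and are assumed here solely to make the sketch runnable; they are not taken from any referenced publication.

```python
from itertools import combinations


def brute_force_pairs(records, is_match):
    """Compare every record with every other record exactly once.

    Returns the list of matching index pairs and the number of
    comparisons performed, which is n * (n - 1) / 2 for n records.
    """
    matches = []
    comparisons = 0
    for a, b in combinations(range(len(records)), 2):
        comparisons += 1
        if is_match(records[a], records[b]):
            matches.append((a, b))
    return matches, comparisons


def same_name_and_zip(r1, r2):
    # Hypothetical match rule, assumed for illustration only.
    return r1["last"].lower() == r2["last"].lower() and r1["zip"] == r2["zip"]


# Nine distinct records plus one duplicate of record 0 (differing only in case).
records = [{"last": f"Name{i}", "zip": "90210"} for i in range(9)]
records.append({"last": "name0", "zip": "90210"})

matches, comparisons = brute_force_pairs(records, same_name_and_zip)
print(comparisons)  # 45 comparisons for ten records
print(matches)      # [(0, 9)]
```

Appending an eleventh record to `records` raises the count to 55, and a twelfth raises it to 66, consistent with the ten and eleven additional comparisons noted above.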
Because of the breadth of the data often contained in a record, it is common practice to use only a subset of a database's fields. Common field types used for matching include: first name, last name, street address, phone number, company, and so forth. In addition, to reduce the amount of processing time required, it is known to first create subsets of records that share a certain attribute. For example, a database of records could be divided by the first digit of each record's zip code, creating 10 subsets. Each record in a subset is then compared to every other record in that subset using a “brute force” algorithm. For a database divided into m evenly-sized subsets, each subset holds n/m records, and the total processing time is reduced to approximately:
$$O\left(\frac{n^2}{m}\right)$$
For large values of m, the time savings can be very significant. Although this process reduces processing time, the process is incomplete because records in one subset are not compared with records in other subsets. Thus, if a record in subset A were to match another record in subset B, the match would not be found.
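The subset approach, and the way it can miss a match, can likewise be sketched in Python. This is a minimal sketch under stated assumptions: the helper names (`blocked_pairs`, `first_zip_digit`, `same_phone`), the sample records, and the phone-number match rule are all hypothetical and serve only to illustrate the limitation described above.

```python
from collections import defaultdict
from itertools import combinations


def blocked_pairs(records, block_key, is_match):
    """Group records by block_key, then brute-force compare within each group.

    Records that fall into different groups are never compared.
    """
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec)].append(idx)

    matches = []
    comparisons = 0
    for members in blocks.values():
        for a, b in combinations(members, 2):
            comparisons += 1
            if is_match(records[a], records[b]):
                matches.append((a, b))
    return matches, comparisons


def first_zip_digit(rec):
    # Subset attribute from the example above: first digit of the zip code.
    return rec["zip"][0]


def same_phone(r1, r2):
    # Hypothetical match rule, assumed for illustration only.
    return r1["phone"] == r2["phone"]


people = [
    {"name": "A. Smith", "zip": "90210", "phone": "555-0100"},
    {"name": "Al Smith", "zip": "90211", "phone": "555-0100"},   # same subset as record 0
    {"name": "B. Jones", "zip": "10001", "phone": "555-0101"},
    {"name": "Bea Jones", "zip": "70001", "phone": "555-0101"},  # different subset from record 2
]

found, n_compared = blocked_pairs(people, first_zip_digit, same_phone)
print(n_compared)  # 1 comparison, versus 6 for brute force over all four records
print(found)       # [(0, 1)]; the pair (2, 3) is never compared
```

In this sketch, records 2 and 3 share a phone number but fall into different zip-code subsets, so the duplicate goes undetected even though far fewer comparisons are performed.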
Others have made efforts in the past to create methods of eliminating duplicated items in a database. U.S. Pat. No. 5,303,149 to Janigian, U.S. Pat. No. 5,799,302 to Johnson et al., U.S. Patent Publ. No. 2012/0290597 to Henzinger (publ. November 2012), and U.S. Patent Publ. No. 2013/0144847 to Spurlock (publ. June 2013) all incorporate the use of two different criteria to arrive at a final set of duplicated items. However, in these documents, a first criterion is applied to create a first subset, and then a second criterion is applied to the first subset to further narrow the first subset.
Additionally, U.S. Patent Publ. No. 2012/0296903 to Khan et al. (publ. November 2012) describes brute force comparison of items to check for duplication. Such a process, as described above, requires the checking of each item against every single other item, and is impractical for large numbers of records.
Various other processes of detecting duplicate records are described in the art. See, e.g., U.S. Pat. No. 6,374,241 to Lamburt et al.; U.S. Patent Publ. No. 2005/0273452 to Molloy et al. (publ. December 2005); U.S. Patent Publ. No. 2012/0059853 (publ. March 2012); WIPO Publ. No. 00/34897 to Bloodhound Software, Inc. (publ. June 2000); WIPO Publ. No. 2009/132263 to Lexis-Nexis Risk & Information Analytics Group, Inc. (publ. October 2009); U.S. Pat. No. 5,303,149 to Janigian; U.S. Pat. No. 5,799,302 to Johnson et al.; U.S. Pat. No. 8,554,742 to Naeymi-Rad et al.; U.S. Patent Publ. No. 2013/0144847 to Spurlock (publ. June 2013); U.S. Patent Publ. No. 2012/0209853 to Desai et al. (publ. August 2012); U.S. Patent Publ. No. 2012/0296903 to Khan et al. (publ. November 2012); U.S. Patent Publ. No. 2012/0290597 to Henzinger (publ. November 2012); and U.S. Pat. No. 8,046,372 to Thirumalai et al. However, all the processes known to Applicants are also incomplete and fail to appreciate the creation of intersecting sets to reduce the number of comparisons required to identify duplicate records.
These and all other extrinsic materials discussed herein are incorporated by reference in their entirety. Where a definition or use of a term in an incorporated reference is inconsistent or contrary to the definition of that term provided herein, the definition of that term provided herein applies and the definition of that term in the reference does not apply.
Thus, there is still a need for improved systems and methods for clustered matching of records.