For some databases, raw input is collected from a variety of heterogeneous data sources, such that a single real-world entity (such as a person or a product) may be represented by multiple input records. In such scenarios, the detection and elimination of redundant information may be required for various applications. The same information can legitimately be represented in several different ways: for example, one record referring to a given individual may use a shortened version of a name (“Dan” or “Danny”), while another uses the full version (“Daniel”); addresses may be represented differently in the two records (e.g., “South First Street” versus “S. 1st St.”), and so on. Even with today's fast computing cores and large memories, comparing all possible pairs of records in a large data set to identify duplicates may be intractable, because the number of pairs grows quadratically with the number of records. Identifying, within large data sets, sub-groups or blocks of similar records on which similarity-based redundancy elimination can be performed in reasonable timeframes remains a non-trivial technical challenge.
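By way of illustration only, the contrast between all-pairs comparison and comparison within blocks can be sketched as follows. This is a minimal sketch under stated assumptions, not the technique of any particular embodiment: the record fields (`name`, `zip`), the helper names, and the blocking key (first letter of the name plus the postal code) are all hypothetical and chosen only to make the arithmetic concrete.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs_count(n):
    """Number of unordered record pairs in a data set of n records."""
    return n * (n - 1) // 2


def blocking_key(record):
    # Hypothetical key (an assumption for illustration): first letter of the
    # name, lower-cased, plus the postal code.
    return record["name"][:1].lower() + "|" + record["zip"]


def candidate_pairs(records):
    """Return index pairs of records that share a blocking key."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec)].append(idx)
    pairs = []
    for members in blocks.values():
        # Only records inside the same block are compared with each other.
        pairs.extend(combinations(members, 2))
    return pairs


# Hypothetical sample records; three of them plausibly refer to one person.
records = [
    {"name": "Dan Smith", "zip": "98101"},
    {"name": "Daniel Smith", "zip": "98101"},
    {"name": "Maria Lopez", "zip": "10001"},
    {"name": "Danny Smith", "zip": "98101"},
]

print(all_pairs_count(len(records)))   # 6 pairs without blocking
print(len(candidate_pairs(records)))   # 3 pairs with blocking
print(all_pairs_count(1_000_000))      # 499999500000 pairs for a million records
```

In this sketch, blocking reduces the comparisons for the four sample records from six to three, while an unblocked comparison over one million records would require roughly 5 × 10^11 pairwise comparisons. A key this simple would miss duplicates whose names begin with different letters or whose postal codes differ, which is one reason choosing good blocks is non-trivial.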
While embodiments are described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that embodiments are not limited to the embodiments or drawings described. It should be understood that the drawings and detailed description thereto are not intended to limit embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include,” “including,” and “includes” mean including, but not limited to. When used in the claims, the term “or” is used as an inclusive or and not as an exclusive or. For example, the phrase “at least one of x, y, or z” means any one of x, y, and z, as well as any combination thereof.