1. Field
The subject matter disclosed herein relates generally to data processing, and more specifically to methods, apparatuses, and/or systems for use in distributed computing environments.
2. Information
Information held in digital form, such as information in the form of binary digital signals, is continually being generated or otherwise identified, collected, or stored. Due in part to the vast amount of information available, it may be desirable to identify similarities or dissimilarities among this information. This may be useful, for example, for information mining, information integration, information cleaning, or other applications or purposes. While various ways exist to identify similar or dissimilar information, a common approach employs a technique called “similarity join”. A similarity join may compare records to identify similarities or dissimilarities. As just one example, a record pair may be considered similar if a similarity function used by the similarity join returns a value that is greater than a threshold.
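As a non-limiting illustration of the threshold test described above, the following minimal sketch compares every pair of records and keeps those whose similarity value is greater than a threshold. The choice of Jaccard similarity over word tokens, the function names `jaccard` and `similarity_join`, and the sample records are assumptions made for illustration only; other similarity functions, tokenizations, or thresholds may be used.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined here as 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def similarity_join(records: dict, threshold: float) -> list:
    """Return pairs of record ids whose similarity is greater than threshold."""
    # Tokenize each record into a set of lowercase words (an illustrative choice).
    tokens = {rid: set(text.lower().split()) for rid, text in records.items()}
    pairs = []
    # Naive all-pairs comparison: quadratic in the number of records.
    for left, right in combinations(sorted(tokens), 2):
        if jaccard(tokens[left], tokens[right]) > threshold:
            pairs.append((left, right))
    return pairs


# Hypothetical sample records, used only to exercise the sketch.
records = {
    "r1": "distributed similarity join",
    "r2": "similarity join distributed systems",
    "r3": "binary digital signals",
}
print(similarity_join(records, threshold=0.5))  # prints [('r1', 'r2')]
```

In this sketch, records r1 and r2 share three of four distinct tokens, giving a similarity value of 0.75, which is greater than the example threshold of 0.5. Because every record is compared against every other record, the number of comparisons grows quadratically with the number of records.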
While some similarity join techniques may provide a reasonable approach to identify similar or dissimilar information, in certain situations such techniques may prove less desirable. For instance, some similarity join techniques may be less feasible or less efficient for determining similarity or dissimilarity across large amounts of information. In addition, one or more optimizations may be performed during some similarity join techniques to prune false positive candidate pairs, and some of these optimizations may themselves introduce further challenges. Thus, it may be useful to employ methods and/or systems that identify similar or dissimilar information, and/or prune false positive candidate pairs, in a more efficient or effective manner.