One sometimes wishes to query a very large database with respect to an input data object, i.e., to identify other data objects within the database that are identical or very similar to the input data object. Identifying data objects that are identical to the input data object generally can be performed very quickly using straightforward techniques. Unfortunately, merely identifying data objects that are exactly identical often is still not very useful, because a data object with even a slight difference would not be detected.
As a result, certain techniques have been developed that are capable of identifying both exact and close matches fairly quickly. One such technique uses a locality sensitive hashing scheme called Min-Hash/Max-Hash, with the probability of a match being defined by the Jaccard similarity measure. Generally speaking, such an approach identifies matches with a probability that is equal to the degree of overlap between an input set and another set within the database (i.e., the size of the intersection of the two sets divided by the size of their union), so that the more information the two sets have in common, the more likely it is that the other set will be identified as a match.
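By way of illustration only, the min-hash half of such a scheme might be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from any particular implementation: the function names (`jaccard`, `minhash_signature`, `estimate_similarity`), the use of seeded SHA-256 as the family of hash functions, and the choice of 256 hash functions are all hypothetical. For any single hash function, the probability that two sets share the same minimum hash value equals their Jaccard similarity, so the fraction of agreeing signature positions serves as an estimate of that similarity.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard similarity: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention assumed here for two empty sets
    return len(a & b) / len(a | b)


def _seeded_hash(seed, item):
    """Hypothetical hash family: SHA-256 of 'seed:item', truncated to 64 bits."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=256):
    """One minimum hash value per seeded hash function."""
    items = set(items)
    if not items:
        raise ValueError("cannot compute a signature for an empty set")
    return [min(_seeded_hash(seed, x) for x in items)
            for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of positions at which the two signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    query = {"a", "b", "c", "d"}
    stored = {"c", "d", "e", "f"}
    print("exact Jaccard:", jaccard(query, stored))
    print("min-hash estimate:",
          estimate_similarity(minhash_signature(query),
                              minhash_signature(stored)))
```

In such a sketch the signatures of the stored sets would typically be computed once and indexed, so that a query requires hashing only the input set. Increasing `num_hashes` tightens the estimate at the cost of larger signatures.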
While the foregoing approach works well in certain situations, the present inventor has discovered that it is not even applicable in a number of other situations.