Computing similarity between sets is critical for managing and sorting through massive amounts of data. This data can come from multiple sources, some of which overlap. Current methods and devices for sorting rely on the Jaccard index as a basis for determining similarity, but computing the Jaccard index requires knowledge of the intersection of the two sets, a quantity that is not automatically known. There exists a need to measure similarity between sets without knowing the intersection of the data sets. The present invention addresses that need.
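By way of illustration only, the dependence of the Jaccard index on the intersection can be seen from its standard definition, J(A, B) = |A ∩ B| / |A ∪ B|. The following is a minimal sketch of that conventional computation, not of the present invention; the function name `jaccard_index`, the example sets, and the convention of returning 1.0 for two empty sets are assumptions made for this example.

```python
def jaccard_index(a: set, b: set) -> float:
    """Conventional Jaccard index |A ∩ B| / |A ∪ B| of two finite sets."""
    union_size = len(a | b)
    if union_size == 0:
        # Assumed convention: two empty sets are treated as identical.
        return 1.0
    # The numerator is the size of the intersection, which must be
    # computed (or otherwise known) before the index can be evaluated.
    return len(a & b) / union_size


# Hypothetical data from two overlapping sources.
source_one = {"alice", "bob", "carol", "dave"}
source_two = {"carol", "dave", "erin"}

similarity = jaccard_index(source_one, source_two)
print(similarity)
```

In this sketch both sets must be available in full so that the intersection can be formed, which is the requirement noted above.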
U.S. Pat. No. 6,240,409, entitled “METHOD AND APPARATUS FOR DETECTING AND SUMMARIZING DOCUMENT SIMILARITY WITHIN LARGE DOCUMENT SETS,” discloses a method for comparing an input file to a set of files. The comparison is achieved by splitting the input file into substrings and comparing them to substrings from the set of files. U.S. Pat. No. 6,240,409 is hereby incorporated by reference into the present specification.
U.S. Pat. No. 5,953,006, entitled “METHODS AND APPARATUS FOR DETECTING AND DISPLAYING SIMILARITIES IN LARGE DATA SETS,” discloses a method for determining similarities between sets using dotplots. These dotplots graphically display how similar the different items in the sets are. U.S. Pat. No. 5,953,006 is hereby incorporated by reference into the present specification.
U.S. Pat. No. 7,260,773, entitled “DEVICE SYSTEM AND METHOD FOR DETERMINING DOCUMENT SIMILARITIES AND DIFFERENCES,” discloses a method to determine the similarity between sets of documents by dividing each document into subsections. The subsections are then compared to determine similarity. U.S. Pat. No. 7,260,773 is hereby incorporated by reference into the present specification.