Joining massive datasets based on similarity lies at the core of many important problems. For example, one important problem within the field of information retrieval is data mining, which seeks to identify patterns among collections of items such as documents, images, or other unstructured content. Generally, there is some criterion for measuring similarity between data members, and this criterion can be expressed as a mathematical formula. In the general case, we have two massive datasets, and we want to “join” them to identify pairs or clusters in which at least one member of each dataset is similar to a member of the other dataset. An important special case is the “self-join,” in which duplicate, near-duplicate, or very similar items within a single dataset are identified. An important application lies in the emerging areas of content-addressable storage and intelligent file storage, where a target dataset is joined either against a reference collection or against itself to identify duplicates and near-duplicates. Although computers have become faster, storage more expansive, and content more varied, our ability to make effective sense of massive datasets has not kept pace. This presents a problem.
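To make the join and self-join described above concrete, the following is a minimal sketch, not the method of this document. It assumes Jaccard similarity over whitespace-separated tokens as the similarity criterion and a fixed threshold; the text specifies neither, and the function names (`jaccard`, `tokens`, `similarity_join`, `self_join`) are hypothetical. The join is deliberately brute-force: it compares every pair, so its cost grows with the product of the dataset sizes, which is the scaling difficulty the paragraph alludes to.

```python
from itertools import combinations


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |a & b| / |a | b| (assumed criterion, not from the source)."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty items count as identical
    return len(a & b) / len(a | b)


def tokens(text: str) -> frozenset:
    """Represent an item as its set of lowercase whitespace-separated tokens."""
    return frozenset(text.lower().split())


def similarity_join(left, right, threshold):
    """Return index pairs (i, j) where left[i] and right[j] meet the threshold.

    Brute force: every left item is compared with every right item.
    """
    left_sets = [tokens(x) for x in left]
    right_sets = [tokens(y) for y in right]
    return [
        (i, j)
        for i, a in enumerate(left_sets)
        for j, b in enumerate(right_sets)
        if jaccard(a, b) >= threshold
    ]


def self_join(items, threshold):
    """Return index pairs (i, j), i < j, of similar items within one dataset."""
    sets = [tokens(x) for x in items]
    return [
        (i, j)
        for i, j in combinations(range(len(sets)), 2)
        if jaccard(sets[i], sets[j]) >= threshold
    ]


if __name__ == "__main__":
    target = ["the quick brown fox", "hello world"]
    reference = ["the quick brown dog", "goodbye world", "hello world"]
    print(similarity_join(target, reference, 0.5))
    print(self_join(["a b c", "a b c", "a b d", "x y"], 0.5))
```

With a threshold of 0.5, the first call pairs each target item with its near-duplicate or exact duplicate in the reference collection, and the second call reports the duplicate and near-duplicate pairs within a single collection.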