Duplicate detection, followed by elimination or flagging of the duplicates, is useful in many contexts. Systems that use a data collection as a basis for decision making suffer from bloat and loss of accuracy when the collection contains duplicate data samples. For instance, a categorization-by-example system, which attempts to code data samples in a manner similar to one or more data samples in a training set, can be skewed by the effective doubling (or worse) of the influence measures when duplicate samples appear in the training set. This results in inaccurate coding. Furthermore, duplicate samples in the training set often represent wasted space in the system's index of the data collection, which may be significant when the collection contains millions of samples. Other systems may track duplicate entries rather than remove them, and benefit from accurate identification of duplicate or near-duplicate data samples. For instance, a litigation support system may manage images of documents, documents processed by optical character recognition, or volumes of e-mail. Managing such a collection may include looking for duplicates to determine which parties received certain information. Thus, duplicate detection can be useful either to eliminate duplicates or to flag them.
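As a rough illustration of the skew described above, the following Python sketch assumes that a categorization-by-example system codes a query sample by a simple similarity-weighted vote over the training set. The voting scheme, the function names (weighted_vote, drop_exact_duplicates), the sample identifiers, and the similarity scores are all hypothetical and chosen only for illustration; they are not taken from any particular system.

```python
from collections import defaultdict

def weighted_vote(training_set, similarity):
    """Return (winning_code, totals) for one query sample.

    Each training sample adds its similarity score to the total for its code,
    so a sample that appears twice contributes twice.
    """
    totals = defaultdict(float)
    for sample_id, code in training_set:
        totals[code] += similarity[sample_id]
    winner = max(totals, key=totals.get)
    return winner, dict(totals)

def drop_exact_duplicates(training_set):
    """Keep the first occurrence of each (sample_id, code) pair, in order."""
    return list(dict.fromkeys(training_set))

# Hypothetical similarity of each training sample to the query being coded.
similarity = {"s1": 0.8, "s2": 0.5, "s3": 0.4}

# Hypothetical training set in which sample s1 was loaded twice.
with_duplicate = [("s1", "billing"), ("s1", "billing"),
                  ("s2", "account"), ("s3", "account")]
deduplicated = drop_exact_duplicates(with_duplicate)

skewed_winner, skewed_totals = weighted_vote(with_duplicate, similarity)
clean_winner, clean_totals = weighted_vote(deduplicated, similarity)

print(skewed_winner, skewed_totals)
print(clean_winner, clean_totals)
```

In this toy data the duplicated sample doubles the total for its code, so the query is coded "billing"; once the duplicate is removed, the two "account" samples together outweigh it and the query is coded "account". The sketch handles only exact duplicates. Near duplicates would require a similarity threshold, which is not shown here.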