The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also correspond to implementations of the claimed inventions.
Determining whether two objects are identical is straightforward, if often computationally intensive in aggregate, but it is far more useful to determine whether two objects are similar. A primary cost is the pairwise comparison of objects: the number of pairs grows quadratically with the number of documents, and so does the work required to compare a corpus. For example, approximately one third of all web pages have look-alike pages that are nearly identical to the identified page but differ in legal boilerplate and in header and footer details such as dates, organization titles and pagination. It is therefore useful and economically advantageous to assess similarity between objects using sampling techniques and comparisons, for example for deduplication of files and for plagiarism detection. In other applications, such as entity resolution, the goal is to find records of people that differ only by missing or added middle initials or names in otherwise matching data sets.
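By way of illustration only, the quadratic cost of exhaustive pairwise comparison can be sketched as follows. The sketch assumes that similarity is measured as the Jaccard similarity between sets of overlapping word windows ("shingles"); that measure, the window length `k`, and the function names `shingles`, `jaccard`, `pairwise_similarity` and `comparison_count` are hypothetical choices made for this example and are not drawn from the subject matter described above.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Too short for a full window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0  # two empty objects are treated as identical
    return len(a & b) / len(a | b)


def pairwise_similarity(docs, k=3):
    """Compare every unordered pair of documents (illustrative, brute force)."""
    sets = {name: shingles(text, k) for name, text in docs.items()}
    return {
        (x, y): jaccard(sets[x], sets[y])
        for x, y in combinations(sorted(sets), 2)
    }


def comparison_count(n):
    """Number of unordered pairs among n objects: n * (n - 1) / 2."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    docs = {
        "a": "the quick brown fox jumps",
        "b": "the quick brown fox leaps",
        "c": "completely unrelated text here now",
    }
    for pair, score in pairwise_similarity(docs).items():
        print(pair, score)
    print(comparison_count(1000))
```

Under these assumptions, a corpus of 1,000 documents already requires 499,500 comparisons, and doubling the corpus roughly quadruples that number, which is why sampling techniques are attractive in place of exhaustive comparison.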
Computing approximate similarity between very large files is a common task with many data management and information retrieval applications.