Determination of document similarity by computing devices is used to support a variety of functionality. In a recommendation scenario, for instance, a computing device may locate a document describing a product or service that has been purchased by a user, and then use that document to locate similar documents describing similar products or services. Similarity of these documents (e.g., product or service descriptions) may then be used as a basis to form recommendations by the computing device for the user. The same approach applies to other content, such as recommending one news article based on the content of another news article with which the user has interacted. Similar techniques may be used by the computing device in marketing scenarios to suggest similar advertisements, to find related items in a search context, to locate similar social network communications, and so forth. Thus, the uses for determinations of document similarity by a computing device may vary as greatly as what is described by the documents.
Conventional techniques used to determine document similarity are computing resource intensive, which limits availability of these techniques. In one conventional technique, a brute force approach is used by computing devices in which each document is compared to each other document to determine similarity. Therefore, even for only one thousand documents, the time complexity of such an approach is on the order of a million comparison operations by the computing devices. This is further complicated by the sparsity of data exhibited by the documents, since each document typically includes relatively few of the billions of available words in a human language. Comparing documents across each of these available words, most of which do not appear in any given document, may therefore consume a significant amount of resources of the computing devices. In instances of one hundred thousand documents, for instance, a forty node cluster of computing devices may take approximately six hours to compute similarity of the documents to each other. Since the time complexity is quadratic in the number of documents, this run time may quickly increase to days for even larger libraries of documents, such as descriptions of games and movies, which may number in the millions in typical online scenarios.
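The quadratic growth of the brute force approach described above can be illustrated with a minimal Python sketch. The text does not name a particular similarity measure or document representation, so cosine similarity over bag-of-words term counts is assumed here purely for illustration, and the function names (term_counts, cosine_similarity, brute_force_similarities, pair_count) are hypothetical. Storing only the terms that actually occur in a document is one common way to cope with the sparsity noted above, but it does not change the number of document pairs that must be compared.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def term_counts(text):
    """Map a document to a sparse bag of words (term -> count)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity of two sparse term-count vectors (assumed measure)."""
    # Iterate over the smaller vector; terms missing from either side add 0.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_similarities(documents):
    """Compare every unordered pair of documents with each other."""
    vectors = [term_counts(d) for d in documents]
    return {
        (i, j): cosine_similarity(vectors[i], vectors[j])
        for i, j in combinations(range(len(vectors)), 2)
    }


def pair_count(n):
    """Number of unordered document pairs: n * (n - 1) / 2, quadratic in n."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    docs = ["red running shoes", "blue running shoes", "stainless steel kettle"]
    for pair, score in brute_force_similarities(docs).items():
        print(pair, round(score, 3))
    for n in (1_000, 100_000, 1_000_000):
        print(n, "documents ->", pair_count(n), "comparisons")
```

Under these assumptions, one thousand documents require 499,500 pairwise comparisons, which is consistent with the "order of a million" figure above, and one hundred thousand documents require roughly five billion. Each hundredfold increase in library size multiplies the number of comparisons by about ten thousand.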