Finding material in a large collection of documents that is relevant to a small set of input documents (i.e., similar document retrieval) is a well-known and long-studied problem. The current best approach to this problem measures cosine similarity against a keyword index of the document corpus. However, these indexes, once built, make a limiting assumption about the granularity of the similarity being searched for: that each document in the input set is similar “overall” to documents in the large collection. In other words, the overall term frequencies of two documents must be relatively similar before a match can be discovered. For long documents, this is not a reasonable assumption. The search index is built at the wrong level of detail to provide information on the individual sentences and paragraphs that make up each document.
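The granularity issue can be illustrated with a minimal sketch. It compares a short query against a long document twice: once using term frequencies for the whole document, and once paragraph by paragraph. The tokenizer, the raw term-frequency weighting (no IDF, no stop-word removal), the helper names, and the example texts are all illustrative assumptions, not the indexing scheme of any particular system.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Lowercase the text and count alphabetic tokens (a deliberately simple tokenizer)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical query and a long document in which only one paragraph is on topic.
query = "solar panel efficiency depends on cell temperature"
long_document_paragraphs = [
    "The village council met on Tuesday to discuss the annual budget and road repairs.",
    "Solar panel efficiency drops as cell temperature rises during summer afternoons.",
    "Local farmers reported a strong harvest of wheat, barley, and oats this year.",
    "The school choir will perform at the town hall before the winter holidays.",
]

query_tf = term_frequencies(query)

# Document-level comparison: one term-frequency vector for the whole document.
whole_doc_score = cosine_similarity(
    query_tf, term_frequencies(" ".join(long_document_paragraphs))
)

# Paragraph-level comparison: one term-frequency vector per paragraph.
paragraph_scores = [
    cosine_similarity(query_tf, term_frequencies(p)) for p in long_document_paragraphs
]
best_paragraph_score = max(paragraph_scores)

print(f"whole document: {whole_doc_score:.3f}")
print(f"best paragraph: {best_paragraph_score:.3f}")
```

In this toy example the relevant paragraph scores noticeably higher on its own than the document does as a whole, because the unrelated paragraphs dilute the document-level term frequencies. An index built only at the document level cannot surface that difference.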