Embodiments of a system and method for using an exemplar document or search query to retrieve relevant documents from an inverted index of a large corpus of documents are presented herein.
With the ever-increasing amount of data stored in electronic form, it is becoming more important to search effectively through large amounts of data to find relevant data. With respect to text data, it is important to quickly and accurately search a large number of text documents (a “corpus”) to find documents of relevance to the searcher.
Various methods are known that allow a searcher to enter search terms or utilize an exemplar document to search for other documents related to the given search terms or exemplar document. One such method in the prior art involves the ranking of individual search terms by a TF-IDF (Term Frequency-Inverse Document Frequency) score. In such a method, each search term in a query or exemplar document is assigned a TF-IDF score based on 1) the frequency of the term in the query (the “TF” component) and 2) the inverse frequency (rarity) of the term in the documents of the corpus (the “IDF” component).
In general, the TF component will be higher for a given term if the term appears a relatively large number of times in the search query, because a term that occurs frequently in the search query is usually of high importance to the searcher. Conversely, the IDF component for a given term will be lower if the term appears in a relatively large number of documents in the corpus, because a term that is ubiquitous across documents is often a very common word of little value to the searcher. For instance, common English words such as the articles “the” or “an” will occur in nearly every document of English prose and thus will have a low IDF component. Since searchers are usually not concerned with the presence of such ubiquitous words, the low IDF component keeps the overall TF-IDF score of these common words low and thereby minimizes their effect on the search results.
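By way of illustration only, the per-term scoring described above may be sketched as follows. The function name tf_idf_scores, the representation of each document as a list of tokens, and the smoothed logarithmic form of the IDF component are assumptions made for this sketch; prior-art systems use a variety of TF and IDF weightings.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, corpus):
    """Score each distinct query term as (count in query) * (rarity in corpus).

    query_terms: list of tokens from the search query or exemplar document.
    corpus: list of documents, each a list of tokens (an assumed representation).
    """
    n_docs = len(corpus)
    tf = Counter(query_terms)                 # TF component: raw count in the query
    doc_sets = [set(doc) for doc in corpus]   # distinct terms per document

    scores = {}
    for term, count in tf.items():
        # Document frequency: number of corpus documents containing the term.
        df = sum(1 for terms in doc_sets if term in terms)
        # Assumed smoothed IDF: zero when the term is in every document,
        # largest when the term is in none.
        idf = math.log((1 + n_docs) / (1 + df))
        scores[term] = count * idf
    return scores


corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "inverted", "index"],
]
query = ["the", "inverted", "index", "index"]
scores = tf_idf_scores(query, corpus)
```

In this example the term “the” appears in every document and so receives a score of zero, while “index”, which appears twice in the query and in only one document, receives the highest score.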
After a TF-IDF score has been calculated for each search term in the query, the documents of the corpus are ranked to determine their likely relevance to the searcher. This is often performed using a vector space model, in which the query and each document are represented as vectors of term weights, along with cosine similarity to determine the similarity between the documents of the corpus and the search query/exemplar document.
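The vector space ranking described above may likewise be sketched in simplified form. The function names, the sparse dictionary representation of each vector, the use of each document's own term counts as its TF component, and the smoothed IDF are all assumptions made for this sketch. For clarity, the sketch scores every document directly rather than consulting an inverted index.

```python
import math
from collections import Counter


def tf_idf_vector(terms, idf):
    """Build a sparse TF-IDF vector (term -> weight); terms absent from the corpus are dropped."""
    counts = Counter(terms)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(query_terms, corpus):
    """Return (similarity, document index) pairs, most similar first."""
    n_docs = len(corpus)
    # Document frequency of every term in the corpus.
    df = Counter(t for doc in corpus for t in set(doc))
    # Assumed smoothed IDF, as one of several common weightings.
    idf = {t: math.log((1 + n_docs) / (1 + d)) for t, d in df.items()}
    q_vec = tf_idf_vector(query_terms, idf)
    scored = [(cosine(q_vec, tf_idf_vector(doc, idf)), i)
              for i, doc in enumerate(corpus)]
    # Sort by descending similarity, breaking ties by document index.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "inverted", "index"],
]
ranked = rank_documents(["inverted", "index", "search"], corpus)
```

Here the third document shares both weighted query terms and is ranked first, while the other two documents share no weighted term with the query and receive a similarity of zero.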