A document search involves determining which documents are relevant to a query. Traditionally, the search process starts with a query and a corpus of documents, and compares the words in the query with the words in a document. A scoring algorithm assigns scores to the documents in the corpus based on which words in the query appear in the various documents, and with what frequency. The documents are then ranked based on their scores, and the results of the search are presented.
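The traditional process described above can be illustrated with a minimal sketch. It assumes a score that is simply the sum of the counts of each distinct query word in a document; the function names `tokenize`, `score`, and `rank` and the three-document corpus are hypothetical and are chosen only for illustration. Practical systems typically use more elaborate weighting.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query, document):
    """Sum how often each distinct query word appears in the document."""
    counts = Counter(tokenize(document))
    return sum(counts[term] for term in set(tokenize(query)))


def rank(query, corpus):
    """Return (document id, score) pairs, highest score first."""
    scored = [(doc_id, score(query, text)) for doc_id, text in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical corpus, used only to illustrate the ranking step.
corpus = {
    "a": "The cat sat on the mat. The cat slept.",
    "b": "Dogs chase the cat.",
    "c": "Stock prices rose today.",
}

results = rank("cat mat", corpus)
print(results)  # [('a', 3), ('b', 1), ('c', 0)]
```

Document "a" ranks first because it contains "cat" twice and "mat" once. The score depends only on which words match and how often, not on what the words mean.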
However, a simple comparison of words in the query with words in the document often leads to unsatisfactory results. The meaning of some words may be ambiguous. For example, in English, “cold” may refer to an illness or a weather condition. If a query contains this word, then scoring documents based on how frequently the word “cold” appears in each document is likely to identify some documents that relate to winter weather and others that relate to rhinovirus, with nothing in the scores to indicate which sense the searcher intended.
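The ambiguity can be seen in a small example. The sketch below assumes a score equal to the raw count of the word "cold" in each document; the helper `count_term` and the two sample sentences are hypothetical.

```python
import re
from collections import Counter


def count_term(term, text):
    """Count occurrences of a single lowercase word in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))[term]


# Two hypothetical documents that use "cold" in different senses.
docs = {
    "weather": "A cold front brings cold air and snow this winter.",
    "illness": "The common cold is caused by a rhinovirus; a cold usually clears in a week.",
}

scores = {name: count_term("cold", text) for name, text in docs.items()}
print(scores)  # {'weather': 2, 'illness': 2}
```

Both documents receive the same score, so a frequency-based ranking cannot separate the weather sense from the illness sense.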
Search systems that focus on comparing words in a query with words in a document often fail to identify documents that contain the type of subject matter that a searcher is looking for. There may be cues that would guide the search system to the right document, but these cues are often ignored.