1. Field of the Invention
The present invention relates to a method that searches for documents using a textual query. In particular, the present invention relates to methods for such searches that are not based on creating an explicit inverted index for search words.
2. Discussion of the Related Art
Allowing a corpus of documents to be searched based on given words typically involves creating an inverted index, which maps each word in a selected vocabulary to a list of documents containing that word. (An index maps a word to its occurrences in a document.) Searching for multiple words in the vocabulary then involves creating a union of the corresponding lists from the inverted index and listing the resulting documents in decreasing order of relevance. Relevance may be determined based on a number of factors, such as the number of words in the search query that are found in each document. The index itself is typically augmented, for each document, with information about each instance of the word in the document, such as the word's location, type, and font used.
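The conventional approach described above may be illustrated by the following minimal sketch. It is offered only as an assumption-laden illustration, not as any particular search engine's implementation: the function names (`tokenize`, `build_inverted_index`, `search`), the whitespace tokenizer, and the three sample documents are all hypothetical, and relevance is reduced to a single factor, namely the number of distinct query words found in each document.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def build_inverted_index(docs):
    """Map each word to the set of document ids whose text contains that word."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Union the posting lists of the query words and rank the documents
    by how many distinct query words each one contains (ties broken by id)."""
    scores = defaultdict(int)
    for word in set(tokenize(query)):
        for doc_id in index.get(word, ()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical corpus, for illustration only.
docs = {
    1: "new auto show opens in york",
    2: "car show in new york city",
    3: "local weather report",
}
index = build_inverted_index(docs)
results = search(index, "new york auto show")
```

In this sketch, document 3 never appears in the results because none of its words occur in the query, while documents 1 and 2 are ordered by the count of matching query words. A practical index would additionally store per-occurrence information such as word location.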
Creating an inverted index and performing a relevance computation based on literal occurrences of the words in the text can be brittle, and can often lead to results that do not reflect what the searcher intends to find. For example, for the search query “new york auto show,” if the search engine looks for documents that contain the words “new,” “york,” “auto,” and “show,” the search engine may home in on information about a new auto show in York, England, but may miss results that relate to a “car show in New York City.” Such a result occurs because a strict literal interpretation of the textual query may overlook the similar meanings of the words “car” and “auto” in some contexts. Also, focusing only on occurrences of the individual words would miss the fact that the words “new york,” when in proximity, become a term that has a different meaning than when these words appear individually, interspersed among other text. Thus, a method for searching for documents that is not based on conventional literal processing of the textual query is desired.
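The brittleness noted above can be seen in a small sketch of purely literal scoring. The function name `literal_score` and the two sample document strings are hypothetical and chosen only to mirror the “new york auto show” example; the scoring rule (count of distinct query words present in the document) is an assumed simplification of a literal relevance computation.

```python
def literal_score(query, document):
    """Count how many distinct query words literally occur in the document."""
    doc_words = set(document.lower().split())
    return sum(1 for word in set(query.lower().split()) if word in doc_words)

query = "new york auto show"

# Hypothetical documents, for illustration only.
york_doc = "a new auto show opens in york england"
nyc_doc = "annual car show in new york city"

york_score = literal_score(query, york_doc)  # all four query words occur
nyc_score = literal_score(query, nyc_doc)    # "auto" does not occur; "car" is not matched
```

Under this literal scoring, the document about York, England outranks the document about a car show in New York City, because the scoring neither relates “car” to “auto” nor treats “new york” as a single term.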