The World Wide Web is a distributed database having billions of data records accessible through the Internet. Search engines are commonly used to search the information available on computer networks, such as the World Wide Web, to enable users to locate data records of interest. Web pages, hypertext documents, and other data records from various sources, accessible via the Internet or other networks, are typically collected by a crawler. Crawlers may collect data records from the sources using various methods and algorithms. For example, a crawler may follow hyperlinks in a collected hypertext document to collect other data records. The data records retrieved by the crawlers are stored in a database or a plurality of databases.
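By way of illustration only, the link-following behavior described above can be sketched as a breadth-first traversal. The names `LinkExtractor`, `crawl`, `fetch`, and `PAGES` are hypothetical, and the in-memory `PAGES` dictionary stands in for real network requests. An actual crawler would add concerns not shown here, such as politeness delays and address normalization.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a hypertext document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, limit=100):
    """Breadth-first crawl: collect a record, then follow its hyperlinks."""
    collected = {}              # address -> record text; stands in for the database
    frontier = deque([seed])    # addresses waiting to be collected
    while frontier and len(collected) < limit:
        address = frontier.popleft()
        if address in collected:
            continue            # already stored; do not collect twice
        record = fetch(address)
        if record is None:
            continue            # missing or unreachable record
        collected[address] = record
        parser = LinkExtractor()
        parser.feed(record)
        frontier.extend(link for link in parser.links if link not in collected)
    return collected


# Hypothetical three-page network used in place of real HTTP requests.
PAGES = {
    "a.html": '<a href="b.html">b</a> <a href="c.html">c</a>',
    "b.html": '<a href="a.html">back</a>',
    "c.html": "no links here",
}

store = crawl("a.html", PAGES.get)
print(list(store))
```

Starting from the single seed `a.html`, the sketch reaches all three records because each one is linked, directly or indirectly, from the seed.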
The data records are typically indexed by an indexer, which builds a searchable index of the documents in the database. Known methods for indexing the database may include inverted files, vector spaces, suffix structures, and hybrids thereof. For example, each web page may be broken down into words and respective locations of each word on the page. The pages are then indexed by the words and their respective locations. A primary index of the entire database may be broken down into a plurality of sub-indices, and each sub-index is sent to a search node.
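The word-and-location indexing described above corresponds to a positional inverted file, which may be sketched as follows. This is a minimal illustration under stated assumptions: whitespace tokenization, integer document identifiers, and a split of the primary index by document identifier are all choices made for the example, and the names `build_index`, `partition`, and `DOCS` are hypothetical. The source does not specify how sub-indices are formed.

```python
from collections import defaultdict


def build_index(documents):
    """Map each word to {doc_id: [positions]}, i.e. a positional inverted file."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(position)
    return index


def partition(index, node_count):
    """Split the primary index into sub-indices, one per search node."""
    sub_indices = [{} for _ in range(node_count)]
    for word, postings in index.items():
        for doc_id, positions in postings.items():
            node = doc_id % node_count   # assumed split by document identifier
            sub_indices[node].setdefault(word, {})[doc_id] = positions
    return sub_indices


# Hypothetical three-document database.
DOCS = {
    0: "search engines index web pages",
    1: "web crawlers collect web pages",
    2: "users search the web",
}

primary = build_index(DOCS)
subs = partition(primary, 2)
print(primary["web"])   # {0: [3], 1: [0, 3], 2: [3]}
```

Each entry records both which documents contain a word and where in the document it occurs, which is the information later needed to compute term distances.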
To use the search engine, a user typically enters one or more search terms or keywords, which are sent to a dispatcher. The dispatcher compiles a list of search nodes in a cluster to execute the query, and forwards the query to the selected search nodes. The search nodes search their respective parts of the primary index and return sorted search results, each with a document identifier. The dispatcher merges the received results to produce a final result set, usually sorted by relevance score, that is displayed to the user.
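The dispatch-and-merge flow described above may be sketched as follows. The sketch assumes that every node is selected, that each sub-index maps a word to per-document term counts, and that a summed term count serves as the relevance score. The names `search_node`, `dispatch`, and `NODES` are hypothetical. Because each node returns its results already sorted, the dispatcher can combine them with a k-way merge rather than a full re-sort.

```python
import heapq
from itertools import islice


def sort_key(result):
    """Order results by descending score, then ascending document identifier."""
    score, doc_id = result
    return (-score, doc_id)


def search_node(sub_index, terms):
    """Score the documents in one sub-index and return sorted (score, doc_id) pairs."""
    scores = {}
    for term in terms:
        for doc_id, count in sub_index.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0) + count
    return sorted(((s, d) for d, s in scores.items()), key=sort_key)


def dispatch(query, nodes, top_k=10):
    """Forward the query to every node and merge the sorted partial results."""
    terms = query.lower().split()
    partial_results = [search_node(sub_index, terms) for sub_index in nodes]
    merged = heapq.merge(*partial_results, key=sort_key)
    return list(islice(merged, top_k))


# Hypothetical sub-indices: word -> {doc_id: term count in that document}.
NODES = [
    {"web": {0: 1, 2: 1}, "search": {0: 1, 2: 1}},
    {"web": {1: 2}, "pages": {1: 1}},
]

print(dispatch("web pages", NODES))   # [(3, 1), (1, 0), (1, 2)]
```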
The relevance score is a function of the query itself and the type of document produced. Factors that affect the relevance score may include: a) a static relevance score for the document, such as link cardinality and page quality; b) placement of the search terms in the document, such as in titles, metadata, and the document web address; c) document rank, such as the number of external data records referring to the document and the “level” of those data records; and d) document statistics, such as query term frequency in the document, global term frequency, and term distances within the document. For example, “term frequency inverse document frequency” (TFIDF) is a statistical technique suitable for evaluating how important a word is to a document. The importance increases proportionally to the number of times the word appears in the document, but is offset by how common the word is across all of the documents in the collection, referred to as the “corpus.”
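One common TFIDF weighting, shown here only as a sketch, multiplies the relative frequency of a term in a document by the logarithm of the inverse fraction of corpus documents containing that term. Other variants (for example, smoothed or log-scaled term frequencies) exist, and the source does not state which is intended. The names `tf_idf` and `CORPUS` are hypothetical.

```python
import math


def tf_idf(term, document, corpus):
    """Return the TFIDF weight of `term` in `document`, relative to `corpus`."""
    words = document.lower().split()
    if not words:
        return 0.0
    # Term frequency: share of the document's words that are the term.
    tf = words.count(term) / len(words)
    # Document frequency: number of corpus documents containing the term.
    containing = sum(1 for doc in corpus if term in doc.lower().split())
    if containing == 0:
        return 0.0
    # Inverse document frequency: rarer terms receive a larger weight.
    idf = math.log(len(corpus) / containing)
    return tf * idf


# Hypothetical three-document corpus.
CORPUS = [
    "the web is large",
    "the crawler reads the web",
    "the index maps words to pages",
]

common = tf_idf("the", CORPUS[1], CORPUS)      # appears in every document
rare = tf_idf("crawler", CORPUS[1], CORPUS)    # appears in one document
print(common, rare)
```

In this corpus the word "the" occurs twice in the second document yet receives a weight of zero, because it appears in every document, while "crawler" occurs once and receives a positive weight.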
Some known searching processes expand or rewrite the query to include other terms. However, known expansion processes may introduce erroneous expanded terms if the original query contains spelling errors or if there is a vocabulary mismatch between the query and the document collection, and such terms result in the retrieval of non-relevant documents. Other processes return erroneous expansion results if the initially returned documents are not the most relevant.
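The dependence of expansion quality on the initially returned documents can be illustrated with a sketch of feedback-based expansion, in which terms that are frequent in the top initial results are appended to the query. The source does not name a particular expansion process, so this technique is an assumption chosen for illustration, as are the names `expand_query` and `DOCS` and the simple term-overlap ranking.

```python
from collections import Counter


def expand_query(query, documents, feedback_size=2, extra_terms=1):
    """Append the most frequent non-query terms from the top initial results."""
    terms = query.lower().split()

    def overlap(text):
        # Initial retrieval score: occurrences of query terms in the document.
        words = text.lower().split()
        return sum(words.count(term) for term in terms)

    # Rank all documents and keep the top matches as the feedback set.
    ranked = sorted(documents, key=overlap, reverse=True)
    feedback = [doc for doc in ranked[:feedback_size] if overlap(doc) > 0]

    # Count candidate expansion terms across the feedback set.
    counts = Counter(
        word
        for doc in feedback
        for word in doc.lower().split()
        if word not in terms
    )
    return terms + [word for word, _ in counts.most_common(extra_terms)]


# Hypothetical collection in which the query term has two senses.
DOCS = [
    "jaguar car engine car dealer",
    "jaguar car price car review",
    "jaguar cat habitat rainforest",
]

print(expand_query("jaguar", DOCS))   # ['jaguar', 'car']
```

Here the two top-ranked documents both concern automobiles, so the query is expanded with "car". If the user intended the animal, the expanded query now steers retrieval further from the relevant document.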