Search engines are a powerful tool for sifting through vast amounts of stored information in a structured and discriminating manner. Popular search engines such as MSN®, Google® and Yahoo!® service tens of millions of queries for information every day. A typical search engine for use in finding documents on the World Wide Web operates by a coordinated set of programs including a spider (also referred to as a “crawler” or “bot”) that gathers information from web pages on the World Wide Web in order to create entries for a search engine index, or log; an indexing program that creates the log from the web pages that have been read; and a search program that receives a search query, compares it to the entries in the log, and returns results appropriate to the search query.
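The three cooperating programs described above can be illustrated with a minimal sketch. It assumes a tiny in-memory collection of pages in place of the World Wide Web, and the names `PAGES`, `crawl`, `build_index` and `search` are hypothetical, chosen only for illustration. The search step here requires every query term to match, which is one simple choice and not the only possible one.

```python
from collections import defaultdict

# Hypothetical in-memory "web": URL -> page text.
# A real spider would fetch pages over HTTP and follow links.
PAGES = {
    "http://example.com/a": "cheap flights to paris",
    "http://example.com/b": "paris hotel deals",
    "http://example.com/c": "cheap hotel in rome",
}

def crawl(pages):
    """Spider: yield (url, text) for every page in the collection."""
    for url, text in pages.items():
        yield url, text

def build_index(crawled):
    """Indexing program: map each term to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in crawled:
        for term in text.lower().split():
            index[term].add(url)
    return index

def search(index, query):
    """Search program: return the URLs that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    # index.get avoids creating empty entries for unknown terms.
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)

index = build_index(crawl(PAGES))
print(search(index, "cheap hotel"))
```

In this sketch the dictionary returned by `build_index` plays the role of the index, or log, and `search` compares the query against it.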
A current area of significant research in the field of search engine technology is how to improve the efficiency and quality of results for a given search query. So-called concept-based searching applies statistical analysis to various search criteria in order to identify and suggest alternative search queries that are semantically closely related to the input search query. Identifying alternative, highly correlated search queries can help focus and improve the search results for a given search. Moreover, companies and advertisers present advertising when particular queries are entered. It would be extremely beneficial to such companies and advertisers to associate their advertising not only with particular queries but also with other semantically related queries.
In an example of a prior art system employing concept-based searching, queries are correlated with one another according to the degree to which the results returned for the respective queries are the same. Thus, if first and second queries return nearly identical search results, these two queries would be considered highly correlated with each other. Another popular search technology analyzes the semantics of the input queries themselves and compares the queries to the entries in the database log. If two queries are found to be semantically related, then the search results returned by the respective queries should be highly correlated.
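The result-overlap idea can be sketched as follows. The text does not say how the degree of overlap is measured, so this example assumes the Jaccard coefficient (shared results divided by all distinct results) as one plausible measure. The function name `result_overlap` and the document identifiers are hypothetical.

```python
def result_overlap(results_a, results_b):
    """Return the Jaccard overlap of two result lists, in the range 0.0 to 1.0.

    Assumption: overlap is measured on sets of result identifiers,
    ignoring ranking. The source does not specify the measure.
    """
    a, b = set(results_a), set(results_b)
    if not a and not b:
        return 0.0  # two empty result sets carry no evidence of correlation
    return len(a & b) / len(a | b)

# Hypothetical result lists for two queries.
first = ["doc1", "doc2", "doc3"]
second = ["doc2", "doc3", "doc4"]
print(result_overlap(first, second))
```

Under this assumption, two queries whose result sets are nearly identical score close to 1.0 and would be treated as highly correlated.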
In search engines used for web searches and other database searches, long queries are often difficult to handle. Conventional approaches to searching treat all query terms as a conjunction, so that a result must match every term. Accordingly, long queries may produce no results. Moreover, processing long queries is computationally expensive. One option is to scan all the entries in the log, which may often number in the millions, and compare each entry with the original query. Each of these comparisons is itself an expensive operation (quadratic in the length of the strings). Therefore, this approach is not feasible for large query logs and long strings.
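Both difficulties can be made concrete with a short sketch. The first function shows conjunctive matching, where one unmatched term is enough to reject a document. The second part shows a brute-force scan of a log. The text does not name the comparison used, so Levenshtein edit distance is assumed here as one common string comparison whose cost is quadratic in the string lengths. All function names and sample strings are hypothetical.

```python
def matches_all_terms(query, document):
    """Conjunctive match: True only if every query term occurs in the document."""
    doc_terms = set(document.lower().split())
    return all(term in doc_terms for term in query.lower().split())

def edit_distance(s, t):
    """Levenshtein distance, computed row by row in O(len(s) * len(t)) time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def closest_entries(query, log, k=3):
    """Brute-force scan: compare the query with every log entry, keep the k nearest."""
    scored = sorted((edit_distance(query, entry), entry) for entry in log)
    return scored[:k]

# A long query fails the conjunctive test because of its extra terms.
print(matches_all_terms("cheap hotel in rome near the colosseum", "cheap hotel in rome"))

# Scanning a (tiny, hypothetical) query log for the nearest entry.
LOG = ["paris hotels", "rome hotel", "cheap flights"]
print(closest_entries("paris hotel", LOG, k=1))
```

The scan performs one quadratic comparison per log entry, so its total cost grows with both the number of entries and the lengths of the strings, which is the infeasibility noted above for logs with millions of entries.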