1. Field of the Invention
The present invention is directed to methods for finding semantically related search engine queries.
2. Description of the Related Art
Online search engines provide an enormously powerful tool for accessing the vast amount of information available on the Internet in a structured and selective manner. Popular search engines such as MSN®, Google® and Yahoo!® service tens of millions of queries for information every day. A typical search engine operates through a coordinated set of programs including a spider (also referred to as a “crawler” or “bot”) that gathers information from web pages on the World Wide Web in order to create entries for a search engine index; an indexing program that creates the index from the web pages that have been read; and a search program that receives a search query, compares it to the entries in the index, and returns results appropriate to the search query.
A current area of significant research in the field of search engine technology is how to improve the efficiency and quality of results for a given search query. So-called concept-based searching involves using statistical analysis on various search criteria in order to identify and suggest alternative search queries that are highly semantically related to the input search query. Identifying alternative, highly correlated search queries can help focus and improve the search results for a given search. Moreover, companies and advertisers present advertising when particular queries are entered. It would be extremely beneficial to such companies and advertisers to associate their advertising with particular queries as well as other semantically related queries.
In an example of a prior art system employing concept-based searching, queries are correlated depending on the degree to which the results returned for the respective queries are the same. Thus, if first and second queries return nearly identical search results, these two queries would be considered highly correlated with each other. An example of concept-based searching is set forth in a paper by H. Daume and E. Brill, entitled, “Web Search Intent Induction via Automatic Query Reformulation,” published for the Human Language Technology Conference/North American Chapter of the Association for Computational Linguistics (HLT/NAACL), Boston, Mass. (2004).
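As a rough illustration of correlating queries by shared results, the following sketch scores two queries by the Jaccard overlap of their result lists. The choice of Jaccard overlap, the function name `result_overlap`, and the sample result identifiers are assumptions made for illustration; they are not taken from the cited paper.

```python
def result_overlap(results_a, results_b):
    """Jaccard overlap: shared results divided by all distinct results."""
    set_a, set_b = set(results_a), set(results_b)
    if not set_a and not set_b:
        return 0.0  # no results on either side: treat as uncorrelated
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical result lists for two queries (identifiers are placeholders).
results_first = ["u1", "u2", "u3", "u4"]
results_second = ["u2", "u3", "u4", "u5"]

overlap = result_overlap(results_first, results_second)
```

A score near 1 would indicate nearly identical result lists, which under this model marks the two queries as highly correlated.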
Another example of concept-based searching examines click-through data as an indicator of related search queries. This model inspects the links that are clicked on from the results of different search queries. If two different queries lead to users clicking on the same URLs, then these two queries would be considered highly correlated. An example of click-through concept-based searching is disclosed in a paper by D. Beeferman and A. Berger, entitled, “Agglomerative Clustering of a Search Engine Query Log,” published for the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, Mass. (2000).
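The click-through idea can be sketched as follows, assuming a log of (query, clicked URL) pairs. Cosine similarity over per-URL click counts is used here only as a simple stand-in measure; it is not the agglomerative clustering procedure of the cited paper, and the log entries and names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def click_similarity(clicks_a, clicks_b):
    """Cosine similarity between two click-count mappings keyed by URL."""
    dot = sum(count * clicks_b.get(url, 0) for url, count in clicks_a.items())
    norm_a = math.sqrt(sum(c * c for c in clicks_a.values()))
    norm_b = math.sqrt(sum(c * c for c in clicks_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical click log: (query, clicked URL) pairs.
click_log = [
    ("cheap flights", "A"), ("cheap flights", "A"), ("cheap flights", "B"),
    ("low cost airfare", "A"), ("low cost airfare", "B"),
    ("pizza recipe", "C"),
]

# Count clicks per URL for each query.
clicks_by_query = defaultdict(Counter)
for query, url in click_log:
    clicks_by_query[query][url] += 1

related = click_similarity(clicks_by_query["cheap flights"],
                           clicks_by_query["low cost airfare"])
unrelated = click_similarity(clicks_by_query["cheap flights"],
                             clicks_by_query["pizza recipe"])
```

Queries whose users click the same URLs score close to 1, while queries with no clicked URLs in common score 0.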
Another promising semantic-based search technology relates to analyzing the input queries themselves to reveal patterns, trends and periodicities over specified time sequences. For example, Vlachos, M., Meek, C., Vagena, Z. and Gunopulos, D. published a paper entitled, “Identifying Similarities, Periodicities and Bursts for Online Search Queries” for the International Conference on Management of Data (SIGMOD), Paris, France (2004) (“Vlachos et al.”), which paper is incorporated by reference herein in its entirety. Vlachos et al. note that different events have different temporal search frequencies. For example, the frequency of the query “cinema” has a peak every weekend, while the frequency of the query “Easter” builds to a single peak each spring and then drops abruptly. The theory behind temporal correlation is that if two search queries exhibit sufficiently similar temporal patterns, they are likely to be semantically related. Vlachos et al. use the query logs stored on one or more servers associated with a search engine (MSN® in their study) to build a time series for each actual query, where the elements of the time series are the number of times that query was searched on a given day.
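The time series described above can be sketched as a per-query list of daily counts. The log format of (query, date) pairs, the function name `build_daily_series`, and the sample entries are assumptions for illustration; the actual query log format is not specified in the text.

```python
from collections import defaultdict
from datetime import date


def build_daily_series(log, start, num_days):
    """Return {query: [count on day 0, count on day 1, ...]} for the window."""
    series = defaultdict(lambda: [0] * num_days)
    for query, day in log:
        offset = (day - start).days
        if 0 <= offset < num_days:  # ignore entries outside the window
            series[query][offset] += 1
    return dict(series)


# Hypothetical query log entries: (query, date the query was issued).
query_log = [
    ("cinema", date(2004, 1, 3)),
    ("cinema", date(2004, 1, 3)),
    ("cinema", date(2004, 1, 4)),
    ("easter", date(2004, 1, 1)),
    ("cinema", date(2004, 2, 1)),  # falls outside the 7-day window below
]

daily = build_daily_series(query_log, start=date(2004, 1, 1), num_days=7)
```

Each element of a series is the number of times that query was searched on the corresponding day, which is the input assumed by the temporal matching techniques discussed next.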
Using Fourier analysis, Vlachos et al. represent the temporal periodicities in a query's rate over time by Fourier coefficients, and then apply time-series matching techniques to identify other queries with very similar temporal patterns. The matching techniques they employ measure temporal similarity based on the Euclidean distance between the Fourier coefficients. Under this framework, they describe an approach to find the most similar queries to a given query using the several best Fourier coefficients for each query.
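A minimal sketch of this kind of matching is given below, assuming daily count series of equal length. It computes a plain discrete Fourier transform, picks the largest-magnitude coefficients of the reference query, and measures Euclidean distance at those positions only. Comparing both queries at the reference query's best positions is a simplification made for this sketch, and the function names and sample series are hypothetical; the procedure in Vlachos et al. may differ in its details.

```python
import cmath
import math


def dft(series):
    """Discrete Fourier transform of the mean-removed series, scaled by 1/sqrt(n)."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(centered)) / math.sqrt(n)
        for k in range(n)
    ]


def best_positions(coeffs, k):
    """Indices of the k largest-magnitude coefficients in the first half of the spectrum."""
    half = range(1, len(coeffs) // 2 + 1)
    return sorted(half, key=lambda i: abs(coeffs[i]), reverse=True)[:k]


def fourier_distance(coeffs_a, coeffs_b, positions):
    """Euclidean distance between two spectra restricted to the given positions."""
    return math.sqrt(sum(abs(coeffs_a[i] - coeffs_b[i]) ** 2 for i in positions))


# Hypothetical 28-day series: one with a weekend peak, one with a single ramp.
weekly = [1, 1, 1, 1, 1, 5, 5] * 4
ramp = list(range(14)) + [0] * 14

weekly_coeffs = dft(weekly)
ramp_coeffs = dft(ramp)
positions = best_positions(weekly_coeffs, 3)

same = fourier_distance(weekly_coeffs, weekly_coeffs, positions)
different = fourier_distance(weekly_coeffs, ramp_coeffs, positions)
```

For the weekly series, the strongest coefficient falls at index 4, corresponding to four cycles in 28 days. A small distance suggests similar periodic behavior, and a large one suggests dissimilar temporal patterns.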
The temporal pattern for search engine queries varies over time. For example, the volume of searches is greater during the day than overnight, and the volume of searches is greater during weekdays than on weekends. Models which attempt to identify semantic query matches may have an artificially high correlation between two searches because this natural variance over time is not factored in. For example, FIG. 1 is a sample graph of the number of occurrences of two different search queries measured on a daily basis over a period of a month. The first plot, na, is the number of occurrences of the first query and the second plot, nb, is the number of occurrences of the second query. As can be seen, both plots show decreased occurrences on the weekends relative to the weekdays. Under typical temporal models for finding semantically related search queries, this sort of parallel decreased weekend activity may lead to an artificially high semantic correlation between the two queries, when in fact, this correlation may instead be due to the natural variation of search queries over time.
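The effect can be reproduced with a small sketch. Two hypothetical daily count series are built from the same weekday/weekend traffic profile plus query-specific variation that is deliberately unrelated. Pearson correlation is used here as one representative temporal similarity measure; the counts, the profile, and all names are invented for illustration and are not the data of FIG. 1.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical overall traffic profile: five weekdays, then a quieter weekend.
weekly_profile = [100] * 5 + [60] * 2

# Hypothetical query-specific variation, chosen to be negatively related.
variation_a = [3, -2, 1, -1, 2, 0, -3]
variation_b = [-1, 2, -3, 0, 1, -2, 3]

# Four weeks of daily counts for the two queries.
n_a = [weekly_profile[d % 7] + variation_a[d % 7] for d in range(28)]
n_b = [2 * weekly_profile[d % 7] + variation_b[d % 7] for d in range(28)]

raw_correlation = pearson(n_a, n_b)
specific_correlation = pearson(variation_a * 4, variation_b * 4)
```

In this constructed example the raw counts correlate strongly because both follow the shared weekend dip, even though the query-specific components are negatively correlated.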