Mining user web search activity has a potentially broad range of applications, including web result pre-fetching, automatic search query reformulation, click spam detection, estimation of document relevance and prediction of user satisfaction. This analysis is difficult because the data recorded by search engines while users interact with them, although abundant, are very noisy.
The Internet logs that record user actions are large sources of implicit information about user web search interests. In particular, search engines keep records of their interactions with users in click-through logs, which record a temporary user id (obtained through login or cookies), the queries issued by the user, the results returned by the engine and the resulting user clicks.
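As a minimal sketch of the four pieces of information such a click-through log entry carries, the record below uses hypothetical names (`ClickThroughRecord`, `clicked_ranks`, the example URLs and user id) that are assumptions for illustration only and do not describe any particular search engine's log format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClickThroughRecord:
    """One hypothetical click-through log entry (field names are assumed)."""
    user_id: str                 # temporary id obtained through login or cookies
    query: str                   # query issued by the user
    results: List[str]           # URLs returned by the engine, in rank order
    clicks: List[str] = field(default_factory=list)  # URLs the user clicked

    def clicked_ranks(self) -> List[int]:
        """Return the 1-based rank of each clicked URL that was in the results."""
        return [self.results.index(url) + 1
                for url in self.clicks if url in self.results]


record = ClickThroughRecord(
    user_id="cookie-123",
    query="golden gate bridge",
    results=["https://example.com/a", "https://example.com/b"],
    clicks=["https://example.com/b"],
)
print(record.clicked_ranks())
```

A record with an empty `clicks` list is still informative, since a query with no clicks is itself a signal about the returned results.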
There are many benefits to tracking and analyzing this search engine behavior, including behavior involving sequences of queries that relate to a single query intent or information need. For example, one benefit is analyzing the effectiveness of a search result, i.e., whether the user received the search results they requested. Existing techniques analyze large pools of information related to common search requests. For example, current techniques apply analysis operations to the large amount of information available on common search terms, where many users enter the same search terms and the collected tracking information covers many varied instances of users interacting with the search results for the common search term. By way of example, a common search term may be the name of a famous person, event or location; “The Golden Gate Bridge” is a well-recognized landmark and may be the subject of a large number of user searches.
In actuality, however, there exists a long tail of search sessions and user interactivity that cannot be analyzed by current analysis techniques. These long tail search sessions represent specific or individualistic search requests that the general searching public does not issue in great volume. Therefore, these search sessions do not generate the same pool size of data, and existing data analysis operations are inapplicable.
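The contrast between common search terms and the long tail can be illustrated with a simple volume count. This is only a sketch: the function name `split_head_tail`, the `min_volume` threshold and the sample queries are assumptions made for the example, and the lowercase normalization is deliberately naive.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def split_head_tail(
    queries: Iterable[str], min_volume: int
) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Split queries into common ("head") and long tail groups by volume.

    min_volume is a hypothetical cutoff: queries seen at least that many
    times count as common; all others fall into the long tail.
    """
    counts = Counter(q.strip().lower() for q in queries)
    head = {q: n for q, n in counts.items() if n >= min_volume}
    tail = {q: n for q, n in counts.items() if n < min_volume}
    return head, tail


log = [
    "golden gate bridge",
    "Golden Gate Bridge",
    "golden gate bridge",
    "replace gasket on 1987 espresso machine",
    "bus timetable route 12 sunday",
]
head, tail = split_head_tail(log, min_volume=3)
print(head)  # queries with enough volume for aggregate analysis
print(tail)  # individualistic queries with too little data per query
```

Each long tail query here appears once, so per-query aggregate statistics of the kind available for the common term have almost nothing to work with.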