Many attempts have been made to extract knowledge from web pages. These attempts have been motivated, in part, by the vast amount of information that web pages provide about a wide range of objects and events. The attempts can be classified as either structure-based extraction or content-based extraction. Structure-based extraction attempts to identify sets of web pages corresponding to objects and events based on web site structure (e.g., a hierarchy of web pages) and hyperlink structure (e.g., linked-to web pages). Content-based extraction attempts to identify information corresponding to objects and events by segmenting and categorizing the content of web pages into groups using various techniques such as natural language processing and probability models.
Many search engine services, such as Google and Overture, provide for searching for information that is accessible via the Internet and collect a vast amount of information related to searching by users. These search engine services allow users to search for display pages, such as web pages, that may be of interest to users. After a user submits a search request (i.e., a query) that includes search terms, the search engine service identifies web pages that may be related to those search terms. To quickly identify related web pages, the search engine services may maintain a mapping of keywords to web pages. This mapping may be generated by “crawling” the web (i.e., the World Wide Web) to identify the keywords of each web page. The search engine services identify related web pages based on how similar the keywords of the web pages are to the search terms of the query. The search engine services may generate a relevance score to indicate how relevant the information of the web page may be to the search request based on similarity of keywords, on web page importance or popularity, and so on. The search engine services then display to the user links to those web pages in an order that is based on a ranking determined by their relevance.
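The keyword-to-page mapping and relevance ranking described above can be illustrated with a minimal sketch. It assumes a relevance score equal to the fraction of query terms found among a page's keywords. The function names, the example URLs, and the scoring rule are hypothetical choices made for illustration; an actual search engine service may also weigh page importance or popularity, which this sketch omits.

```python
from collections import defaultdict


def build_keyword_index(pages):
    """Map each keyword to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for keyword in text.lower().split():
            index[keyword].add(url)
    return index


def search(index, query):
    """Rank pages by the fraction of distinct query terms they contain."""
    terms = set(query.lower().split())
    if not terms:
        return []
    matches = defaultdict(int)
    for term in terms:
        # .get avoids creating empty entries in the defaultdict
        for url in index.get(term, ()):
            matches[url] += 1
    scored = [(url, count / len(terms)) for url, count in matches.items()]
    # Highest score first; ties broken by URL for a stable order
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical pages standing in for crawled web pages
pages = {
    "http://example.com/commission": "911 commission report",
    "http://example.com/movie": "fahrenheit 911 movie release",
    "http://example.com/emergency": "911 emergency infrastructure",
}
index = build_keyword_index(pages)
results = search(index, "911 movie")
```

Here the page containing both query terms is ranked first, and the two pages containing only "911" follow with a lower score.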
The search engine services collect information that includes click-through data. Query-based click-through data represents user selection of a link to a page from a search result for a query. For example, if a user submits the query “911,” a search engine service may provide a web page of the search result that includes links to web pages relating to the 9-11 Commission, to the movie named Fahrenheit 9/11, and to the 911 emergency infrastructure. When a user submits the query, the search engine service may log an indication that the user submitted the query “911.” When the user then selects a link from the search result, the search engine service may log an indication that the user selected that link (i.e., a click-through event). The search engine service can then analyze the log to match the selection of links to queries (e.g., via session identifiers or IP addresses) and store click-through data that includes query-page pairs along with a time (e.g., click-through time). For example, a query-page pair may have the query “911” and the URL to the official web page of the movie Fahrenheit 9/11 with a click-through time of 12:00:00 on Jul. 3, 2004.
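The log analysis described above can be sketched as follows, assuming each log entry carries a session identifier, a timestamp, an entry type, and a value. The entry layout, the function name, and the example URL are hypothetical; the sketch pairs each click with the most recent query in the same session, which is only one plausible way to match selections to queries.

```python
from datetime import datetime


def extract_click_through(log):
    """Pair each click with the latest preceding query in the same session."""
    last_query = {}  # session identifier -> most recent query string
    pairs = []
    for entry in sorted(log, key=lambda e: e["time"]):
        if entry["type"] == "query":
            last_query[entry["session"]] = entry["value"]
        elif entry["type"] == "click":
            query = last_query.get(entry["session"])
            if query is not None:  # ignore clicks with no known query
                pairs.append((query, entry["value"], entry["time"]))
    return pairs


# Hypothetical log entries; the URL is a placeholder, not the real page
log = [
    {"session": "s1", "time": datetime(2004, 7, 3, 11, 59, 50),
     "type": "query", "value": "911"},
    {"session": "s2", "time": datetime(2004, 7, 3, 11, 59, 55),
     "type": "query", "value": "weather"},
    {"session": "s1", "time": datetime(2004, 7, 3, 12, 0, 0),
     "type": "click", "value": "http://example.com/fahrenheit911"},
]
pairs = extract_click_through(log)
```

Each resulting tuple holds a query, the selected page, and the click-through time, mirroring the query-page pair with a time described above.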
Identifying events and the web information relating to those events can be useful in various applications. For example, current web page classification hierarchies are typically based on a subject matter taxonomy. In certain circumstances, it may be useful to have classifications that explicitly correspond to events. For example, it may be useful to have a classification of “release” for pages relating to the release of a movie. Although structure-based extraction and content-based extraction have met with some success in certain applications, such as organizing a web site structure, restructuring search results, and detecting terrorism events, these extraction techniques have not been able to effectively detect the relationships between web pages and events that occur over time.