Computers are useful for storing and providing access to large amounts of information. The explosive growth of the Internet has provided access to a tremendous amount of information from an extremely wide variety of sources. The Internet comprises computers and data networks interconnected through data communication links. The World Wide Web (“Web”) portion of the Internet allows a server computer to send graphical web pages to a remote client computer. The remote client computer typically displays received web pages using a browser application (e.g., Mozilla Firefox or Microsoft Internet Explorer). To request a web page, a client computer specifies a Uniform Resource Locator (URL) of the web page in a request (e.g., a Hypertext Transfer Protocol (“HTTP”) request). The request is forwarded to, and received by, a web server capable of furnishing the requested web page. When that web server receives the request, it sends the specified web page to the client computer.
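By way of illustration only, the following minimal sketch shows how a client might turn a URL into the text of an HTTP request. The function name build_http_get and the example URL are hypothetical and are not drawn from any particular browser; a real browser sends additional headers and handles many more cases.

```python
from urllib.parse import urlsplit


def build_http_get(url: str) -> str:
    """Return the text of a minimal HTTP/1.1 GET request for `url`."""
    parts = urlsplit(url)
    # The request target is the path plus any query string; default to "/".
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    lines = [
        f"GET {target} HTTP/1.1",
        f"Host: {parts.netloc}",   # identifies the web server being addressed
        "Connection: close",
        "",                        # blank line terminates the header section
        "",
    ]
    return "\r\n".join(lines)


# Hypothetical URL used only for illustration.
request_text = build_http_get("http://www.example.com/news/index.html?page=2")
print(request_text)
```

In this sketch, the host portion of the URL selects the web server, and the path and query portion identify the requested web page on that server.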
The Web comprises millions of “web sites” with each site having a number of web pages. Each web site comprises one or more server computers for responding to requests from client computers for web pages. Some web sites provide web pages or web page content to client computers based on the web pages of other web sites. For example, a search engine is a type of web site that indexes information available on the Web. Typically, a search engine web site operates by returning, in response to receiving a search query from a client computer, a search result web page that lists links to the web pages of other web sites that best match the query.
As another example of a web site that provides content based on a web page of another web site, an advertising web site may provide one or more advertisements to be displayed on a web page that is served by another web site. For example, the other web site may serve a web page to a client computer containing code which, when executed by a browser application on the client computer, causes the browser application to send a request to the advertising web site. The request may specify the URL of the web page served by the other web site. In response to receiving the request, the advertising web site may return advertising content to be displayed by the browser application in conjunction with display of the web page from the other web site.
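The exchange described above can be sketched as follows. This is an assumption-laden illustration: the advertising endpoint ads.example.com, the query parameter name page, and both function names are hypothetical, chosen only to show one way the URL of the serving web page might be carried in a request to an advertising web site.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical advertising endpoint; not a real service.
AD_SERVER = "http://ads.example.com/serve"


def build_ad_request(page_url: str) -> str:
    """Client side: embed the URL of the displayed page in the ad request."""
    # urlencode percent-encodes the page URL so it is safe inside a query string.
    return AD_SERVER + "?" + urlencode({"page": page_url})


def page_url_from_ad_request(ad_request_url: str) -> str:
    """Server side: recover the URL of the page the advertisement is for."""
    query = urlsplit(ad_request_url).query
    return parse_qs(query)["page"][0]


ad_request = build_ad_request("http://news.example.org/sports/story.html?id=7")
recovered = page_url_from_ad_request(ad_request)
print(ad_request)
print(recovered)
```

Note that, in this sketch, the advertising web site receives only the URL of the web page, not the content of the web page itself.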
A web site may take one or more actions based on the web pages of other web sites. For example, a search engine web site, when creating an index of web pages accessible on the Web, may extract attributes (e.g., text, graphics, or images) from the web pages of other web sites so that the extracted attributes can be displayed in search result web pages. The web site may extract relevant attributes from web pages based on, for example, the application of a set of content extraction rules to the content of the web pages. However, in many cases, because of the diversity of web page content and layout, it is difficult to compose a set of content extraction rules that extract the appropriate information from all web pages. Further, the Web comprises many millions of web pages. More and more web pages become accessible on the Web every day. Thus, applying all content extraction rules to all web pages may not be practical.
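As a rough illustration of the difficulty, the following sketch applies a small set of content extraction rules to two web pages. The rules, the rule names, and the sample pages are all hypothetical, and regular expressions are used here only for brevity; they are not presented as how any particular search engine extracts attributes.

```python
import re

# Hypothetical content extraction rules: attribute name -> pattern.
EXTRACTION_RULES = {
    "title": re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL),
    "heading": re.compile(r"<h1[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL),
    "price": re.compile(r'<span class="price">(.*?)</span>', re.IGNORECASE | re.DOTALL),
}


def apply_rules(html: str) -> dict:
    """Apply every rule to the page and keep only the rules that matched."""
    extracted = {}
    for name, pattern in EXTRACTION_RULES.items():
        match = pattern.search(html)
        if match:
            extracted[name] = match.group(1).strip()
    return extracted


# Two hypothetical pages with different layouts.
product_page = (
    "<html><head><title>Acme Kettle</title></head><body>"
    '<h1>Acme Kettle</h1><span class="price">$24.99</span></body></html>'
)
blog_page = (
    "<html><head><title>My Trip</title></head><body>"
    "<p>No heading or price here.</p></body></html>"
)

product_attrs = apply_rules(product_page)
blog_attrs = apply_rules(blog_page)
print(product_attrs)
print(blog_attrs)
```

In this sketch every rule is evaluated against every page even though only one of the three rules yields anything for the second page, which is the cost that grows impractical as the number of rules and web pages increases.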
In addition to or instead of applying content extraction rules to the web pages of other web sites, a web site may provide content to be displayed on a specified web page served by another web site. As an example, an advertising web site may need to determine which of many possible advertisements to display on a web page of another web site. According to one possible solution, an advertising web site, in response to receiving a request specifying the URL of the web page on which advertisements are to be displayed, could retrieve the web page from the other web site and apply one or more content extraction rules to determine which of the many possible advertisements should be displayed on the web page. However, this solution is time-consuming, as it requires the advertising web site to connect over a network to the other web site, retrieve the web page from the other web site, and apply the content extraction rules to the web page. The amount of time needed to do this may be too great in the context of responding to a request for advertising content. Further, in the context of many concurrent requests for advertising content, this solution may be too resource-intensive for the advertising web site. Further, as mentioned above, not every content extraction rule may be applicable to every web page.
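The retrieve-and-analyze solution described above might look like the following sketch. The advertisement inventory, the keyword matching, and the in-memory stand-in for the network are all hypothetical; the fetch function is passed in as a parameter so that the example runs without any network access.

```python
from typing import Callable

# Hypothetical inventory: keyword found in page content -> advertisement.
ADS = {
    "kettle": "Ad: Acme Kettles 20% off",
    "travel": "Ad: Cheap flights",
}
DEFAULT_AD = "Ad: Generic banner"


def select_ad_naively(page_url: str, fetch: Callable[[str], str]) -> str:
    """Retrieve the page named in the ad request, then analyze its content."""
    html = fetch(page_url).lower()  # in practice, a network round trip
    for keyword, ad in ADS.items():
        if keyword in html:
            return ad
    return DEFAULT_AD


# In-memory stand-in for other web sites, used only for illustration.
FAKE_WEB = {
    "http://shop.example.com/kettle.html": "<h1>Stainless Kettle</h1>",
    "http://blog.example.com/trip.html": "<p>Travel notes from Lisbon</p>",
    "http://news.example.com/": "<p>Local election results</p>",
}
fetch_count = 0


def fake_fetch(url: str) -> str:
    """Return the stored page and count how many retrievals were needed."""
    global fetch_count
    fetch_count += 1
    return FAKE_WEB[url]


chosen = [select_ad_naively(url, fake_fetch) for url in FAKE_WEB]
print(chosen, fetch_count)
```

In this sketch every request for advertising content triggers one page retrieval plus a content scan, so the cost to the advertising web site scales directly with the number of concurrent requests.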
What is needed, then, is a solution that enables a web site to efficiently determine which subset of a set of actions applies to a specified web page. Specifically, the solution should enable a web site to make this determination without having to analyze the content of the web page. The present invention provides a solution for these and other needs.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.