Increasingly, web sites store data in database tables and dynamically generate web pages for presentation to a user by querying that data. For example, a web page may include a portlet that derives selection criteria from the user's interactions with other portlets within the web page and dynamically obtains content for display to the user. However, dynamic web pages pose a potential problem for Web crawlers, which search engines use to obtain data for indexing web sites. In particular, a Web crawler may refuse to crawl a dynamic web site because of the risk that it will end up in a request loop that prevents it from moving on to other web pages (e.g., because state information is carried in a cookie or encoded in the URL). As a result, search engines often do not index dynamic web pages, which reduces both the effectiveness of the search engine and the ability of the web site to attract new users.
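The request-loop risk can be illustrated with a minimal sketch. It assumes, purely for illustration, that a dynamic site appends a fresh session value to every link, so the same page appears under a new URL on each visit. The parameter names in SESSION_PARAMS and the helpers strip_state and should_fetch are hypothetical and are not taken from any particular Web crawler.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical names of query parameters that carry per-session state.
SESSION_PARAMS = frozenset({"sid", "sessionid", "phpsessid"})


def strip_state(url):
    """Return the URL with session-like query parameters and fragment removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def should_fetch(url, seen):
    """Return True only the first time a page is offered, ignoring session state."""
    key = strip_state(url)
    if key in seen:
        return False
    seen.add(key)
    return True
```

Without some normalization of this kind, two URLs that differ only in the session value would be treated as two different pages, and a crawler could keep requesting the same content indefinitely. A crawler that cannot reliably recognize such state may simply decline to crawl dynamic URLs at all.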
The problem is compounded for web sites that include protected data. In this case, the web site may only be available using a security protocol, such as Hypertext Transfer Protocol Secure (HTTPS), and/or may require a login. The use of a security protocol and/or login enables the content provided to the user to be filtered and/or customized based on the identity of the user. However, since a Web crawler typically has no ability to be authenticated, it will often bypass web sites that include protected data.
For numerous applications, it is desirable that a Web crawler be able to crawl a web site that includes dynamic protected data. To this extent, the web site may include public data that is desirable to have indexed by a search engine for presentation to users in response to search requests. For example, a merchant may have a pricing structure that varies based on the customer. In this case, the merchant may want to have its product offerings and/or descriptions indexed while the corresponding pricing for the products remains protected. Similarly, a content provider may require registration to view its content. However, the content provider may want summaries of the content included by the search engine to increase traffic to the content provider's web site. In one proposed solution, the merchant and/or content provider pays the search engine to include certain content and links to its web site.
In the more general area of responding to Web crawler requests, some web sites have attempted to “cloak” the content provided to Web crawlers. In particular, when the web site determines that a request has been received from a Web crawler, the web site provides alternative content for processing by the Web crawler. Frequently, the alternative content is designed to make the web site appear higher in the results list of a search engine that uses the Web crawler than it would if the actual content were provided. Subsequently, when a user chooses to visit the web site via the search engine, the actual web page is provided to the user. In general, search engine operators do not approve of web sites that cloak content, and a web site may be excluded from processing by the search engine's Web crawler if it is determined that the web site is cloaking its content.
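As an illustration only, the cloaking behavior described above can be sketched as follows. The sketch assumes that the web site identifies a Web crawler by looking for a known token in the User-Agent request header, which is one common approach but is not stated in the description above. The token list and the functions is_crawler and select_content are hypothetical.

```python
# Illustrative substrings found in the User-Agent headers of some crawlers.
CRAWLER_TOKENS = ("googlebot", "bingbot", "slurp")


def is_crawler(user_agent):
    """Guess whether a request comes from a Web crawler, based on its User-Agent."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def select_content(user_agent, actual, alternative):
    """Cloaking: serve the alternative content to crawlers, the actual page to users."""
    return alternative if is_crawler(user_agent) else actual
```

Because the content indexed by the search engine then differs from the content a user receives, this is the practice that search engine operators generally penalize.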
In view of the foregoing, there exists a need in the art to overcome one or more of the deficiencies indicated herein and/or one or more other deficiencies not expressly discussed herein.