The World Wide Web (WWW) comprises an expansive network of interconnected computers on which businesses, governments, groups, and individuals throughout the world maintain inter-linked computer files known as web pages. Users navigate these pages with computer software programs commonly known as Internet browsers. Because of the vast number of WWW sites, many web pages contain redundant information or closely resemble one another in function or title. The vastness of the unstructured WWW leads users to rely primarily on Internet search engines to retrieve information or to locate businesses. These search engines use various means to determine the relevance of the retrieved information to a user-defined search.
The authors of web pages provide information, known as metadata, within the body of the hypertext markup language (HTML) document that defines the web page. A computer software product known as a web crawler systematically accesses web pages by sequentially following hypertext links from page to page. The crawler indexes the pages for use by the search engines, drawing on information about each web page such as its address or Uniform Resource Locator (URL), its metadata, and other criteria found within the page. The crawler is run periodically to update previously stored data and to append information about newly created web pages. The information compiled by the crawler is stored in a metadata repository or database, which the search engines query to identify matches for a user-defined search rather than attempting to find matches in real time.
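By way of illustration, the extraction of a page's metadata for such a repository can be sketched with Python's standard-library HTML parser; the page content and URL below are hypothetical examples, not part of any particular crawler implementation.

```python
from html.parser import HTMLParser

class MetadataExtractor(HTMLParser):
    """Collects the <title> text and <meta> name/content pairs from a page."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# Hypothetical page; a real crawler would retrieve this over HTTP.
page = ('<html><head><title>Acme Widgets</title>'
        '<meta name="description" content="Widgets for sale"></head>'
        '<body>Catalog</body></html>')
parser = MetadataExtractor()
parser.feed(page)
# Store the extracted metadata keyed by the page's URL.
repository = {"http://example.com/": {"title": parser.title, "meta": parser.meta}}
```

A crawler would repeat this for each page reached by following hypertext links, accumulating entries in the repository.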
A conventional crawling operation 200 will now be briefly explained in connection with FIG. 2. A typical web crawler starts at block 201 by performing two main operations in order to execute the crawling process: the access and retrieval of a document (block 202), followed by the analysis phase of the document, also called the summarization process (block 204). Today's web crawlers may also be able to access dynamically generated documents, that is, documents generated through executable code (e.g., CGI using Perl, ASP, C, or C++) on the web server.
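The two-phase access/retrieval and summarization process described above can be sketched as follows; a hypothetical in-memory dictionary of pages stands in for actual HTTP retrieval, and the regex-based link and tag handling is a deliberate simplification.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> HTML, standing in for HTTP retrieval.
PAGES = {
    "http://a.example/": '<a href="http://b.example/">B</a> alpha text',
    "http://b.example/": '<a href="http://a.example/">A</a> beta text',
}

def retrieve(url):
    """Access/retrieval phase (block 202): fetch the raw document."""
    return PAGES.get(url, "")

def summarize(html):
    """Analysis/summarization phase (block 204): strip tags, keep text."""
    text = re.sub(r"<[^>]+>", " ", html)
    return " ".join(text.split())[:80]

def crawl(seed):
    """Follow hypertext links from page to page, building an index."""
    index, queue, seen = {}, deque([seed]), set()
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        html = retrieve(url)
        index[url] = summarize(html)
        # Enqueue linked pages for subsequent crawling.
        queue.extend(re.findall(r'href="([^"]+)"', html))
    return index

index = crawl("http://a.example/")
```

Running the crawl periodically, as the text notes, would refresh stale entries and append newly discovered pages.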
Oftentimes, web designers embed executable client-side software code in dynamic documents, so that the code will eventually be replaced with, or will generate, content on the client side. Examples include code that computes results based on user input, or that emits specific text depending on the version of the client's web browser. More generally, dynamic documents rely on a web browser's capabilities to:
a) retrieve additional documents (block 206) as needed or required, such as frames, in-line images, audio, video, applets, or equivalents;
b) execute client-side script and code (block 208), such as JavaScript® or equivalents;
c) furnish a fault-tolerant HTML filter that recognizes the various HTML standards, interprets HTML markup errors, and unscrambles content that a web designer has purposefully scrambled in order to thwart crawling and other programmatic analysis methods, to produce a final HTML markup (block 210); and
d) integrate all the previously obtained results to render the document (block 212) for presentation to a user (block 214).
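The fault-tolerant filtering of item c) can be illustrated with Python's standard-library parser, which tolerates markup errors such as unclosed or mismatched tags rather than rejecting the document; the malformed page below is a made-up example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Tolerant text extraction: html.parser does not reject markup errors
    such as unclosed or mismatched tags, so page text is still recovered."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

# Malformed markup: unclosed <p> and <div> elements, missing </html>.
broken = "<html><body><p>Price: $9.99<p>In stock<div></body>"
extractor = TextExtractor()
extractor.feed(broken)
text = " ".join(extractor.chunks)
```

A browser's rendering engine applies far more elaborate error recovery, but the principle is the same: the content survives the markup errors.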
A typical search engine has an interface with a search window where the user enters an alphanumeric search expression or keywords. The search engine sifts through available web sites for the user's search terms and returns the search results in the form of HTML pages. Each search result includes a list of individual entries that the search engine has identified as satisfying the user's search expression. Each entry or “hit” may include a hyperlink that points to a Uniform Resource Locator (URL) location or web page.
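A minimal sketch of matching a user's keywords against a pre-built repository, rather than searching the live web in real time, might look as follows; the repository entries and the `search` helper are illustrative assumptions, not any particular engine's implementation.

```python
# Hypothetical repository built by a prior crawl: URL -> page summary.
repository = {
    "http://a.example/widgets": "Acme widget catalog and prices",
    "http://b.example/gadgets": "Gadget reviews and comparisons",
}

def search(query):
    """Return entries whose summary contains every search term."""
    terms = query.lower().split()
    hits = []
    for url, summary in repository.items():
        if all(term in summary.lower() for term in terms):
            # Each hit carries the hyperlink (URL) pointing at the page.
            hits.append({"url": url, "summary": summary})
    return hits

results = search("widget prices")
```

An actual engine would rank hits by relevance and render them as HTML result pages, but the repository lookup shown here is the core of answering queries without real-time crawling.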
Current web technology is being increasingly used for publishing and delivering mission-critical information to consumers, customers, suppliers, and other entities. The extreme ease with which web data can be formed and published has been instrumental to the success and rapid adoption of the Internet as a preferred communication platform.
As explained herein, crawlers have been developed to automatically retrieve data from various web sites. The retrieved data may be used internally (e.g., competitive analysis) or externally (e.g., news feed aggregation). Crawlers can pose concerns to companies that publish their products and services on their web sites: these companies wish to make the data available to customers while excluding third parties that harvest the published data in order to entice customers away. Price data in particular constitute sensitive information and a primary source of contention, since these data change frequently and can be the foundation of a price-leadership strategy.
There is currently no adequate mechanism by which the content of web pages can be protected from invading crawlers, without impacting the rendering of the web content to legitimate customers. This problem is further exacerbated by the difficulty in detecting crawlers and discriminating between crawler and web browser requests. The need for such a mechanism and corresponding process has heretofore remained unsatisfied.
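One naive approach to discriminating crawler from browser requests is to inspect the request's User-Agent header, as sketched below; the token list is a hypothetical example, and the ease with which the header can be spoofed illustrates why the text characterizes such detection as difficult.

```python
# Hypothetical token list; real crawlers may advertise other names or none.
KNOWN_CRAWLER_TOKENS = ("bot", "crawler", "spider", "slurp")

def looks_like_crawler(user_agent):
    """Naive heuristic: flag requests whose User-Agent contains a known
    crawler token. A crawler that sends a browser-like User-Agent string
    defeats this check entirely, which is the limitation noted above."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in KNOWN_CRAWLER_TOKENS)

flag_bot = looks_like_crawler("Googlebot/2.1")
flag_browser = looks_like_crawler("Mozilla/5.0 (Windows)")
```

Because this heuristic neither catches disguised crawlers nor protects the content itself, it falls short of the mechanism whose absence the passage describes.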