Internet search engines store information from a vast array of web pages retrieved from the Internet, typically by means of spiders or crawlers. To facilitate the search process, Internet search engines provide interfaces for running queries against the indices they build from this information. Generally speaking, Internet search engines build these indices by collecting URLs and following each URL on each page until all URLs for all web pages have been exhausted. During this process, the contents of each web page are analyzed according to various and evolving criteria to determine how particular elements (e.g., titles, headings, files, links, various metadata, and the like) and other related information should be indexed. This index allows relevant information to be found quickly, and across a broad range of web pages, from a single source.
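By way of illustration only, the crawl-and-index process described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular search engine: the in-memory `PAGES` mapping stands in for pages that a real crawler would fetch over the network, and the names `PageParser`, `crawl_and_index`, and `search` are hypothetical. The index here is a simple word-to-URL mapping with no ranking criteria.

```python
from collections import defaultdict, deque
from html.parser import HTMLParser
import re

# Hypothetical in-memory "web" (URL -> HTML). A real crawler would fetch
# each page over HTTP; this mapping is an assumption for illustration.
PAGES = {
    "http://a.example/": '<title>Alpha home</title>'
                         '<a href="http://a.example/news">news</a> '
                         '<a href="http://b.example/">b</a>',
    "http://a.example/news": '<title>Alpha news</title><p>Crawler updates</p>'
                             '<a href="http://a.example/">home</a>',
    "http://b.example/": '<title>Beta</title><p>Search engines index pages</p>',
}


class PageParser(HTMLParser):
    """Collects the outgoing links and the text content of one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text.append(data)


def crawl_and_index(seed, pages):
    """Follow links breadth-first from `seed` until no unseen URL remains,
    building an inverted index that maps each word to the URLs containing it."""
    index = defaultdict(set)
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:  # unreachable page: skip it
            continue
        parser = PageParser()
        parser.feed(html)
        parser.close()
        for word in re.findall(r"[a-z0-9]+", " ".join(parser.text).lower()):
            index[word].add(url)
        for link in parser.links:
            if link not in seen:  # each URL is visited at most once
                seen.add(link)
                queue.append(link)
    return index


def search(index, term):
    """Run a single-word query against the index."""
    return sorted(index.get(term.lower(), set()))
```

A query such as `search(crawl_and_index("http://a.example/", PAGES), "engines")` then looks up the term in the index rather than rescanning the pages, which is what allows a lookup to remain fast as the number of indexed pages grows.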
The automated collection of data available on the Internet is a complicated task. According to U.S. Pat. No. 7,647,351, there is only one primary known means of automatically retrieving information from a web site without the assistance of the web site owner: utilizing the hidden mark-up language of the web site to correlate useful data. Theoretically, this mark-up can help a computer algorithm locate, process, and interpret information on and about a page. As further noted by the '351 patent, “unfortunately, every Web site has a different look and feel, so each Web page needs its own custom algorithm. Writing a custom algorithm is time-intensive, but possible on a small scale, such as a price comparison website which gathers product information from a dozen sources. But there is no efficient way to scale this approach up to thousands or millions of Web sites, which would require thousands or millions of custom algorithms to be written.” The '351 patent attempts to solve data conformity problems by the use of a manually set up template for each web page with a unique look and feel.
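The scaling problem noted in the quoted passage can be illustrated with a brief sketch. The following is not the method of the '351 patent; it is a hypothetical example in which the site names, mark-up patterns, and the `extract_price` function are all assumptions. It shows only that two sites presenting the same kind of data with different mark-up each require a separately hand-written pattern.

```python
import re

# Hypothetical per-site templates. Each site marks up its price differently,
# so each needs its own manually written pattern (assumed for illustration).
TEMPLATES = {
    "shop-a.example": re.compile(r'<span class="price">\$([0-9.]+)</span>'),
    "shop-b.example": re.compile(r'<td id="cost">USD ([0-9.]+)</td>'),
}


def extract_price(site, html):
    """Return the price found on a page from `site`, or None when the site
    has no template or its template does not match the page."""
    pattern = TEMPLATES.get(site)
    if pattern is None:
        return None  # no custom template has been written for this site
    match = pattern.search(html)
    return float(match.group(1)) if match else None
```

Every additional site requires a new entry in `TEMPLATES`, and any site whose layout changes silently stops matching, which is why an approach of this kind is workable for a dozen sources but not for thousands or millions.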
In fact, a computer system seeking to process resources (e.g., web pages, news feeds, PDF documents) available on the Internet is faced with an earlier problem: locating those resources in the first place. In some circumstances, the particular Internet locations of the resources to be processed and interpreted are known a priori (i.e., this resource and that resource, located at these URLs) and can be accessed accordingly. In others, no such knowledge exists, except in the abstract (i.e., it is suspected that the information is available somewhere, but it is not known specifically where).
Therefore, there is a need for a flexible Internet data search process that can meaningfully analyze and interpret data from disparate Internet resources, without accessing those resources directly, and without foreknowledge of the existence or locations of such resources.