1. Field of the Invention
This invention relates to computer networks, and more particularly to a system and method for providing a distributed information discovery platform that enables discovery of information from distributed information providers.
2. Description of the Related Art
It has been estimated that the amount of content contained in distributed information sources on the public web is over 550 billion documents. In comparison, leading Internet search engines may be capable of searching only about 600 million pages out of an estimated 1.2 billion “static pages.” Due to the dynamic nature of Internet content, much of the content is unsearchable by conventional search means. In addition, the amount of content unsearchable by conventional means is growing rapidly with the increasing use of application servers and web-enabled business systems.
Crawlers currently may take three months or more to crawl and index the web (according to figures published by Google), so conventional, crawler-based search engines such as Google may perform best when indexing static, slowly changing web pages such as home pages or corporate information pages. Targeted or restricted crawling of headlines or other metadata is possible (such as that done by moreover.com), but this limits search ability. Web resources that do not have a “page of contents” or similar index (“deep” web resources) may be more difficult to search, index, or reference by conventional crawler-based search engines. For example, Amazon.com contains millions of product descriptions in its databases but does not have a set of pages listing all these descriptions. As a result, in order to crawl such a resource, it may be necessary, though difficult, to query the database repeatedly with every conceivable query term until all products are extracted. Likewise, many web pages are generated dynamically based on information about the consumer or the context of the query (time, purchasing behavior, location, etc.), so a crawler approach is likely to distort such data. In some situations, content may be inaccessible due to access privileges (e.g. a subscription site) or for security reasons (e.g. a secure content site).
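The exhaustive-query approach described above can be illustrated with a minimal sketch. The function `query_product_database` and the list `VOCABULARY` are hypothetical stand-ins (a real crawler would issue HTTP requests to a site's search form); the sketch only shows why coverage depends on guessing every conceivable query term.

```python
# Hypothetical stand-in for a site search that returns product IDs
# matching a query term; a real crawler would query the live site.
def query_product_database(term):
    fake_index = {
        "book": {"p1", "p2"},
        "camera": {"p3"},
        "lens": {"p3", "p4"},
    }
    return fake_index.get(term, set())

# Assumed vocabulary of "every conceivable query term".
VOCABULARY = ["book", "camera", "lens", "tripod"]

def extract_all_products(vocabulary):
    """Union the results of every query term in the vocabulary.

    Coverage is only as good as the vocabulary, and many queries
    return overlapping or empty result sets -- the inefficiency
    noted above.
    """
    seen = set()
    for term in vocabulary:
        seen |= query_product_database(term)
    return seen

print(sorted(extract_all_products(VOCABULARY)))  # → ['p1', 'p2', 'p3', 'p4']
```

Any product never matched by a vocabulary term is simply never extracted, which is why such "deep" resources resist crawler-based indexing.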
Conventional search mechanisms also may be less efficient than desirable with regard to some types of information providers, for example in accessing dynamic content from a news site. A current news provider may provide content created by editors and stored in a database in XML or another presentation-neutral form. The news provider's application server may render the content as a web page with associated links using templates. Although the end user sees a well-presented page with the relevant information, for a crawler-type search engine to extract the content of the HTML page, it must be programmed with information about the structure of the page so that it can “scrape” the content and headline from the page. It may then store this content, or a processed version of it, for indexing purposes in its own database, and retrieve the link and story when a matching query is submitted. This search process is inherently inefficient and prone to errors. In addition, it gives the content provider no control over the format of the article or the decision about which article to show in response to a query.
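The "scraping" step described above can be sketched using only the Python standard library. The assumption that the headline lives in an `<h1 class="headline">` element is hypothetical; each site requires its own such structural rule, which is what makes the approach brittle.

```python
# A minimal scraping sketch: extract a headline from rendered HTML
# using hard-coded knowledge of the page's structure (an assumption
# for illustration -- real sites each need their own rule).
from html.parser import HTMLParser

class HeadlineScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self._in_headline = False
        self.headline = None

    def handle_starttag(self, tag, attrs):
        # Structural rule: the headline is an <h1 class="headline">.
        if tag == "h1" and ("class", "headline") in attrs:
            self._in_headline = True

    def handle_data(self, data):
        if self._in_headline and self.headline is None:
            self.headline = data.strip()

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_headline = False

rendered_page = (
    '<html><body><h1 class="headline">Markets Rally</h1>'
    '<p>Story text...</p></body></html>'
)
scraper = HeadlineScraper()
scraper.feed(rendered_page)
print(scraper.headline)  # → Markets Rally
```

If the provider changes its template (say, to `<h2 class="title">`), the rule silently stops matching and the scraper extracts nothing, illustrating why the process is error-prone and leaves the provider no control over how its content is extracted.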
It would be desirable for a search mechanism for the web to perform both “deep searches” and “wide searches.” A “deep search” may find information embedded in large databases such as product databases (e.g. Amazon.com) or news article databases (e.g. CNN). A “wide search” may reach a large distribution of information sources. Moreover, it would be desirable for the search mechanism to use bandwidth efficiently and maximize search speed while avoiding bottlenecks. It would also be desirable for a search mechanism to function over an expanded web covering a wide array of distributed devices (e.g. PCs, handheld devices, PDAs, cell phones, etc.).