Information retrieval (IR), of which search engines are a prominent embodiment, strives to provide relevant results to satisfy a searcher's information needs. To that end, a first requirement of an IR system is to have a reasonably good, or even complete, coverage of pertinent information, namely information that a searcher might plausibly find relevant. A related requirement is to have information that is as fresh as possible.
With respect to completeness, one of the biggest problems is the vast amount of data available. According to United States Patent Application 2006/0106792, Patterson, May 18, 2006, the Internet had more than 200 billion pages in early 200. Of those more than 200 billion web pages, the largest search engines were able to crawl only about 10 billion. The gap between the number of existing Web pages and the number a search engine can crawl will likely remain large, due to one aspect of how the Web works, namely the absence of a centralized index. At today's rate of growth in new pages, it is impractical or even impossible to have complete coverage of the entire Web.
With respect to freshness, one of the biggest problems is that crawling, the prevailing method for a search engine to gather information, introduces delays of up to days if not weeks. In addition, freshness is often dictated by which information is used the most often, not by how valuable freshness might be to the source of the information. For example, U.S. Pat. No. 6,763,362, McKeeth, Jul. 13, 2004, teaches a method for updating information contained in a search engine at least partially based on how often a piece of information is requested by searchers.
The Patterson and McKeeth applications, along with all other extrinsic materials discussed herein, are incorporated by reference in their entirety. Where a definition or use of a term in an incorporated reference is inconsistent or contrary to the definition of that term provided herein, the definition of that term provided herein applies and the definition of that term in the reference does not apply.
The need for completeness and freshness is especially acute when it comes to dealing with information that is time-sensitive, such as prices of products, availability of in-stock products, and arrival/departure information. A step in the right direction is illustrated by two kinds of online businesses. First, web hosting services, such as the hosting service offered by Web.com, allow a web site's owner to update content under various monthly plans. In this way, a web site's owner can frequently update information deemed time-sensitive. Second, current comparison shopping sites, such as ShopZilla.com and NexTag.com, allow merchants to feed information to the comparison shopping site several times a day. The information fed typically concerns products and their prices, and is thus time-sensitive. However, neither kind of business concerns itself with completeness of information.
In finding new methods that will lead to both completeness and freshness of information, especially for dealing with information that is time-sensitive, consider a pedagogical example: constructing an information system designed to provide parking information for downtown Santa Monica. On one side of the system, there are “publishers” of information. Typically a publisher is also the owner of a parking space. Information to be published includes, but is not restricted to, how many slots are available at a parking space and how many cars are looking for a spot, entering, or leaving. Some other information, however, does not necessarily have an “intrinsic” publisher, for example, street parking spots. On the other side of the system, there are “consumers” of information, who want to use the information for their parking needs at the moment (as in a driver in the area with wireless access to the system) or in an anticipated future (as in a driver leaving for the area). In this example, there are two notable aspects of the information in question: (i) it is practical to have complete coverage of the domain; and (ii) the information changes over time, and its usefulness is highly related to its freshness when it is made accessible to consumers.
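The parking example can be made concrete with a minimal sketch in Python. It is an illustration under stated assumptions only, not a described implementation: the names `ParkingReport`, `ParkingIndex`, `submit`, `fresh`, and `coverage`, the location identifiers, and the use of a simple age threshold as the measure of freshness are all hypothetical. The sketch shows a confined domain of known locations, reports arriving from publishers or from third parties, and a consumer-side view that keeps only sufficiently recent reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParkingReport:
    """One published observation about one parking location (hypothetical)."""
    location_id: str      # assumed identifier for a lot or a street segment
    available_slots: int  # slots reported free at the time of the report
    reported_at: float    # time of the report, in seconds on any shared clock
    source: str           # e.g. "publisher" (owner) or "third_party"


class ParkingIndex:
    """Keeps the most recent report for each location in a confined domain."""

    def __init__(self, domain):
        # The domain is the full, known set of locations to be covered.
        self.domain = set(domain)
        self.latest = {}

    def submit(self, report):
        """Accept a report if it is in the domain and not older than the stored one."""
        if report.location_id not in self.domain:
            raise ValueError(f"unknown location: {report.location_id}")
        current = self.latest.get(report.location_id)
        if current is None or report.reported_at >= current.reported_at:
            self.latest[report.location_id] = report

    def fresh(self, now, max_age):
        """Return the stored reports that are at most max_age seconds old."""
        return {
            loc: rep
            for loc, rep in self.latest.items()
            if now - rep.reported_at <= max_age
        }

    def coverage(self, now, max_age):
        """Fraction of the domain that currently has a fresh report."""
        if not self.domain:
            return 0.0
        return len(self.fresh(now, max_age)) / len(self.domain)


if __name__ == "__main__":
    index = ParkingIndex(["lot_a", "lot_b", "street_4th"])
    index.submit(ParkingReport("lot_a", 12, 100.0, "publisher"))
    index.submit(ParkingReport("street_4th", 2, 160.0, "third_party"))
    index.submit(ParkingReport("lot_a", 9, 170.0, "publisher"))
    # At time 225 with a 60-second threshold, only lot_a is still fresh.
    print(sorted(index.fresh(now=225.0, max_age=60.0)))
    print(index.coverage(now=225.0, max_age=60.0))
```

In this sketch, completeness corresponds to how much of the fixed domain has any usable report, and freshness corresponds to the age threshold applied when a consumer queries. A street parking location, which has no intrinsic publisher, is covered here by a third-party report.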
What is needed in this and other situations is a new framework for IR systems that addresses both completeness and freshness. A first measure would be to confine the “domain” of information so that it can be completely, or nearly completely, proactively collected (“crawling” when such information is contained on web sites). A second measure is to invite “publishers” (or equivalently, authors, owners, etc.) of information to submit it to IR systems, preferably in a verifiable way. A third measure is to invite objective third parties to submit corrections to information existing on IR systems, or to submit new information. A fourth measure is to allow flexibility in the granularity of the “unit” of information (e.g., with search engines, such a unit of information is a web page identified by a URL). With these measures, IR systems can achieve both freshness and completeness for a given domain.