The Internet has become an important computerized network, which can be accessed by computer users worldwide. Over the years, the number of Internet sites and the available content on the Internet have grown. Only in some cases does an Internet user already know the specific website address (e.g., www.______.com) to enter to access a desired website. Often Internet users want to retrieve certain subject matter, without being able to provide the Internet address(es) (i.e., the domain) at which such subject matter may be located. Such users want to have a search of the Internet performed for them to retrieve their desired subject matter, and conventionally they have done so via so-called “search engines”, such as www.google.com. For such searching, a user generally starts by visiting the search engine site (e.g., www.google.com), where the user encounters a field in which to enter the desired word(s) or phrase to search.
An Internet user querying a popular commercial search engine such as www.google.com or www.altavista.com will get back a listing of content that the search engine deems relevant to the query. While some search engines perform this task better than others, all known search engines before the present invention suffer from a common problem: a majority of the content they return is old. For example, with conventional search engines sometimes the most current content returned has been between one and three months old. This datedness is a real problem for some types of content, such as current events and news stories.
Current search engines work from an index to locate Web documents that satisfy specified search criteria. The index is a limiting factor for the search engine, i.e., if the index only has “old” content, the search engine can do no better in returning information in response to the user query.
Preparation and maintenance of such an index conventionally has been accomplished by a “web crawler”, which is a computer program that automatically retrieves numerous Web documents from one or more Web sites. A Web crawler processes the received data, preparing the data to be subsequently processed by other programs, for example a program that creates a search-engine-usable index of documents available on the Internet.
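The crawl-then-index flow described above can be sketched as follows. This is a minimal illustration only, not the method of any particular search engine: the in-memory `TOY_WEB` dictionary, the `.example` addresses, and the function names `crawl`, `build_index` and `search` are all hypothetical, and no real network access is performed.

```python
from collections import defaultdict, deque
import re

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.example/": ("fresh news about search engines", ["a.example/about", "b.example/"]),
    "a.example/about": ("about this news site", []),
    "b.example/": ("search tips and tricks", ["a.example/"]),
}

def crawl(seed_urls, fetch):
    """Breadth-first retrieval of every document reachable from the seeds."""
    seen, queue, documents = set(seed_urls), deque(seed_urls), {}
    while queue:
        url = queue.popleft()
        page = fetch(url)
        if page is None:  # unknown or unreachable document
            continue
        text, links = page
        documents[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return documents

def build_index(documents):
    """Inverted index: word -> set of URLs whose text contains the word."""
    index = defaultdict(set)
    for url, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every query word, in sorted order."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)

documents = crawl(["a.example/"], TOY_WEB.get)
index = build_index(documents)
```

In this sketch a query can only ever return what `documents` held at crawl time, so an index built from an old crawl yields old results regardless of how the query is phrased.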
Conventional crawlers have been proposed and are in use. However, the problem of returning current content continues to go unsolved. The fault in the current search engine systems of failing to return current content arises from a combination of two problems that have yet to be addressed.
The first problem is the slow scan rate at which search engines currently look for new and changed information on a network. The best conventional crawlers visit most web pages only about once a month. Because the scan rates of these conventional search engines are so slow, there is no way for them to capture a majority of the fresh content that is available. One reason why these crawlers scan so slowly is their dependence on a centralized crawling method in which all of the crawlers crawl from a small number of sites on the network. This set-up causes much of the downloaded information to traverse the same network pipe. Reaching a high network scan rate, on the order of once a day, with such an approach would be impractical: it would require an enormous amount of bandwidth flowing to a small number of locations on the network, at a cost that is not economically feasible. Due to such economics, most search engines have a scan rate much slower than once a day.
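The bandwidth argument can be made concrete with a back-of-the-envelope calculation. The figures below (one billion pages averaging 10,000 bytes each) are assumptions chosen purely for illustration and do not come from the text; the function name `required_bandwidth_mbps` is likewise hypothetical.

```python
def required_bandwidth_mbps(num_pages, avg_page_bytes, scan_period_days):
    """Sustained download rate (megabits per second) needed to fetch
    every page exactly once per scan period."""
    total_bits = num_pages * avg_page_bytes * 8
    seconds = scan_period_days * 24 * 60 * 60
    return total_bits / seconds / 1e6

# Assumed, illustrative figures only.
NUM_PAGES = 1_000_000_000
AVG_PAGE_BYTES = 10_000

monthly = required_bandwidth_mbps(NUM_PAGES, AVG_PAGE_BYTES, 30)
daily = required_bandwidth_mbps(NUM_PAGES, AVG_PAGE_BYTES, 1)
```

Under these assumptions, moving from a 30-day scan to a 1-day scan multiplies the sustained bandwidth by thirty (from roughly 31 to roughly 926 megabits per second), all of it converging on the few locations from which a centralized crawler operates.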
A second problem is that current search engines do not incorporate new content into their “rankings” very well. Conventional search engines use certain methods to arrive at an Authority Measure for a page. For example, Google's PageRank ranking technology depends on the number of links a to-be-ranked page has linking to it in order to decide on the weight of the to-be-ranked page. Because new content inherently does not have many links to it, it will not be ranked very high under Google's PageRank scheme or similar schemes. Thus, for certain search engines, even if the search engine identifies a site as having relevant content, if the website has few links to it because of its newness, the search engine will rank it low in the list of retrieved site addresses that the searching user views. Some search inquiries can return thousands of addresses or more in response to the request, so that being a low-ranked result of the search decreases the likelihood that the searcher will actually view the content.
Like Google, other conventional search engines also derive an Authority Measure for a page based on the number of links that point to the page. Thus normally an Authority Measure will be low for new pages. Newly created content, being new, is unknown to most people, and, not knowing about it, people have not put HTML (HyperText Markup Language) links in their documents pointing to it. Under conventional systems, new content maintains a low score, until more people find out about it and link to it.
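The effect of a link-count-based Authority Measure on new pages can be illustrated with a deliberately simplified sketch. It scores a page by its number of distinct inbound links; this is an assumption made for illustration and is not Google's actual PageRank computation or any specific engine's method. The link graph, the `.example` addresses, and the function names are hypothetical.

```python
from collections import Counter

def inlink_authority(link_graph):
    """Score each page by the number of distinct other pages linking to it."""
    scores = Counter({page: 0 for page in link_graph})
    for source, targets in link_graph.items():
        for target in set(targets):
            if target != source:  # ignore self-links
                scores[target] += 1
    return scores

def rank(relevant_pages, scores):
    """Order relevant pages by descending authority, ties broken by URL."""
    return sorted(relevant_pages, key=lambda p: (-scores[p], p))

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "old.example/story": ["hub.example/"],
    "hub.example/": ["old.example/story"],
    "blog.example/": ["old.example/story", "hub.example/"],
    "new.example/breaking": ["hub.example/"],  # just published; nobody links to it yet
}

scores = inlink_authority(LINKS)
ranking = rank(["new.example/breaking", "old.example/story"], scores)
```

Even if both pages are equally relevant to a query, the newly published page has no inbound links in this toy graph and is therefore placed below the older page.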
Search engines fall into the general category of web-mining applications. These applications collect and extract large amounts of data from the web for further processing. In the case of search engines, this further processing is the construction and maintenance of searchable indexes. Many other processing methods can be performed on this data. Examples of such applications include event notification systems, market analysis, corporate intelligence, etc. The field of data-mining is closely related to web-mining: in data-mining, data is usually processed from a database, whereas in web-mining, data is primarily processed from information on the web. The architecture of conventional web mining applications is shown in FIG. 5A. Conventional web mining applications use polling methods, in which the applications must continually poll the data available on the network (such as the Internet) to determine what is there and what has changed. Conventionally, all data/web mining applications, including search engines, corporate intelligence, etc., have used polling methods. Practically speaking, the conventional methods provide for visiting the pages in set lists now and then, and seeing what is in the pages. The amount of work to be done when using such polling methods is extraordinarily large.
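A polling pass of the kind described above might look like the following sketch. Change detection by comparing SHA-256 digests is an assumption made here for illustration; the text does not say how conventional applications detect changes. The `site` dictionary stands in for the network, and the function name `poll_once` is hypothetical.

```python
import hashlib

def poll_once(urls, fetch, last_digests):
    """Fetch every URL and report which ones changed since the previous poll.

    Returns (changed_urls, fetch_count). Every page is fetched on every
    pass, even when most pages are unchanged.
    """
    changed, fetches = [], 0
    for url in urls:
        body = fetch(url)
        fetches += 1
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if last_digests.get(url) != digest:
            changed.append(url)
            last_digests[url] = digest
    return changed, fetches

# Hypothetical stand-in for the network: URL -> current page body.
site = {f"page{i}.example/": f"content {i}" for i in range(100)}
digests = {}

first_changed, first_fetches = poll_once(sorted(site), site.__getitem__, digests)
site["page7.example/"] = "updated content"
second_changed, second_fetches = poll_once(sorted(site), site.__getitem__, digests)
```

In the second pass only one page has changed, yet all one hundred pages are fetched again; the cost of polling scales with the total number of pages rather than with the amount of change.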
Somewhat separate from the development of the above-mentioned Internet searching technology, so-called “metacomputer” technology has been developing. The idea of a metacomputer was first popularized by the SETI@home project in 1996, relating to searches for extraterrestrials by scanning the sky for intelligent radio signals originating outside the solar system. Metacomputers were then developed for more generalized uses.
A metacomputer system manages and contains a large number of machines (the managing servers and the contributor nodes). Together, these machines form a powerful virtual computer. In a metacomputer, as in any computer, there is an operating system, and there are applications that run on top of the operating system. In the original use by the SETI@home project, the application and operating system (“OS”) were combined, and only the SETI application could run on that system. Starting in about the first half of 2000, many companies took up this idea, creating such virtual computers on which people could run their distributed applications.
Such metacomputers require operating systems, and the Share System developed at Johns Hopkins by Jacob Green and John Schultz was an early development of such a virtual computer. Green et al. have published information about the Share System, e.g., at www.cnds.jhu.edu. A metacomputer such as Share has a two-component basic architecture consisting of the Contributor Environment (CE), which runs on contributors' machines, and the Allocation Servers (AS), which hand out jobs to the CEs. Another such metacomputer was constructed by what was known as PopularPower until March 2001, when it went out of business under that name. Another metacomputer is that of Distributed Science.net (created when ProcessTree merged with Dcypher.net). Other operating systems include Entropia.com, AppliedMeta.com, UnitedDevices and DataSynapse.
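The two-component architecture described above, with an Allocation Server handing out jobs to Contributor Environments, can be sketched in miniature as follows. This is only an illustrative single-process model under assumed names and interfaces (`request_job`, `submit_result`, `run_one`); it is not the Share System's actual design or API, and the round-robin loop merely stands in for contributor machines working concurrently.

```python
from collections import deque

class AllocationServer:
    """Hands out queued jobs to contributor environments and collects results."""
    def __init__(self, jobs):
        self._pending = deque(jobs)
        self.results = {}

    def request_job(self):
        return self._pending.popleft() if self._pending else None

    def submit_result(self, job_id, result):
        self.results[job_id] = result

class ContributorEnvironment:
    """Runs on a contributor's machine: pulls jobs and executes them locally."""
    def __init__(self, name, server):
        self.name, self.server, self.completed = name, server, 0

    def run_one(self):
        job = self.server.request_job()
        if job is None:
            return False
        job_id, func, arg = job
        self.server.submit_result(job_id, func(arg))
        self.completed += 1
        return True

# Six hypothetical jobs, each a tuple of (job id, function, argument).
server = AllocationServer([(i, lambda x: x * x, i) for i in range(6)])
contributors = [ContributorEnvironment(f"ce{k}", server) for k in range(3)]

# Round-robin stands in for machines working concurrently.
while any([ce.run_one() for ce in contributors]):
    pass
```

Each contributor only ever talks to the server, so contributors can be added or removed without changing the jobs themselves.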
Eichstaedt et al. in U.S. Pat. No. 6,182,085 issued Jan. 30, 2001 for “Collaborative Team Crawling: Large Scale Information Gathering Over the Internet”, recognize the need to make a crawler (gatherer) more efficient, and provide a method using multiple processors for collaborative web crawling and information processing. They use a set of crawlers running at the same location. However, a need still remains for systems for maximally retrieving, indexing, rating and making available current content on a network.
U.S. Pat. No. 6,151,624 to Teare et al., issued Nov. 21, 2000, entitled “Navigating Network Resources Based on Metadata”, provides for the crawler to execute every 24 hours. (Column 17.) The crawler polls Web sites on the Internet to locate customer sites that have updates, and a database is updated. (Column 18.) Although the crawler is commanded to execute every 24 hours, index files are only updated weekly based on the database. (Columns 17-18).
Thus, improved technology is needed for successfully gathering fresh content from a network such as the Internet, especially technology that can operate without getting bogged down by the vast amount of unchanged content on the Internet. Also, there remains a need for technology to effectively rank new content.