This invention relates generally to computer networks, and more specifically to providing an attribute bounded network of computers.
Two of the major challenges facing the World Wide Web (“Web”) are the freshness of data (frequency of update) and depth (quality of coverage) of indexes on data. When a search engine spiders a Web site to update an index, the index is fresh at that time. However, the search engine may not visit that Web site again for several weeks or months, so if the site contains time-sensitive content, such as weekly specials at a grocery store, or events happening around town this weekend, the data may not be indexed until after it is no longer relevant. Also, search engines' indexing capabilities reach only a fraction of the data on the Web. Even in the best case, the majority of Web content is not being indexed.
General purpose search engines present several problems when attempting to relate their indexing activity to any one person's needs. Firstly, the search engines lack depth, as they do not index all the sites that any one user is interested in, but rather follow a structured methodology to choose which pages are indexed. The indexing technique often used is “spidering”, whereby a software process follows (“crawls”) links in Web pages and indexes the linked Web pages. Google™, a popular search engine, advertises over one billion Web pages indexed, but most of these indexed Web pages are not relevant to any one person. Google™ attempts to provide indexing for Web pages that would interest Web page viewers as a whole, treating all the viewers as belonging to a single common set. This can benefit viewers who have very common interests that closely match the single, global set of index entries, because popular Web sites will get indexed more often, thus providing fresher data. But viewers have no control over what Google™, or any other search engine, spiders and indexes; therefore most of the index data is not relevant to any one viewer.
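The spidering technique described above can be illustrated with a minimal sketch. It is only an assumption-laden illustration, not the method of any particular search engine: the names `LinkExtractor`, `spider`, and `SITE` are hypothetical, and the pages are held in memory rather than fetched over the network.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every anchor tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def spider(start_url, fetch, max_pages=100):
    """Breadth-first crawl from start_url.

    fetch is any callable that returns a page's HTML, or None if the page
    cannot be retrieved. Returns {url: [links found on that page]}.
    """
    index = {}
    queue = deque([start_url])
    while queue and len(index) < max_pages:
        url = queue.popleft()
        if url in index:
            continue  # already spidered
        html = fetch(url)
        if html is None:
            continue  # dead link, nothing to index
        parser = LinkExtractor()
        parser.feed(html)
        index[url] = parser.links
        # Follow ("crawl") the links that have not been spidered yet.
        queue.extend(link for link in parser.links if link not in index)
    return index


# Hypothetical three-page site, kept in memory for illustration only.
SITE = {
    "/home": '<a href="/specials">Specials</a> <a href="/events">Events</a>',
    "/specials": '<a href="/home">Home</a>',
    "/events": "No links here",
}

crawled = spider("/home", SITE.get)
```

Note that the spider only ever reaches pages that are linked from pages it has already visited, which is one reason pages of interest to a particular viewer may never be indexed.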
Additionally, general purpose search engines offer viewers no control over how often a Web site will be indexed, thus affecting the freshness of the index data. When a viewer finds a particular Web site of interest, they cannot influence the spidering schedule of Google™ to keep the Web page index data fresh. It is up to the viewer to visit the Web page each day in order to ensure they are aware of updates. Microsoft has offered support for a limited feature inside of Internet Explorer that allows a viewer to have certain “bookmarks” of Web sites automatically reloaded periodically, or on demand, and report any content changes. This technique only works on Web pages that viewers specifically bookmark and does not address relevant content on Web pages the viewer does not know exist. Any search engine can only spider a limited number of pages every day, and the search engine has no idea which Web pages have changed since the last update, so it must spider all the Web pages to detect new data. This results in some (popular) Web pages being spidered daily, and some (less popular, but very relevant to a particular user) being spidered weekly, monthly, or not at all.
The second challenge presented to search engines is that much of the content available on the Web is stored in databases and not in static pages, so that when a search engine spiders a page, it collects only the static page, and not the much larger set of data stored in the database that can be accessed through the static page. Some people have estimated that the information available in databases and custom served pages is five hundred times larger than the static size of the Internet (see www.brightplanet.com). Therefore search engines only scratch the surface of the potential content available to the user and, depending on the search, may be missing the majority of the data available.
In order to solve the problems of freshness of data, and depth of data, inherent in general purpose search engines, companies have attempted to use peer-to-peer (“P2P”) and distributed computing technologies. Although these technologies have been successful in other areas, major Web page index companies are not fully utilizing these technologies for indexing and searching the Web. Some companies (e.g., ThinkStream and GoneSilent) have suggested that they will be releasing products using peer-to-peer and distributed computing technologies to perform Web site indexing.
One example technology employs a pre-distributed computing model, in which a central server computer collects a list of all electronic document addresses (URLs) on the Web, and assigns the spidering and indexing of those pages to thousands of client computers connected to the network. With a large enough network of client computers, the entire Web can be effectively spidered daily, or even more frequently. The client computers are each given one or more URLs to spider. As index data is generated, it can then be sent to the central server.
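The division of labor in such a model can be sketched as follows. This is a simplified illustration under stated assumptions: the round-robin assignment, the word-to-URL index, and all names (`assign_urls`, `index_page`, `merge_into`, the example URLs and page text) are hypothetical and are not taken from any actual product.

```python
def assign_urls(urls, client_ids):
    """Central server: deal URLs to clients round-robin.

    Returns {client_id: [urls assigned to that client]}.
    """
    if not client_ids:
        raise ValueError("at least one client is required")
    assignments = {client_id: [] for client_id in client_ids}
    for position, url in enumerate(urls):
        client_id = client_ids[position % len(client_ids)]
        assignments[client_id].append(url)
    return assignments


def index_page(url, text):
    """Client: stand-in for indexing, mapping each word to the page URL."""
    return {word.lower(): {url} for word in text.split()}


def merge_into(central_index, partial_index):
    """Central server: merge index data sent back by a client."""
    for word, urls in partial_index.items():
        central_index.setdefault(word, set()).update(urls)


# Hypothetical page contents, standing in for pages fetched by clients.
PAGES = {
    "http://a.example/": "weekly specials milk",
    "http://b.example/": "weekend events",
    "http://c.example/": "milk delivery",
}

assignments = assign_urls(list(PAGES), ["client-1", "client-2"])

central_index = {}
for client_id, assigned in assignments.items():
    for url in assigned:
        # Each client indexes its own URLs and sends the result to the server.
        merge_into(central_index, index_page(url, PAGES[url]))
```

In this sketch the central server still decides which URLs are spidered and holds the merged index; the clients only contribute processing capacity.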
Web page indexing is just one of the areas that suffer from a lack of effective and efficient distributed processing systems. Other systems can benefit from an attribute bounded approach to distributed computing. For example, the Search for Extraterrestrial Intelligence (“SETI”) project uses spare CPU cycles belonging to Internet volunteers to analyze blocks of recorded radio signals for variations that may indicate another source of intelligence within the universe. This is known as the SETI@home project. Once a block of recorded radio signals is processed by a member of the SETI@home network, the result can be returned to the centralized SETI server. This process, when replicated tens or hundreds of thousands of times, has the capacity to analyze data more closely than is currently possible with existing SETI computers.
Napster is an online file sharing coordination system that allows client computers to search for and transfer files using a peer-to-peer network mechanism over the Internet. Clients of Napster connect to the Napster central server and upload information about files (typically .MP3 music files) located on the client's computer. This information can include the file's name, a description of the file, a location of the file and some information about the transmission speed of the client computer's connection to the network. The uploaded information is indexed in a searchable database on Napster's central server. A client can then access the index and search for a particular file (e.g., song). If a match is found, information on the location of the file and transmission speed of the connection is made available to the client. The client then uses software to initiate a direct transaction with the computer having the file in order to download the file to the requesting computer. This peer-to-peer file transfer with central server coordination does not allow attribute bounded regions as part of the process.
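The central-index arrangement described above can be sketched as follows. This is only an illustration of the general pattern, not Napster's actual protocol or data format; `FileRecord`, `CentralIndex`, and the example hosts and files are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    """Information a client uploads about one file it holds."""
    name: str
    description: str
    location: str      # address of the client computer holding the file
    speed_kbps: int    # reported connection speed of that client


class CentralIndex:
    """Searchable database kept on the central server."""

    def __init__(self):
        self._records = []

    def upload(self, record):
        self._records.append(record)

    def search(self, term):
        """Return records whose name or description contains term."""
        term = term.lower()
        return [
            record for record in self._records
            if term in record.name.lower()
            or term in record.description.lower()
        ]


index = CentralIndex()
index.upload(FileRecord("blue_song.mp3", "acoustic demo", "peer-a.example", 56))
index.upload(FileRecord("red_song.mp3", "live recording", "peer-b.example", 512))

matches = index.search("live")
# The server only reports where the file is; the requesting client would
# then contact matches[0].location directly to download it.
```

The server coordinates the search but never carries the file itself, and nothing in the lookup restricts the search to a bounded group of peers.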
Another peer-to-peer system, Gnutella, provides fully distributed information sharing without the use of a central server. Gnutella client software creates a mini search engine and file sharing system between computers connected on a network. Computers in a Gnutella network are identified by an IP address. Each computer has a list of “first degree” IP addresses; these are the computers that the software will contact in order to execute a search. Each of these first degree computers also has a list of IP addresses that it can contact (“second degree” IP addresses). This process can repeat until all the contacted computers have exhausted their lists, but the system allows a “time to live” setting to limit the degree of contact (e.g., 5 levels). Connecting to subsequent computers in a Gnutella network is based upon accessing computers that others have already accessed. Any search is influenced by the previous activity of the computers contacted during the search.
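The degree-limited search described above can be sketched as a breadth-first traversal with a “time to live” counter. This is a simplified model and not the actual Gnutella wire protocol; the function `flood_search`, the `PEERS` table, and the addresses in it are hypothetical.

```python
def flood_search(peers, start, wanted, ttl):
    """Search outward from start, one degree of contact per TTL unit.

    peers maps an address to {"neighbors": [addresses], "files": set()}.
    Returns the set of addresses, within ttl degrees, that hold wanted.
    """
    hits = set()
    visited = {start}
    frontier = [start]
    while frontier and ttl > 0:
        next_frontier = []
        for address in frontier:
            for neighbor in peers[address]["neighbors"]:
                if neighbor in visited:
                    continue  # do not query the same computer twice
                visited.add(neighbor)
                if wanted in peers[neighbor]["files"]:
                    hits.add(neighbor)
                next_frontier.append(neighbor)
        frontier = next_frontier
        ttl -= 1  # one more degree of contact has been used up
    return hits


# Hypothetical chain of four computers; each knows only the next one.
PEERS = {
    "10.0.0.1": {"neighbors": ["10.0.0.2"], "files": set()},
    "10.0.0.2": {"neighbors": ["10.0.0.3"], "files": set()},
    "10.0.0.3": {"neighbors": ["10.0.0.4"], "files": {"song.mp3"}},
    "10.0.0.4": {"neighbors": [], "files": {"song.mp3"}},
}

near = flood_search(PEERS, "10.0.0.1", "song.mp3", ttl=2)
far = flood_search(PEERS, "10.0.0.1", "song.mp3", ttl=3)
```

With a time to live of 2 the search stops at the second degree of contact and never reaches the fourth computer, even though that computer holds a matching file. Which computers are reached depends entirely on the neighbor lists, not on any attribute of the content being sought.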