Not Applicable
1. Field of the Invention
The invention relates to the field of Internet search engines, web browsers, and resource gathering, and has special application in situations where these functions must be implemented in extremely large networks.
2. Description of the Related Art
One of the greatest challenges for a repository that contains summary data (metadata) describing external data resources is keeping this metadata up to date. Currently, the broad solution is to periodically and exhaustively search (recrawl) the stored resources, summarize them, and store the updated summary data in the repository, replacing the old data. This produces a huge workload. For example, recrawling a repository of 100 million documents at an average download and processing time of 30 seconds per document amounts to 3 billion seconds of work, which is extremely time consuming.
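The scale of this workload can be illustrated with a short calculation. The following is a minimal sketch in Python that uses only the two figures given above (100 million documents, 30 seconds each). The worker counts and the assumption of perfectly parallel crawlers are hypothetical and are chosen only to show the order of magnitude.

```python
# Back-of-the-envelope estimate of one full recrawl, using the figures in the text.
DOCUMENTS = 100_000_000        # repository size stated in the text
SECONDS_PER_DOCUMENT = 30      # average download and processing time stated in the text

SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def recrawl_seconds(documents: int, seconds_per_document: float, workers: int = 1) -> float:
    """Wall-clock seconds for one full recrawl.

    Assumes (hypothetically) that the work divides evenly across workers
    with no coordination overhead.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return documents * seconds_per_document / workers


total = recrawl_seconds(DOCUMENTS, SECONDS_PER_DOCUMENT)
print(f"sequential: {total / SECONDS_PER_YEAR:.1f} years")

for workers in (1_000, 10_000):  # hypothetical crawler counts
    days = recrawl_seconds(DOCUMENTS, SECONDS_PER_DOCUMENT, workers) / SECONDS_PER_DAY
    print(f"{workers} workers: {days:.1f} days")
```

Under these assumptions a single sequential crawler would need roughly 95 years, and even 1,000 perfectly parallel crawlers would need about 35 days for each complete refresh.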
The problem with the prior art is the inherent difficulty web crawlers have in adequately searching and processing the vast amounts of information available on the Internet. Referring to FIG. 1, a typical web crawler (0101) would use one or more communication media (0111, 0112, 0113) with corresponding communication links (0121, 0122, 0123) to access a plethora of Internet web sites (0131, 0132, 0133) and thus incur 100% of the computing and time cost of performing the search and organizing the data.
With the volume of data available on the Internet increasing at an exponential rate, the resources required to perform this search and organizing effort are substantial and are becoming a significant burden on Internet web search engines. While current approaches to this problem involve the use of additional web crawlers (0101), this represents an unacceptable cost burden on Internet search sites given the recent trend toward nonlinear data growth on the Internet.
Accordingly, a need exists for a method and a system to permit reduction of the resources required for Internet search engines and their corresponding data retrieval and organization tasks.
The present invention relates to the area of Internet search engines. Today's Internet search engines consist roughly of two major parts. One part is responsible for resource gathering. The other part handles information storage and indexing. The present invention addresses one problem that arises in resource gathering using web crawlers or gatherers.
By using a parallel architecture for the gatherers and new approaches such as distributed team crawling, the amount of time needed to perform a complete web search and index can be dramatically reduced. However, it still takes a considerable amount of time and resources to keep the repository up to date. Adding more resources is generally the sole responsibility of the search engine's owners. The present invention seeks to reduce the time and resources spent by search engine companies by placing some of the resource gathering tasks in the hands of the user's web browser. The present invention would typically load a small program (e.g., a Java applet) into the browser that would perform some specified resource gathering and summarization (a lightweight task). A user of the search engine would still perform all the steps they currently perform in activating a search, including: starting at the home page of the search engine; typing some keyword(s) and selecting “start search”; and viewing the results screen and selecting from the results.
The present invention could be loaded for as many URLs as the search engine owner decides. To the user of the search engine, however, the present invention does not need to appear graphically and may operate as a background task. With the present invention in place, a small program is loaded into the user's Internet browser and can be directed to perform several information gathering tasks, such as: crawl a specified URL; inform the search engine whether the site has been updated since a particular date; and inform the search engine of any changes to web data since a particular date.
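The three information gathering tasks named above can be sketched as follows. This is an illustrative sketch only, written in Python for brevity rather than as the Java applet contemplated here. Every name in it (PageSnapshot, Summary, summarize, updated_since, changed) is hypothetical, and the choices of a SHA-256 content hash and a word count as the summary data are assumptions, not details taken from the invention. Fetching the page over the network is omitted so that the sketch stays self-contained.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class PageSnapshot:
    """What the client observed for one URL (all field names are hypothetical)."""
    url: str
    last_modified: Optional[datetime]  # e.g. parsed from an HTTP Last-Modified header
    body: str


@dataclass(frozen=True)
class Summary:
    """Lightweight summary data a client might report to the search engine."""
    url: str
    content_hash: str
    word_count: int


def summarize(page: PageSnapshot) -> Summary:
    """Task 1: reduce a crawled URL to summary data (assumed: hash + word count)."""
    digest = hashlib.sha256(page.body.encode("utf-8")).hexdigest()
    return Summary(page.url, digest, len(page.body.split()))


def updated_since(page: PageSnapshot, since: datetime) -> Optional[bool]:
    """Task 2: has the page been updated since `since`? None means unknown."""
    if page.last_modified is None:
        return None
    return page.last_modified > since


def changed(page: PageSnapshot, stored: Summary) -> bool:
    """Task 3: does the page differ from the summary held in the repository?"""
    return summarize(page).content_hash != stored.content_hash


# Usage with made-up data: the repository holds a summary of the old page.
old = PageSnapshot("http://example.com/", datetime(2000, 1, 1, tzinfo=timezone.utc),
                   "hello world")
stored = summarize(old)
new = PageSnapshot("http://example.com/", datetime(2000, 6, 1, tzinfo=timezone.utc),
                   "hello brave new world")

print(updated_since(new, datetime(2000, 3, 1, tzinfo=timezone.utc)))  # True
print(changed(new, stored))                                           # True
print(changed(old, stored))                                           # False
```

Comparing a content hash against the stored summary is one possible way to detect changes when a site reports no modification date. A real gatherer might use a different summary format.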
The present invention generally operates within the context of the user's computer on a voluntary basis. Possible motivations for the user to participate in this searching methodology could include (but are not limited to) the following:
Free membership (receiving free reviews, articles, research material, and a free notification service for search results) from the Internet search engine;
Some reward (for a specific amount of donated computing resources, the user receives a free T-shirt, book, CD, etc.);
Through participation, the search engine will be more up to date and provide improved search accuracy. In this manner, users of the search engine community may actively help to improve search quality by their participation.
Because the present invention can be implemented in Java (Sandbox model, etc.), it is secure and cannot inflict any damage on the user's computer, since it has no write access to the user's storage systems. Processing results are sent to the web server over a network connection. Furthermore, Java applets are already a common standard and enjoy high acceptance among Internet users.