1. Field of the Invention
The present invention is directed generally to computer systems for automatically searching for files on a network and, more particularly, to systems used to locate files on the Internet. The present invention is also directed to software for producing a catalog of the files found on the Internet by such systems.
2. Description of the Invention Background
High-speed networks connect the National Science Foundation ("NSF") supercomputers to form a communications backbone known as the NSFNET. The NSFNET is the foundation for the U.S. segment of the Internet. The Internet is a worldwide network of computers connecting over sixty countries. In addition to those countries having full Internet access, there are a large number of countries that have something less than full access to the Internet.
In the early days of the Internet, file transfer over the Internet was performed pursuant to a file transfer protocol or FTP. Sites which contain such files are referred to as FTP sites and that portion of the Internet is often referred to as FTP space. See FIG. 1. A system called Archie maintains a database of FTP file names that reside on approximately 1,500 host computers. Thus, Archie is a tool that can be used to locate FTP files in FTP space.
Another tool for locating files on the Internet was developed by the University of Minnesota and is referred to as Gopher. Gopher is a software application that resides on a host computer. There are more than 5,000 Gopher servers today and files residing on the Gopher servers are referred to as Gopher space. See FIG. 1. Although Gopher represented an improvement in user-friendliness, it is impossible to know whether all the information needed about a particular topic resides on the particular Gopher server to which the user has connected. Visiting all 5,000 Gopher servers to perform a complete search on a single topic would take an enormous amount of time. Hence, a search tool, Veronica, was developed to search Gopher space.
The latest development on the Internet is the use of the hypertext transfer protocol ("HTTP"). The World Wide Web (WWW) is a part of the Internet and represents all the servers that offer access to HTTP space. See FIG. 1. Client programs, referred to as browsers, such as Mosaic, give the user access to and the ability to download files from the WWW as well as Gopher space and FTP space whenever a file in HTTP space has a pointer to such files.
Use of the Internet is growing at a dramatic pace. For example, in 1983 there were approximately five hundred computers connected to the Internet. Today, there are over three million computers connected to the Internet. Information providers are placing information in the form of files on the Internet at a dramatic pace. The rate of growth by new registered Internet sites is 8% to 10% per month, with over 41,500 sites as of February 1995. There is no central authority which controls the Internet, edits the material placed on the Internet, or performs any type of supervisory role. Thus, the vast amount of information on the Internet forms a virtual sea of unorganized, unedited information.
In an effort to bring some order to the chaos, efforts have been made to provide a catalog of the Internet so that files can be quickly located and evaluated to determine if they contain useful information. Because of the vast size of the Internet, specialized types of software, referred to as robots, wanderers, or spiders, have been crawling through the Internet and collecting information about what they find. Such robots, however, quickly caused problems. Whenever a robot gained access to a server, the server could be rendered ineffective for its normal purpose while it processed all of the requests for information generated by the robot software. As a result of numerous complaints, guidelines have been developed under which robots perform a search in a manner which prevents a particular server from being seized by the robot. However, such searches often result in particularly relevant files being passed over in favor of much less relevant files.
A second problem is encountered in dealing with the massive amount of information that is uncovered by the robot. Some form of data selection and/or compression is needed to reduce the amount of data retained in the catalog while at the same time maintaining sufficient data to enable the user to make an intelligent choice about the files to be visited. Thus, the need exists for a software robot which can intelligently search through the files of the Internet and for a mechanism for processing the located files for presentation to an end user in a meaningful manner.
The present invention is directed to a method of constructing a catalog of the files stored on a network comprised of a plurality of interconnected computers each having a plurality of files stored thereon. The method is comprised of the following steps:
(a) establishing a queue containing at least one address representative of a file stored on one of the interconnected computers;
(b) ranking each address in the queue according to a heuristic;
(c) downloading the file corresponding to the address in the queue having the highest ranking;
(d) processing the downloaded file to generate certain information about the downloaded file for the catalog;
(e) adding to the queue any addresses found in the downloaded file; and
(f) repeating steps (b) through (e).
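Steps (a) through (f) can be illustrated with a minimal sketch. It is not the claimed implementation: the network is assumed to be a hypothetical in-memory dictionary rather than real hosts, the heuristic is passed in as an arbitrary `rank` function, and the names `build_catalog`, `Network`, and `max_files` are introduced here for illustration only.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for the network (an assumption of this sketch):
# address -> (text of the file, addresses found in the file).
Network = Dict[str, Tuple[str, List[str]]]


def build_catalog(
    network: Network,
    seeds: List[str],
    rank: Callable[[str], float],
    max_files: int = 100,
) -> Dict[str, dict]:
    """Rank the queue, fetch the best address, catalog it, enqueue its links."""
    queue = list(dict.fromkeys(seeds))  # step (a): initial queue, duplicates dropped
    seen = set(queue)                   # addresses that have ever been queued
    catalog: Dict[str, dict] = {}

    while queue and len(catalog) < max_files:
        # Step (b): rank every queued address; step (c): take the highest.
        best = max(queue, key=rank)
        queue.remove(best)
        if best not in network:
            continue                    # unreachable address: skip it
        text, links = network[best]     # simulated download

        # Step (d): record some information about the downloaded file.
        catalog[best] = {"size": len(text), "words": len(text.split())}

        # Step (e): add newly discovered addresses to the queue.
        for address in links:
            if address not in seen:
                seen.add(address)
                queue.append(address)
        # Step (f): the loop repeats the ranking with the updated queue.
    return catalog
```

Re-ranking the whole queue on every pass keeps the sketch close to the wording of step (b); a heap would be faster but would need re-keying whenever a ranking changes.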
According to one embodiment of the present invention, the heuristic used is popularity. Whenever an address in a downloaded file points to a file stored on a computer other than the host computer storing the downloaded file, a counter for the referenced file is incremented. The value in the counter is a measure of the popularity of the referenced file.
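The popularity counter described above can be sketched as follows. This is only an illustration under stated assumptions: addresses are assumed to be URLs whose host can be read with the standard `urllib.parse.urlsplit`, every occurrence of an external address is counted, and the helper names `host_of` and `count_external_references` are hypothetical.

```python
from collections import Counter
from typing import Iterable
from urllib.parse import urlsplit


def host_of(address: str) -> str:
    """Return the lower-cased host name of an address ('' if it has none)."""
    return (urlsplit(address).hostname or "").lower()


def count_external_references(
    popularity: Counter, source_address: str, found_addresses: Iterable[str]
) -> None:
    """Increment the counter of each referenced file stored on another host."""
    source_host = host_of(source_address)
    for address in found_addresses:
        # References within the same host do not count toward popularity.
        if host_of(address) != source_host:
            popularity[address] += 1
```

The resulting counter can serve directly as the ranking function for the queue, for example `rank=lambda address: popularity[address]`.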
According to one embodiment of the present invention, the downloaded file is processed to provide such information as a significant word list, an excerpt of the downloaded file, the address, size of the file, and number of words therein, and to save the file's title and any headings and subheadings. The significant word list may be used in subsequent searches of the catalog created by the method of the present invention. When a search identifies a file, the information such as the abstract, title, etc. may be provided to an end user who can then determine whether the entire text of the identified file should be downloaded from its original location on the network.
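A catalog record of the kind described above might be produced along the following lines. The sketch makes several assumptions not stated in the text: the downloaded file is HTML, "significant" words are simply the most frequent words outside a small hypothetical stop list, and the excerpt is the leading characters of the body text. The names `make_record`, `STOP_WORDS`, `top_n`, and `excerpt_chars` are illustrative only.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical stop list; how significance is actually determined is not given here.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}


class _Extractor(HTMLParser):
    """Collects the title, the headings (h1-h6), and the body text."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.headings = []
        self.text_parts = []
        self._current = None  # tag currently being captured, if any

    def handle_starttag(self, tag, attrs):
        if tag == "title" or re.fullmatch(r"h[1-6]", tag):
            self._current = tag
            if tag != "title":
                self.headings.append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        else:
            if self._current is not None:
                self.headings[-1] += data
            self.text_parts.append(data)


def make_record(address: str, html: str, top_n: int = 5, excerpt_chars: int = 80) -> dict:
    """Build one catalog entry from a downloaded HTML file."""
    parser = _Extractor()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.text_parts).split())  # normalize whitespace
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return {
        "address": address,
        "title": parser.title.strip(),
        "headings": [h.strip() for h in parser.headings],
        "significant_words": [w for w, _ in counts.most_common(top_n)],
        "excerpt": text[:excerpt_chars],
        "size": len(html.encode("utf-8")),  # size in bytes
        "word_count": len(words),
    }
```

Only this record, not the full text, needs to be retained in the catalog; the end user can follow the stored address to retrieve the original file.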
Processing of the files may also include saving information about files mentioned in downloaded files so that such information may be made available during searches even though such files have not been downloaded and fully processed. That enables a catalog to be rapidly constructed according to the method of the present invention. Additionally, because files are downloaded based on their popularity, the files which are added to the catalog are likely to be more important and more meaningful to an end user performing a search in the catalog. Those and other advantages and benefits of the present invention will be apparent from the Description of a Preferred Embodiment hereinbelow.