The present invention relates generally to data distributed within a network, and more particularly to a method and system for generating and updating an index or catalog of object references for data distributed within a network such as the Internet.
In the last several years, the Internet has experienced exponential growth in the number of Web sites and corresponding Web pages contained on the Internet. Countless individuals and corporations have established Web sites to market products, promote their firms, provide information on a specific topic, or merely provide access to the family's latest photographs for friends and relatives. This increase in Web sites and the corresponding information has placed vast amounts of information at the fingertips of millions of people throughout the world.
As a result of the rapid growth in Web sites on the Internet, it has become increasingly difficult to locate pertinent information in the sea of information available on the Internet. As will be understood by those skilled in the art, a search engine, such as Inktomi, Excite, Lycos, Infoseek, or FAST, is typically utilized to locate information on the Internet. FIG. 1 illustrates a conventional search engine 10 including a router 12 that transmits and receives message packets between the Internet and a Web crawler server 14, index server 16, and Web server 18. As understood by those skilled in the art, a Web crawler or spider is a program that roams the Internet, accessing known Web pages, following the links in those pages, and parsing each Web page that is visited to thereby generate index information about each page. The index information from the spider is periodically transferred to the index server 16 to update the central index stored on the index server. The spider returns to each site on a regular basis, such as every several months, and once again visits Web pages at the site and follows links to other pages within the site to find new Web pages for indexing.
The index information generated by the spider is transferred to the index server 16 to update a catalog or central index stored on the index server. The central index is like a giant database containing information about every Web page the spider finds. Each time the spider visits a Web page, the central index is updated so that the central index contains accurate information about each Web page.
The Web server 18 includes search software that processes search requests applied to the search engine 10. More specifically, the search software searches the millions of records contained in the central index in response to a search query transferred from a user's browser over the Internet and through the router 12 to the Web server 18. The search software finds matches to the search query and may rank them in terms of relevance according to predefined ranking algorithms, as will be understood by those skilled in the art.
As the number of Web sites increases at an exponential rate, it becomes increasingly difficult for the conventional search engine 10 to maintain an up-to-date central index. This is true because it takes time for the spider to access each Web page, so as the number of Web pages increases it accordingly takes the spider more time to index the Internet. In other words, as more Web pages are added, the spider must visit these new Web pages and add them to the central index. While the spider is busy indexing these new Web pages, it cannot revisit old Web pages and update portions of the central index corresponding to these pages. Thus, portions of the central index become dated, and this problem is only being exacerbated by the rapid addition of Web sites on the Internet.
The method of indexing utilized in the conventional search engine 10 has inherent shortcomings in addition to the inability to keep the central index current as the Internet grows. For example, the spider only indexes known Web sites. Typically, the spider starts with a historical list of sites, such as a server list, and follows the list of the most popular sites to find more pages to add to the central index. Thus, unless your Web site is contained in the historical list or is linked to a site in the historical list, your site will not be indexed. While most search engines accept submissions of sites for indexing, even upon such a submission the site may not be indexed in a timely manner if at all. Another shortcoming of the conventional search engine 10 is the necessity to lock records in the central index stored on the index server 16 when these records are being updated, thus making the records inaccessible to search queries being processed by the search program while the records are locked.
Another inherent shortcoming of the method of indexing utilized in the conventional search engine 10 is that only Standard Generalized Markup Language (SGML) information is utilized in generating the central index. In other words, the spider accesses or renders a respective Web page and parses only the SGML information in that Web page in generating the corresponding portion of the central index. As will be understood by those skilled in the art, due to the format of an SGML Web page, certain types of information may not be placed in the SGML document. For example, conceptual information such as the intended audience's demographics and geographic information may not be placed in an assigned tag in the SGML document. One skilled in the art will appreciate that such information would be extremely helpful in generating a more accurate index. For example, a person might want to search in a specific geographical area, or within a certain industry. By way of example, assume a person is searching for a red barn manufacturer in a specific geographic area. Because SGML pages have no standard tags for identifying industry type or geographical area, the spider on the server 14 in the conventional search engine 10 does not have such information to utilize in generating the central index. As a result, the conventional search engine 10 would typically list not only manufacturers but also the locations of picturesque red barns in New England that are of no interest to the searcher.
There are four methods currently used to update centrally stored data or a central database from remotely stored data: 1) all of the remotely stored data can be copied over the network to the central location, 2) only those files or objects that have changed are copied to the central location, 3) a transaction log can be kept at the remote location and transmitted to the central location and used to update the central location's copy of the data or database, and 4) a differential can be created by comparing the remotely stored historic copy and the current remotely stored copy; this differential can then be sent to the central location and incorporated into the centrally stored historic copy of the data to create a copy of the current remotely stored copy. All of these methods rely on duplicating the remote data when in many cases the only thing needed is a reference or a link to the remote data.
Some Internet search engines, such as Infoseek, have proposed a distributed search engine approach to assist their spidering programs in finding and indexing new Web pages. Infoseek has proposed that each Web site on the Internet create a local file named “robots1.txt” containing a list of all files on the Web site that have been modified within the last twenty-four hours. A spidering program would then download this file and from the file determine which pages on the Web site should be accessed and reindexed. Files that have not been modified will not be indexed, saving bandwidth on the Internet otherwise consumed by the spidering program and thus increasing the efficiency of the spidering program. Additional local files could also be created, indicating files that had changed in the last seven days or thirty days or containing a list of all files on the site that are indexable. Under this approach, only files in HTML format, Portable Document Format (PDF), and other file formats that may be accessed over the Internet are placed in the list since the spidering program must be able to access the files over the Internet. This use of local files on a Web site to provide a list of modified files has not been widely adopted, if adopted by any search engines at all.
In addition to their search engine sites maintained on the Internet, several search engines, such as AltaVista® and Excite, have developed local or Web server search engine programs that locally index a user's computer and integrate local and Internet searching. At present, a typical user will use the “Find” utility within Windows to search for information on his personal computer or desktop, and a browser to search the Internet. As local storage for personal computers increases, the Find utility takes too long to retrieve the desired information, and then a separate browser must be used to perform Internet searches.
The AltaVista® program is named AltaVista® Discovery, and generates a local index of files on a user's personal computer and provides integrated searching of the local index along with conventional Internet searches using the AltaVista® search engine.
The AltaVista® Discovery program includes an indexer component that periodically indexes the local set of data defined by the user and stores pertinent information in its index database to provide data retrieval capability for the system. The program generates a full indexing at the time of installation, and thereafter incremental indexing is performed to lower the overhead on the desktop. In building the local index, the indexer records relevant information, indexes the relevant data set, and saves each instance of all the words of that data, as well as the location and other relevant information. The indexer handles different data types including Office '97 documents, various types of e-mail messages such as Eudora, Netscape, text and PDF files, and various mail and document formats. The indexer also can retrieve the contents of an HTML page to extract relevant document information and index the document so that subsequent search queries may be applied on browsed documents.
A program offered by Excite, known as Excite for Web Servers (“EWS”), gives a Web server the same advanced search capabilities used by the Excite search engine on the Internet. This program generates a local search index of pages on the Web server, allows visitors to the Web server to apply search queries, and returns a list of documents ranked by confidence in response to the search queries. Since the program resides on the Web server, even complex searches are performed relatively quickly because the local search index is small relative to the index created by conventional search engines on the Internet.
The local search engine utilities just described are programs that execute on a Web server or other computer to assemble information or “meta data” about files or other objects on that computer. The assembled meta data is retained and used at the computer. Other programs, in the form of “viruses”, execute on a computer, assemble information about files or other objects on the computer, and then send the information across a network where it is assembled into a database. A virus is a piece of software designed and written to adversely affect your computer by altering the way it works without your knowledge or permission. Most virus programs are built to prove a point (that security could be breached), to display an annoying if harmless message that the author felt was important, or to destroy data. Very rarely are they designed to collect or transmit data, due to the complexity of internetworking communications. Information on viruses is understood by those skilled in the art, and is readily available from prominent virus protection software firms such as Symantec/Norton, McAfee, and Dr. Solomon.
Several types of virus programs collect and transmit information obtained from files on a computer that are accessed by the virus. One such program, though not necessarily categorized as a virus, is loaded without the user's knowledge and reports information about the user, the programs installed on the computer, or the user's usage habits to another computer across the Internet for data collection purposes. There have been several well-publicized cases of major software companies including code in application programs which performs this sort of function when a computer is attached to the Internet. Usually (though not always), the software companies in question have published information which informs users of means by which this activity may be halted. Technically, these “viruses” are an original part of the application program and so are not generally considered viruses.
Another virus of this type that has recently appeared affects only Internet servers, usually UNIX based, which have lax security administration. This type of virus is known as a “mail relay virus”, and is designed to use system resources for forwarding bulk unsolicited email. The virus program is loaded by a person who manages to pierce the root account security and copy a series of programs to a hidden directory on the system. These programs contain a list of machines which are known to have the same program installed and their TCP/IP addresses. The program then discovers (via system configuration files) what the upstream email server is for the local system, and begins accepting and forwarding bulk email through the system. Typically, most Internet service providers do not allow incoming mail from someone outside of the subnet that the mail server is on, hence the need to infect a machine on that subnet. Once the programs are loaded, the TCP/IP address of the infected machine is sent back to the developer of the virus and is incorporated in future versions.
Another virus of this type is known as the “W97M/Marker.C.” This Word 97 macro virus affects documents and templates and grows in size by virtue of tracking infections along the way and appending the victim's name as comments to the virus code. Files are written to the hard drive on infected systems: one file prefixed by C:\HSF and then followed by eight randomly generated characters and the .SYS extension, and another file named “c:\netldx.vxd”. Both files serve as ASCII temporary files. The .SYS file contains the virus code and the .VXD file is a script file to be used with FTP.EXE in command line mode. This FTP script file is then executed in a shell command, sending the virus code, which now contains information about the infected computer, to the virus author's Web site called “CodeBreakers.”
There is a need for a method and system of indexing or cataloging remotely stored data that eliminates the need to copy the remote data to a central location and for indexing the world wide web that eliminates the need for spiders to be utilized in updating the index so that an up-to-date index is provided for performing searches, and that allows conceptual information to be utilized in generating the index to make search results more meaningful.
The present invention utilizes a bottom-up approach to index or catalog objects on a network instead of relying on a top-down approach as used by conventional search engines. The network that is indexed may be any network, including the global computer network which is known as the Internet or the World Wide Web. The result of indexing is a catalog of object references. Each object reference is a pointer which specifies a location or address where the object may be found. For purposes of the following discussion, each object consists of both contents (meaning only the essential data itself and not a header) and associated “meta data”. The meta data includes all information about the contents of an object but not the contents itself. The meta data includes any information that has been extracted from the contents and is associated with the object, any header information within the object, and any file system information stored outside of the object such as directory entries. The term “object” is used only to refer to anything stored on a site of interest to a person who might access the site from the network and its associated meta data. To avoid confusion, the term “object” is not used more broadly.
According to one aspect of the present invention, instead of using a central site including spidering software to recursively search all linked web pages and generate an index of the Internet, independent distributed components are located at each web host that report meta data about objects at the web host to the central server. A web host is the physical location of one or more web sites. A central catalog of object references is compiled on the central site from the meta data reported from each web host. According to another aspect of the present invention, one or more brochure files are created and stored within each web site to provide conceptual or non-keyword data about the site, such as demographics and categorization information, related to one or more parts of the web site. This conceptual information is then utilized in constructing the central catalog so that more accurate search results may be generated in response to search queries applied to the catalog.
According to one aspect of the present invention, a method constructs a searchable catalog of object references to objects stored on a network. The network includes a plurality of interconnected computers with at least one computer storing the catalog. Each computer that stores the catalog is designated a cataloging site. The other computers on the network store a plurality of objects and are each designated a source site. The method includes running on each source site a program that processes the contents of, and meta data related to, objects stored on the source site, thereby generating, for each processed object, meta data describing the object. The generated meta data is then transmitted from each source site to at least one cataloging site. The transmitted meta data is then aggregated at each cataloging site to generate the catalog of object references. Each source site may also be a cataloging site, and each item of transmitted meta data may also include a command to the cataloging site instructing the cataloging site what to do with the item of meta data.
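The flow just described, in which each source site generates meta data, transmits it with a command, and a cataloging site aggregates it, might be sketched as follows. This is only an illustrative sketch under stated assumptions: the names MetaItem, generate_meta, and Catalog are hypothetical, the commands "add" and "remove" are invented for illustration, keywords stand in for whatever meta data an implementation would extract, and network transmission is replaced by a direct function call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetaItem:
    """One item of transmitted meta data (hypothetical structure)."""
    command: str          # assumed commands: "add" or "remove"
    reference: str        # object reference, e.g. site name plus path
    keywords: tuple = ()  # meta data extracted from the object's contents


def generate_meta(site, objects):
    """Runs on a source site: derive meta data from each object's contents.

    `objects` maps an object name to its text contents.
    """
    items = []
    for name in sorted(objects):
        keywords = tuple(sorted(set(objects[name].lower().split())))
        items.append(MetaItem("add", f"{site}/{name}", keywords))
    return items


class Catalog:
    """Runs on a cataloging site: aggregates meta data into object references."""

    def __init__(self):
        self.index = {}  # keyword -> set of object references

    def apply(self, item):
        if item.command not in ("add", "remove"):
            raise ValueError(f"unknown command: {item.command}")
        # Both commands first drop any stale entries for this reference.
        for refs in self.index.values():
            refs.discard(item.reference)
        if item.command == "add":
            for word in item.keywords:
                self.index.setdefault(word, set()).add(item.reference)

    def search(self, word):
        return sorted(self.index.get(word.lower(), ()))
```

Note that the catalog in this sketch holds only references and keywords; the contents of the objects never leave the source site.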
According to another aspect of the present invention, a method constructs a searchable catalog of file references on a cataloging computer on a computer network. The network includes a plurality of interconnected source computers each having a file system for identifying files. The method includes running on each source computer a program that accesses the file system of the source computer, thereby identifying files stored on the source computer and collecting information associated with the identified files. The collected information is then transmitted from the source computer to the cataloging computer. The transmitted collected information is then processed at the cataloging computer to generate a catalog of file references. The collected information may be a digital signature of each identified file, information from meta data for the file such as file names or other directory entries, or any form of object reference. The collected information may be transmitted responsive to a request from the cataloging computer or at the initiation of each source.
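By way of a hedged illustration, the source-computer program that accesses the file system and collects information from directory entries could resemble the following sketch. The function name collect_file_info and the choice of recorded fields (relative path, size, modification time) are assumptions made for this example only.

```python
import os


def collect_file_info(root):
    """Walk the file system under `root` and collect directory-entry data.

    Returns one record per file; only information about the file is
    collected, never the file contents.
    """
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            records.append({
                # Relative path with forward slashes serves as the file reference.
                "reference": os.path.relpath(path, root).replace(os.sep, "/"),
                "size": info.st_size,
                "modified": int(info.st_mtime),
            })
    return sorted(records, key=lambda r: r["reference"])
```

The resulting list is what would be transmitted to the cataloging computer, either on request or at the initiation of the source computer.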
According to a further aspect of the present invention, a method constructs a searchable catalog of object references on a cataloging computer on a computer network. The computer network further includes a plurality of interconnected source computers. The method includes running on each source computer a program that accesses a file system structure of the source computer and creates a data set specifying the file system structure. At the initiation of each source computer the data set is transmitted from the source computer to the cataloging computer. The transmitted data sets are then processed at the cataloging computer to generate the catalog of object references. The file system structure may include a plurality of directory entries for files stored on the corresponding source computer.
According to another aspect of the present invention, a method constructs a searchable catalog of object references from objects stored on a network. The network includes a plurality of interconnected computers with one computer storing the catalog and being designated a cataloging site and each of the other computers storing a plurality of objects and being designated a source site. The method includes running on each source site a program that assembles meta data about objects stored on the source site. The assembled meta data is then transmitted from each source site to the cataloging site at a scheduled time that is a function of resource availability on one or both of the source site and the cataloging site. The transmitted data is then processed at the cataloging site to generate a catalog of object references. According to another aspect of the present invention, the source site program may be scheduled to run at times that are determined by resource availability on the source site and the assembled meta data may be transmitted independent of resource availability. The assembled meta data may be differential meta data indicating changes in current meta data relative to previous meta data.
According to a further aspect of the present invention, a method constructs a searchable catalog of rankings from objects stored on a network. The network includes a plurality of interconnected computers with a cataloging site and a plurality of source sites as previously described. The method includes running on each source site a program that assembles data relating to objects stored on the source site. The assembled data is then ranked, in whole or in part, as a function of a set of ranking rules and rankings are then assigned to the assembled data. The rankings are transmitted from each source site to the cataloging site, and aggregated at the cataloging site to generate the catalog of rankings. Each ranking may have a value that is a function of human input data about one or more objects with which the ranking is associated. The assembled data may include data from the content of objects stored at the source site as well as meta data relating to objects stored at the source site.
According to another aspect of the present invention, a method rates objects stored at a site on a network and constructs a searchable catalog of ratings. The network includes a plurality of interconnected computers with access to the objects. The method includes running on the host a program that processes objects stored on the site and assembles values found in at least one of the objects for comparison to a list of rating values. A rating is then generated for each object by relating the values found in the object to the list of rating values. The ratings are then aggregated to generate the catalog of ratings. In the list of rating values, each rating value may be a word or a pattern in other data which is recognized. In generating a rating for each object, the values found in the objects may be compared to a list of human input rating values supplied by an owner of the site and to a second list of human input rating values supplied by a host of the site.
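One possible reading of the ranking aspect above is sketched below. The ranking rule used here (a weighted count of terms) is an assumption chosen for brevity, not a rule taken from this description, and the names rank_objects and aggregate are hypothetical.

```python
from collections import Counter


def rank_objects(objects, rules):
    """Runs on a source site: assign a ranking to each object.

    `objects` maps an object reference to its text; `rules` maps a term to
    a weight. The ranking is the weighted count of rule terms in the text.
    """
    rankings = {}
    for ref, text in objects.items():
        counts = Counter(text.lower().split())
        rankings[ref] = sum(weight * counts[term] for term, weight in rules.items())
    return rankings


def aggregate(rankings_by_site):
    """Runs on the cataloging site: merge per-site rankings into one catalog,
    ordered from highest to lowest ranking (ties broken by reference)."""
    merged = {}
    for rankings in rankings_by_site:
        merged.update(rankings)
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
```

Only the rankings and references are transmitted in this sketch, so the cataloging site never needs to parse the objects themselves.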
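The comparison of values found in an object against two lists of rating values might look like the following sketch. It assumes, purely for illustration, that each rating value is a word mapped to a numeric level and that the rating of an object is the highest level matched in either list; the function name rate_object is hypothetical.

```python
import re


def rate_object(text, owner_values, host_values):
    """Rate one object by relating the words it contains to two lists of
    rating values: one supplied by the site owner, one by the site host.

    Each list maps a word to a numeric level (an assumption of this sketch).
    Returns the highest level matched, or 0 when nothing matches.
    """
    words = set(re.findall(r"[a-z']+", text.lower()))
    matched = [
        level
        for values in (owner_values, host_values)
        for word, level in values.items()
        if word in words
    ]
    return max(matched, default=0)
```

For example, with an owner list of {"violence": 2} and a host list of {"gambling": 3}, an object whose text mentions gambling but not violence would receive a rating of 3.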
A further aspect of the present invention is a method of monitoring objects stored on a network to detect changes in one or more of the objects. The network includes a plurality of interconnected computers with one computer assembling the results of monitoring and being designated a central site. Each of the other computers stores a plurality of objects and is designated a source site. The method includes running on each source site a program that assembles meta data about objects stored on the source site. The assembled meta data is compared on the source site to meta data previously assembled to identify changes in the meta data. Portions of the assembled meta data that have changed are then transmitted from each source site to the central site. The changes may be transmitted according to a predetermined schedule, and the meta data may include object references and/or a digital signature for each object.
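The source-site comparison of newly assembled meta data with previously assembled meta data can be sketched as below. The function name diff_meta and the change labels "add", "update", and "delete" are assumptions made for this example; the meta data for each object is represented as an arbitrary comparable value.

```python
def diff_meta(previous, current):
    """Compare current meta data with the previous snapshot on the source site.

    Both arguments map an object reference to its meta data. Returns only
    the changed portions, as (change, reference, meta data) tuples sorted
    by reference, ready to be transmitted to the central site.
    """
    changes = []
    for ref, meta in current.items():
        if ref not in previous:
            changes.append(("add", ref, meta))
        elif previous[ref] != meta:
            changes.append(("update", ref, meta))
    for ref in previous:
        if ref not in current:
            changes.append(("delete", ref, None))
    return sorted(changes, key=lambda change: change[1])
```

Unchanged objects produce no entries, so the amount of data sent to the central site is proportional to the number of changes rather than to the size of the source site.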
Another aspect of the present invention is a method for monitoring objects stored on a network to detect changes in one or more of the objects. The network includes a plurality of interconnected computers with one computer assembling the results of the monitoring and being designated a central site and each of the other computers storing a plurality of objects and being designated a source site. The method includes running on each source site a program that processes objects stored on the source site and generates for each processed object a digital signature reflecting data of the object where the data consists of the contents or meta data of the object. The generated signatures are transmitted from each source site to the central site. Each transmitted signature is then compared at the central site to a previously generated signature for the object from which the signature was derived to determine whether the data of the object has changed. Either the source site or the central site may initiate running of the program on the source site. The objects on the source site that are monitored may be accessible only from the source site and not accessible by other sites on the network. The digital signature for each object may consist of information copied from a directory entry for the object, or may consist of a value generated as a function of the contents of the object or any other set of information that reflects changes to the object. This method may be implemented with traditional spidering so that only objects which have changed need to be respidered and parsed.
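As a minimal sketch of the signature-based monitoring just described, the example below generates a signature as a function of an object's contents and compares it at the central site with the previously received signature. The use of SHA-256 is an assumption of this sketch, since the description does not name a particular signature function, and the names sign and CentralSite are hypothetical.

```python
import hashlib


def sign(contents: bytes) -> str:
    """Runs on a source site: derive a signature from an object's contents.

    SHA-256 is assumed here; any value that reflects changes would serve.
    """
    return hashlib.sha256(contents).hexdigest()


class CentralSite:
    """Keeps the last signature received for each object reference."""

    def __init__(self):
        self.known = {}  # object reference -> last signature

    def receive(self, reference, signature):
        """Return True when the object is new or its data has changed."""
        changed = self.known.get(reference) != signature
        self.known[reference] = signature
        return changed
```

Because only the signature crosses the network, this approach also works for objects that are not themselves accessible from other sites.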
Another aspect of the present invention is a method of constructing a catalog of object references to objects on a site in a network having a plurality of sites. The objects on the site are not accessible to other sites in the network. The method includes running on the site a program that generates meta data from the contents of objects on the site and assembling the meta data to construct the catalog of object references. The catalog may be stored on the same site as the objects, or the catalog may be assembled on a central site that is not the same site where the objects are located. The object references may remain in the catalog even though the object relating to a particular object reference no longer exists on the corresponding site in the network.
According to a further aspect of the present invention, each of the previously recited methods is performed by a program contained on a computer-readable medium, such as a CDROM. The program may also be contained in a computer-readable data transmission medium that may be transferred over a network, such as the Internet. The data transmission medium may, for example, be a carrier signal that has been modulated to contain information corresponding to the program.