The present invention relates generally to the fields of computerized publishing and knowledge management, and more particularly to Web crawler applications used, e.g., by Internet search engines. The invention, however, is not limited to use in a Web crawler. On the contrary, the invention could be used in a mail server, directory service, or any system requiring indexing or one-way replication of a document store.
There has recently been a tremendous growth in the number of computers connected to the Internet. A client computer connected to the Internet can download digital information from server computers. Client application software typically accepts commands from a user and obtains data and services by sending requests to server applications running on the server computers. A number of protocols are used to exchange commands and data between computers connected to the Internet. The protocols include the File Transfer Protocol (FTP), the Hyper Text Transfer Protocol (HTTP), the Simple Mail Transfer Protocol (SMTP), and the Gopher document protocol.
The HTTP protocol is used to access data on the World Wide Web, often referred to as “the Web.” The Web is an information service on the Internet providing documents and links between documents. It is made up of numerous Web sites located around the world that maintain and distribute electronic documents. A Web site may use one or more Web server computers that store and distribute documents in a number of formats, including the Hyper Text Markup Language (HTML). An HTML document contains text and metadata (commands providing formatting information), as well as embedded links that reference other data or documents. The referenced documents may represent text, graphics, or video.
A Web browser is a client application or, preferably, an integrated operating system utility that communicates with server computers via FTP, HTTP and Gopher protocols. Web browsers receive electronic documents from the network and present them to a user.
An intranet is a local area network containing Web servers and client computers operating in a manner similar to the World Wide Web described above. Typically, all of the computers on an intranet are contained within a company or organization.
The term “search engine” is often used generically to describe both true search engines and directories, although they are not the same. Search engines typically create their listings automatically by “crawling” the Web. A directory, on the other hand, depends on humans for its listings, i.e., a person submits a short description for an entire site or editors write a description for sites they review. The present invention is particularly suited (although not necessarily limited) for use in a search engine of the type that gathers information automatically, i.e., by “crawling” the Web.
Search engines typically include a “crawler” (also called a “spider” or “bot”) that visits a Web page, reads it, and then follows links to other pages within the site. The crawler returns to the site on a regular basis to look for changes. Everything the crawler finds goes into an index, which is another part of the search engine. The index is like a file or container holding a copy of every Web page that the crawler finds. If a Web page changes, then the index is updated with new information. The search engine software, which is yet another part of the search engine, is a program that sifts through the pages recorded in the index to find documents fulfilling a search query submitted by a user. The search engine software will typically rank the matches in accordance with their relevance.
Once it is given a set of start addresses and restriction rules, a crawler can retrieve the documents at the start addresses and then recursively follow the links in those documents, retrieving each linked document whose address passes the restriction rules. The primary application of the crawler is to build an index of a set of documents, so that the index can be searched by end-users who want to locate documents that match certain search criteria.
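The traversal described above can be sketched as follows. This is a minimal illustration only, not the crawler of the invention: the in-memory LINKS table stands in for a real document store, and the names crawl, passes_rules, get_links, and same_site_only are hypothetical. A breadth-first order is assumed here; the text does not specify a traversal order.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List

# Hypothetical in-memory "Web": each address maps to the links found in that document.
LINKS: Dict[str, List[str]] = {
    "http://example.com/": ["http://example.com/a", "http://other.example/x"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/"],
    "http://example.com/b": [],
    "http://other.example/x": ["http://other.example/y"],
}


def crawl(start_addresses: Iterable[str],
          passes_rules: Callable[[str], bool],
          get_links: Callable[[str], List[str]]) -> List[str]:
    """Visit documents breadth-first, following only links that pass the rules."""
    visited: List[str] = []
    seen = set()
    queue = deque()
    for address in start_addresses:
        if address not in seen and passes_rules(address):
            seen.add(address)
            queue.append(address)
    while queue:
        address = queue.popleft()
        visited.append(address)
        for link in get_links(address):
            # Skip addresses already queued and addresses the rules exclude.
            if link not in seen and passes_rules(link):
                seen.add(link)
                queue.append(link)
    return visited


def same_site_only(address: str) -> bool:
    """Example restriction rule (an assumption): stay within one site."""
    return address.startswith("http://example.com/")


order = crawl(["http://example.com/"], same_site_only,
              lambda address: LINKS.get(address, []))
print(order)
```

With this sample data the crawl stays within example.com and visits each of its three pages once, even though one page links back to the start address.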
A crawler can retrieve documents from different stores. Although the primary store is the Web, a crawler can retrieve documents from a mail store, database, or anything else that has textual content.
A shortcoming of Web crawlers and other applications required to access documents stored in one or more document stores is that resources are wasted on retrieving the documents from the store in order to determine whether the same document has already been processed or indexed. For example, a document must be fetched from a document store and filtered to obtain a hash value, and then the hash value must be compared to the hash values of previously processed documents to determine whether the new document is a replica of another document already represented in the index. There is a need for an improved method and system for identifying duplicate documents, and using this information to avoid unnecessarily retrieving and processing such duplicates. The present invention achieves this goal.
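The cost of the hash-based approach can be illustrated with a short sketch. Everything here is an assumption made for illustration: the STORE dictionary, the fetch and filter_text helpers, and the choice of SHA-256 (the text does not name a hash function). The point is that every document, including the duplicate, must be fetched and filtered before it can be recognized as a duplicate.

```python
import hashlib
import re
from typing import Dict, List

# Hypothetical document store: URL -> raw document bytes.
STORE: Dict[str, bytes] = {
    "http://example.com/docs/report.html": b"<html><body>Quarterly report</body></html>",
    "http://example.com/alias/report.html": b"<html><body>Quarterly report</body></html>",
    "http://example.com/docs/other.html": b"<html><body>Other page</body></html>",
}

fetch_count = 0  # counts how many documents were retrieved from the store


def fetch(url: str) -> bytes:
    """Retrieve the full document (the expensive step)."""
    global fetch_count
    fetch_count += 1
    return STORE[url]


def filter_text(raw: bytes) -> str:
    """Crude stand-in for a filter: strip tags and normalize whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw.decode("utf-8"))
    return " ".join(text.split())


def index_by_hash(urls: List[str]) -> Dict[str, str]:
    """Index each distinct document once, detecting duplicates by content hash."""
    seen_hashes: Dict[str, str] = {}  # hash value -> first URL seen with it
    index: Dict[str, str] = {}
    for url in urls:
        text = filter_text(fetch(url))  # fetched and filtered even if a duplicate
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # duplicate detected only after paying the full cost
        seen_hashes[digest] = url
        index[url] = text
    return index


index = index_by_hash(list(STORE))
print(len(index), "documents indexed after", fetch_count, "fetches")
```

In this sample, two documents are indexed but three are fetched; the third fetch is the wasted work the passage above describes.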
Further background information about Web crawlers is provided below, and may also be found in U.S. patent application Ser. No. 09/105,758, filed Jun. 26, 1998, “Method of Web Crawling Utilizing Crawl Numbers,” and U.S. patent application Ser. No. 09/107,227, filed Jun. 30, 1998, “Synchronizing Crawler With Notification Source.”
The present invention provides an improved way to access documents (including Web pages, file system documents, e-mail messages, etc.) stored in one or more document stores on a computer network. For example, the invention could be used in a Web crawler application, mail server, directory service, or any system requiring indexing or one-way replication of a document store. The invention is particularly directed to a method and system for identifying duplicate documents in a document store, and using this information to avoid unnecessarily retrieving and processing such duplicates.
A Web crawler application in accordance with the present invention takes advantage of a document store's ability to provide a content identifier (CID) having a value that is either a unique function of the physical storage location of a data object or document, such as a Web page, or, alternatively, a unique function of the content of the document (i.e., identical documents stored in different locations would have equal CIDs). According to the invention, the crawler first tries to fetch the CID for a document. If the CID attribute is not supported by the document store, the crawler processes the document in accordance with a prior method, e.g., by fetching the document, filtering it to obtain a hash value, and committing the document to an index if the hash value is not present in a History Table (or a separate table associated with the History Table). On the other hand, if the CID is available from the document store, it is fetched by the crawler. The crawler then determines whether the CID is present in the History Table, which indicates whether the document in question has already been indexed under a different URL. If the CID is present, indicating that the document has already been indexed, the new URL is placed in the History Table but the document itself is not retrieved from the document store, nor is it filtered again to obtain a CID. If the CID is not present in the History Table or separate CID table, the full document is retrieved and indexed.
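The decision flow just described can be sketched as follows. This is an illustrative outline under stated assumptions, not the implementation of the invention: DocumentStore, get_cid, fetch, and process are hypothetical names, the History Table is modeled as a dictionary from a key to the list of URLs recorded under it, filtering is omitted for brevity, and SHA-256 stands in for the unspecified hash of the prior method.

```python
import hashlib
from typing import Dict, List, Optional, Tuple


class DocumentStore:
    """Hypothetical store: URL -> (CID, content). CIDs may be unsupported."""

    def __init__(self, docs: Dict[str, Tuple[str, str]], supports_cid: bool = True):
        self.docs = docs
        self.supports_cid = supports_cid
        self.fetches = 0  # number of full-document retrievals

    def get_cid(self, url: str) -> Optional[str]:
        """Cheap lookup; returns None when the CID attribute is not supported."""
        return self.docs[url][0] if self.supports_cid else None

    def fetch(self, url: str) -> str:
        self.fetches += 1
        return self.docs[url][1]


def process(url: str, store: DocumentStore,
            history: Dict[str, List[str]], index: Dict[str, str]) -> str:
    """Handle one URL: try the CID first, fall back to hashing the content."""
    cid = store.get_cid(url)
    if cid is not None:
        key = "cid:" + cid
        if key in history:
            history[key].append(url)  # record the new URL, skip the fetch
            return "duplicate, not fetched"
        content = store.fetch(url)
    else:
        content = store.fetch(url)  # prior method: fetch before comparing
        key = "hash:" + hashlib.sha256(content.encode("utf-8")).hexdigest()
        if key in history:
            history[key].append(url)
            return "duplicate, fetched"
    history[key] = [url]
    index[url] = content
    return "indexed"


store = DocumentStore({
    "http://example.com/docs/report.html": ("cid-1", "Quarterly report"),
    "http://example.com/alias/report.html": ("cid-1", "Quarterly report"),
    "http://example.com/docs/other.html": ("cid-2", "Other page"),
})
history: Dict[str, List[str]] = {}
index: Dict[str, str] = {}
results = [process(url, store, history, index) for url in store.docs]
print(results, "fetches:", store.fetches)
```

In this sample the alias URL is recorded in the history under the existing CID without the document being retrieved, so only two of the three documents are fetched. Constructing the store with supports_cid=False exercises the fallback path, in which all three are fetched.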
Note that, when the CID is a function of the physical location of the document, as in the exemplary implementation described below, it does not achieve better duplicate detection if the duplicate documents are located in different stores (e.g., different Web sites). However, it does solve the problem of locating duplicates within the same site, which is a very relevant problem for sites with multiple virtual directories, or mail stores. On the other hand, the present invention could be implemented such that duplicates at different storage locations (e.g., where a document is copied to another location and not changed) would have equal CIDs and thus would be identifiable as duplicates based on the CID property. Thus, for example, in the latter embodiment a unique CID would be generated whenever a document is modified and stored. If this document is copied elsewhere, but remains unmodified such that it keeps the same CID, then the present invention can be used to detect that duplicates are stored at different locations.
Preferably, the CID data structure will be an extension of a known globally unique ID (GUID). For example, whereas the GUID is a 16-byte number, the CID of the present invention may comprise a 16-byte GUID plus an additional 6-byte number.
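One possible layout for such a 22-byte identifier is sketched below. The class name ContentId, the field name counter, and the big-endian encoding of the 6-byte number are assumptions made for illustration; the text specifies only the sizes of the two parts.

```python
import uuid
from dataclasses import dataclass

GUID_SIZE = 16     # bytes in a GUID
COUNTER_SIZE = 6   # bytes in the additional number


@dataclass(frozen=True)
class ContentId:
    """Hypothetical CID: a 16-byte GUID followed by a 6-byte number."""
    guid: uuid.UUID
    counter: int

    def __post_init__(self) -> None:
        # The additional number must fit in 6 bytes (48 bits).
        if not 0 <= self.counter < 2 ** (8 * COUNTER_SIZE):
            raise ValueError("counter must fit in 6 bytes")

    def to_bytes(self) -> bytes:
        """Serialize as GUID bytes followed by the big-endian counter (assumed order)."""
        return self.guid.bytes + self.counter.to_bytes(COUNTER_SIZE, "big")

    @classmethod
    def from_bytes(cls, raw: bytes) -> "ContentId":
        if len(raw) != GUID_SIZE + COUNTER_SIZE:
            raise ValueError("a CID must be exactly 22 bytes")
        return cls(uuid.UUID(bytes=raw[:GUID_SIZE]),
                   int.from_bytes(raw[GUID_SIZE:], "big"))


cid = ContentId(uuid.UUID(int=1), 5)
raw = cid.to_bytes()
print(len(raw), ContentId.from_bytes(raw) == cid)
```

Because the dataclass is frozen, instances are hashable and compare by value, so they could serve directly as keys in a table of previously seen CIDs.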
Other features of the present invention are described below.