The present invention relates generally to the fields of computerized publishing and knowledge management, and more particularly to Web crawler applications used, e.g., by Internet search engines. The invention, however, is not limited to use in a Web crawler. On the contrary, the invention could be used in a mail server, directory service, or any system requiring indexing or one-way replication of a document store.
There has recently been a tremendous growth in the number of computers connected to the Internet. A client computer connected to the Internet can download digital information from server computers. Client application software typically accepts commands from a user and obtains data and services by sending requests to server applications running on the server computers. A number of protocols are used to exchange commands and data between computers connected to the Internet. The protocols include the File Transfer Protocol (FTP), the Hyper Text Transfer Protocol (HTTP), the Simple Mail Transfer Protocol (SMTP), and the Gopher document protocol.
The HTTP protocol is used to access data on the World Wide Web, often referred to as “the Web.” The Web is an information service on the Internet providing documents and links between documents. It is made up of numerous Web sites located around the world that maintain and distribute electronic documents. A Web site may use one or more Web server computers that store and distribute documents in a number of formats, including the Hyper Text Markup Language (HTML). An HTML document contains text and metadata (commands providing formatting information), as well as embedded links that reference other data or documents. The referenced documents may represent text, graphics, or video.
A Web browser is a client application or, preferably, an integrated operating system utility that communicates with server computers via FTP, HTTP and Gopher protocols. Web browsers receive electronic documents from the network and present them to a user.
An intranet is a local area network containing Web servers and client computers operating in a manner similar to the World Wide Web described above. Typically, all of the computers on an intranet are contained within a company or organization.
The term “search engine” is often used generically to describe both true search engines and directories, although they are not the same. Search engines typically create their listings automatically by “crawling” the Web. A directory, on the other hand, depends on humans for its listings, i.e., a person submits a short description for an entire site or editors write a description for sites they review. The present invention is particularly suited for (although not necessarily limited to) use in a search engine of the type that gathers information automatically, i.e., by “crawling” the Web.
Search engines typically include a “crawler” (also called a “spider” or “bot”) that visits a Web page, reads it, and then follows links to other pages within the site. The crawler returns to the site on a regular basis to look for changes. Everything the crawler finds goes into an index, which is another part of the search engine. The index may be viewed as a file or container holding a copy of every Web page that the crawler finds. The primary purpose of the index is to provide a way to quickly look up a document URL based on words specified in a query. If a Web page changes, then the index is updated with new information. The search engine software, which is yet another part of the search engine, is a program that sifts through the pages recorded in the index to find documents fulfilling a search query submitted by a user. The search engine software will typically rank the matches in accordance with their relevance.
Once it is given a set of start addresses and restriction rules, a crawler can retrieve the documents that correspond to the start addresses and then recursively follow all links from those documents, limiting the URL space by filtering out URLs that do not pass the specified crawl restriction rules. The primary application of the crawler is to build an index of a set of documents, so that the index can be searched by end users who want to locate documents that match certain search criteria.
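By way of illustration only, the recursive traversal described above might be sketched as follows. The function names (`crawl`, `fetch_links`, `same_site`), the in-memory `LINKS` graph, and the example URLs are all assumptions introduced for this sketch; a real crawler would fetch and parse documents over the network.

```python
from collections import deque
from typing import Callable, Iterable, List

def crawl(start_urls: Iterable[str],
          fetch_links: Callable[[str], List[str]],
          allowed: Callable[[str], bool]) -> List[str]:
    """Visit every URL reachable from the start addresses that passes the rule."""
    seen = set()
    queue = deque()
    for url in start_urls:
        if allowed(url) and url not in seen:
            seen.add(url)
            queue.append(url)
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        # Follow each link once, dropping URLs outside the permitted URL space.
        for link in fetch_links(url):
            if link not in seen and allowed(link):
                seen.add(link)
                queue.append(link)
    return visited

# Hypothetical link graph standing in for a real document store.
LINKS = {
    "http://a.example/": ["http://a.example/x", "http://b.example/"],
    "http://a.example/x": ["http://a.example/", "http://a.example/y"],
    "http://a.example/y": [],
    "http://b.example/": ["http://b.example/z"],
}

def fetch_links(url: str) -> List[str]:
    return LINKS.get(url, [])

def same_site(url: str) -> bool:
    # Example restriction rule: stay within a single site.
    return url.startswith("http://a.example/")

crawled = crawl(["http://a.example/"], fetch_links, same_site)
print(crawled)
```

In this sketch the restriction rule is a single predicate; a set of rules could be combined into one predicate without changing the traversal.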
A crawler can retrieve documents from different stores. Although the primary store is the Web, a crawler can retrieve documents from a mail store, database, or anything else that has textual content (but textual content is relevant only for processing of a document for the purpose of indexing, since the crawler itself is not concerned with what type of document is being crawled).
Crawls typically are performed periodically to update the indexes with changed documents. Crawlers usually have no knowledge of the document store specifics. The only thing they can rely on is the last modified timestamp of the document, which is standard for most document stores, including HTTP servers, file servers, mail servers and databases. A problem with this approach is that, to ascertain the increment of the document set, the crawler must ask the corresponding server for each document whether the document's timestamp has changed. Since the percentage of documents that are unchanged between crawls is typically very high, it would be beneficial to minimize the number of requests the crawler makes to the document server to obtain the “increment” of the document set relative to the set of documents received during the previous crawl (i.e., to obtain new, modified and deleted documents). The present invention achieves this goal.
Further background information about Web crawlers is provided below, and may also be found in U.S. pending patent application Ser. No. 09/105,758, filed Jun. 26, 1998, “Method of Web Crawling Utilizing Crawl Numbers,” and U.S. patent application Ser. No. 09/107,227, filed Jun. 30, 1998, and now U.S. Pat. No. 6,483,794 “Synchronizing Crawler With Notification Source.”
This invention provides an improved mechanism for maintaining a document store in a manner that facilitates an efficient determination of whether and how the document store has been “incremented” or modified from a prior state. For example, the invention could be used in a Web crawler application, mail server, directory service, or any system requiring indexing or one-way replication of a document store. The invention is particularly directed to a method and system for identifying documents in a document store that have changed, are new, or have been deleted.
The present invention utilizes a document store's ability to provide extra properties for each document folder. Such extra properties include, e.g., local commit time (LCT), maximum local commit time (MLCT) and deleted documents count (DDC). The crawler keeps track of local commit times per document URL. For folders, the crawler keeps the greater of the LCT and MLCT, as well as the DDC. It also keeps track of which URLs correspond to folders as opposed to documents, and for each URL it keeps the fact that a document was produced by the store that supports these extended properties (LCT, MLCT, and DDC). In an exemplary application of the present invention, a Web crawler creates an index of documents in a document store on a computer network, which may be an intranet, LAN or the Internet. In an initial crawl, the crawler creates a first full index for the document store. The first full crawl is based on a set of predefined “seed” URLs and crawl restrictions, and involves recursively retrieving each folder/document directly or indirectly linked to the seed URLs. In the process of creating the first full index, the crawler creates a History Table containing a list of URLs for each folder and document found in the first full crawl. The History Table also includes an LCT for each document and a DDC and LCT or MLCT for each folder. Flags are also included in the History Table to indicate which URLs have a corresponding DDC (i.e., which are folders) and which URLs have a parent with a corresponding DDC. Thereafter, in an incremental crawl, the crawler proceeds in accordance with the History Table (e.g., by starting at a first URL corresponding to a folder, as identified by the flag or bit mask, and continuing down through each folder URL) and determines (1) whether the DDC for that URL has changed and (2) whether the MLCT or LCT is more recent than the corresponding value in the History Table.
If the DDC has changed, the crawler obtains a full list of items (URLs) in that folder, and compares the list with the URLs in the History Table to identify the deleted documents. The deleted documents are then deleted from the History Table and index. If the MLCT is more recent, the crawler queries the document store for the URLs of linked documents (i.e., linked to that folder) having an LCT more recent than the MLCT or LCT in the History Table for the folder. The History Table and index are then updated accordingly to reflect the changes to the document store.
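By way of illustration only, the full and incremental crawls described above might be sketched as follows. Everything named here (`Store`, `Folder`, `HistoryEntry`, `full_crawl`, `incremental_crawl`, the example folder URLs) is an assumption made for the sketch. It further assumes a single level of folders, integer commit times, and that the folder entry stores the MLCT only rather than the greater of the LCT and MLCT. The `requests` counter is included only to show how few store queries the incremental crawl needs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple

@dataclass
class Folder:
    docs: Dict[str, int] = field(default_factory=dict)  # document URL -> LCT
    mlct: int = 0  # maximum local commit time
    ddc: int = 0   # deleted documents count

class Store:
    """Hypothetical in-memory document store exposing LCT, MLCT and DDC."""
    def __init__(self) -> None:
        self.folders: Dict[str, Folder] = {}
        self.clock = 0
        self.requests = 0  # number of queries issued by the crawler

    def put(self, folder: str, doc: str) -> None:  # create or modify a document
        self.clock += 1
        f = self.folders.setdefault(folder, Folder())
        f.docs[doc] = self.clock
        f.mlct = self.clock

    def delete(self, folder: str, doc: str) -> None:
        del self.folders[folder].docs[doc]
        self.folders[folder].ddc += 1

    def folder_props(self, folder: str) -> Tuple[int, int]:
        self.requests += 1
        return self.folders[folder].mlct, self.folders[folder].ddc

    def list_docs(self, folder: str) -> Set[str]:
        self.requests += 1
        return set(self.folders[folder].docs)

    def docs_newer_than(self, folder: str, t: int) -> Dict[str, int]:
        self.requests += 1
        return {u: c for u, c in self.folders[folder].docs.items() if c > t}

@dataclass
class HistoryEntry:
    commit_time: int              # LCT for a document, MLCT for a folder
    is_folder: bool = False       # flag: this URL has a DDC
    ddc: int = 0
    parent: Optional[str] = None  # folder URL for documents

def full_crawl(store: Store, seeds: List[str]) -> Dict[str, HistoryEntry]:
    history: Dict[str, HistoryEntry] = {}
    for folder in seeds:
        mlct, ddc = store.folder_props(folder)
        history[folder] = HistoryEntry(mlct, is_folder=True, ddc=ddc)
        for url, lct in store.docs_newer_than(folder, 0).items():
            history[url] = HistoryEntry(lct, parent=folder)
    return history

def incremental_crawl(store: Store, history: Dict[str, HistoryEntry]):
    changed, deleted = [], []
    for folder in [u for u, e in history.items() if e.is_folder]:
        entry = history[folder]
        mlct, ddc = store.folder_props(folder)
        if ddc != entry.ddc:  # something was deleted: diff against the full listing
            current = store.list_docs(folder)
            for url in [u for u, e in history.items() if e.parent == folder]:
                if url not in current:
                    del history[url]
                    deleted.append(url)
            entry.ddc = ddc
        if mlct > entry.commit_time:  # something is new or modified
            newer = store.docs_newer_than(folder, entry.commit_time)
            for url, lct in newer.items():
                history[url] = HistoryEntry(lct, parent=folder)
                changed.append(url)
            entry.commit_time = mlct
    return sorted(changed), sorted(deleted)

store = Store()
store.put("/f1", "/f1/a")
store.put("/f1", "/f1/b")
store.put("/f2", "/f2/c")
history = full_crawl(store, ["/f1", "/f2"])

store.requests = 0
store.put("/f1", "/f1/a")     # modify an existing document
store.put("/f1", "/f1/d")     # add a new document
store.delete("/f1", "/f1/b")  # delete a document
changed, deleted = incremental_crawl(store, history)
print(changed, deleted, store.requests)
```

In this example the unchanged folder costs one query and the changed folder costs three, regardless of how many unchanged documents each folder holds.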
In an exemplary implementation of the invention, if the document store does not support the LCT, MLCT and DDC properties, the crawler falls back to performing a normal incremental crawl, checking every document's timestamp. Also, a crawl can cover multiple stores, some of them supporting the extended properties, and some not. This means that the crawl history can be a mix of URLs from stores supporting or not supporting the extended properties. When doing an incremental crawl, the crawler iterates the history, and, for URLs produced from a store that supports extended properties, it only picks folders and performs the above-described procedure. Otherwise, if the store does not support the extended properties, it processes all URLs and checks their respective timestamps.
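By way of illustration only, the selection step for a mixed crawl history might be sketched as follows. The names `Entry`, `plan_incremental_crawl`, and the example URLs are assumptions made for the sketch. The two flags correspond to the per-URL facts the crawler is described as keeping: whether the URL is a folder, and whether it came from a store supporting the extended properties.

```python
from typing import Dict, List, NamedTuple

class Entry(NamedTuple):
    is_folder: bool
    extended: bool  # True if the URL came from a store supporting LCT, MLCT and DDC

def plan_incremental_crawl(history: Dict[str, Entry]) -> Dict[str, List[str]]:
    """Split the history into folder checks and per-URL timestamp checks."""
    plan: Dict[str, List[str]] = {"folder_check": [], "timestamp_check": []}
    for url, entry in history.items():
        if entry.extended:
            # Only folders are visited; their documents are covered by DDC and MLCT.
            if entry.is_folder:
                plan["folder_check"].append(url)
        else:
            # Fallback: every URL from such a store has its timestamp checked.
            plan["timestamp_check"].append(url)
    return plan

# Hypothetical history mixing two kinds of store.
history = {
    "store://mail/inbox": Entry(is_folder=True, extended=True),
    "store://mail/inbox/1": Entry(is_folder=False, extended=True),
    "store://mail/inbox/2": Entry(is_folder=False, extended=True),
    "http://web.example/": Entry(is_folder=False, extended=False),
    "http://web.example/p": Entry(is_folder=False, extended=False),
}
plan = plan_incremental_crawl(history)
print(plan)
```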
The invention avoids the need to check the timestamp for each and every document in the document store to identify changes to the document store. This dramatically improves the efficiency of incremental crawls and like processes that are used to manage document stores.
Other features of the present invention are described below.