The present invention relates to software networks and, in particular, to methods and systems for retrieving data from network sites.
In recent years, there has been a tremendous proliferation of computers connected to a global network known as the Internet. A "client" computer connected to the Internet can download digital information from "server" computers connected to the Internet. Client application software executing on client computers typically accepts commands from a user and obtains data and services by sending requests to server applications running on server computers connected to the Internet. A number of protocols are used to exchange commands and data between computers connected to the Internet. The protocols include the File Transfer Protocol (FTP), the Hypertext Transfer Protocol (HTTP), the Simple Mail Transfer Protocol (SMTP), and the "Gopher" document protocol.
The HTTP protocol is used to access data on the World Wide Web, often referred to as "the Web." The World Wide Web is an information service on the Internet providing documents and links between documents. The World Wide Web is made up of numerous Web sites located around the world that maintain and distribute electronic documents. A Web site may use one or more Web server computers that store and distribute documents in one of a number of formats including the Hypertext Markup Language (HTML). An HTML document contains text and metadata such as commands providing formatting information. HTML documents also include embedded "links" that reference other data or documents located on any Web server computer. The referenced documents may represent text, graphics, or video in respective formats.
A Web browser is a client application or operating system utility that communicates with server computers via the FTP, HTTP, and Gopher protocols. Web browsers receive electronic documents from the network and present them to a user. Internet Explorer, available from Microsoft Corporation, Redmond, Wash., is an example of a popular Web browser application.
An intranet is a local area network containing server and client computers operating in a manner similar to the World Wide Web described above. Typically, all of the computers on an intranet are contained within a company or organization.
Web crawlers are computer programs that "crawl" the World Wide Web in search of documents to retrieve. A Web crawler processes the received data, preparing the data to be subsequently processed by other programs. For example, a Web crawler may use the retrieved data to create an index of documents available over the Internet or an intranet. A "search engine" can later use the index to locate electronic documents that satisfy specified search criteria. However, in order to keep the index current, the Web crawler must periodically return to documents that it has previously retrieved and update the index to reflect any changes made to those documents. The interval between the time a document is revised on a Web server and the time that document is revisited by the Web crawler produces a latency in the index. This latency means that the index may inaccurately reflect a Web document because the document may be modified between the time that the Web document is retrieved and the time that the Web crawler revisits that document. In order to keep the index current, i.e., keep the latency of the index low, the Web crawler must regularly retrieve Web documents that have been changed.
While all such documents in an index could be checked for changes at regular intervals, this is a time-consuming and somewhat inaccurate process. It is inaccurate because Web documents could be modified well before the Web crawler revisits the document to check for changes. It is more desirable to have a mechanism by which a Web crawler returns to a Web document that it has previously retrieved only when there has been a change made to that document. Preferably, such a mechanism would cause the Web crawler to revisit a document only when it has been notified that the document has been changed or when a mechanism monitoring the document has informed the Web crawler that it has experienced a discontinuity. The present invention is directed to providing such a mechanism.
In accordance with the present invention, a mechanism is provided for maintaining the synchronization between data stored on a computer network and a copy of that data stored in a local data store with minimal latency. The mechanism of the invention initially creates the data store by performing a "crawl" (recursively following links, discussed below). The mechanism of the invention then maintains a synchronization between the data stored on the computer network and the copy of the data stored in the local data store by accepting direct notifications, from notification sources that monitor the data on the computer network, that the data has changed. The mechanism of the invention also enables the efficient reestablishment of this synchronization, when necessary, by leveraging the ability of the gatherer to incrementally crawl the data stored on the computer network.
In an actual embodiment of the invention, the gatherer is an enhanced Web crawler that has one or more configuration entities called gathering projects. Each gathering project has its own transaction log, history map, and crawl restriction rules that a gatherer process uses to "crawl" Web documents that are stored on a plurality of Web servers connected to the World Wide Web. When the gatherer process accesses a document, the gatherer process retrieves a copy of the content of the document, which may include data such as text, images, sound, and embedded properties. The data store preferably is an index that receives and stores the information contained in the copies of retrieved documents. As each Web document is processed, the document's URL and timestamp are stored in a persistent history map. The history map is used in subsequent initialization crawls to revisit documents previously crawled and to retrieve only those documents that have changed since the last time that the gatherer retrieved a copy of the document. The data store is initially created during a first crawl and is updated during subsequent initialization crawls and notification retrievals.
In accordance with further aspects of the present invention, the gatherer process continuously monitors for a notification message sent by a notification source that is registered or listed in the gatherer project. The notification source monitors all or part of the computer network previously crawled by the gatherer process during the first crawl or a subsequent initialization crawl. When the gatherer process receives a notification message from a notification source listed in the gatherer project, the gatherer process places the address of the electronic document contained in the notification message into a notification log. The gatherer process retrieves a copy of an electronic document from each of the addresses listed in the notification log when it is in its notification retrieval mode. The document copy retrieved pursuant to the notification message is then used to update the information associated with the document that is stored in the document data store. A plurality of notification sources can monitor documents and asynchronously send notification messages to the same gatherer process.
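The notification path described above can be sketched as follows. This is a minimal illustration only, not the embodiment's implementation: the class name `Gatherer`, its method names, and the `fetch` callable are hypothetical, and the data store is modeled as a plain dictionary.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, List, Set


@dataclass
class Gatherer:
    """Hypothetical gatherer that queues notified addresses for later retrieval."""

    # Identifiers of the notification sources listed in the gatherer project.
    registered_sources: Set[str]
    # Addresses received in notification messages, in arrival order.
    notification_log: Deque[str] = field(default_factory=deque)
    # Stand-in for the document data store (index): address -> document copy.
    data_store: Dict[str, str] = field(default_factory=dict)

    def on_notification(self, source_id: str, address: str) -> bool:
        """Log the address only if the message comes from a listed source."""
        if source_id not in self.registered_sources:
            return False
        self.notification_log.append(address)
        return True

    def run_notification_retrieval(self, fetch: Callable[[str], str]) -> List[str]:
        """Retrieve a copy of every logged document and update the data store."""
        retrieved = []
        while self.notification_log:
            address = self.notification_log.popleft()
            self.data_store[address] = fetch(address)
            retrieved.append(address)
        return retrieved
```

Because notifications are only appended to the log, several sources can report changes asynchronously while retrieval happens later, in a separate pass.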
In accordance with a still further aspect of the invention, the gatherer process maintains the synchronization between a notification source and the gatherer process by performing an initialization crawl whenever either a listed notification source or the gatherer process experiences a discontinuity such as a system shutdown or network disconnect. If the notification source experiences a discontinuity, the notification source requests that the gatherer process perform an initialization crawl by sending a message to the gatherer process. This initialization message is usually sent to the gatherer process soon after the notification source first starts to run (is instantiated). The gatherer process also performs an initialization crawl when it first begins to run (is instantiated).
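The two triggers for an initialization crawl can be illustrated with a short sketch. The names `GathererProcess`, `NotificationSource`, and `on_initialization_message` are hypothetical, and the crawl itself is passed in as an opaque callable; the sketch only shows when the crawl is requested, under the assumption that both components resynchronize at start-up.

```python
from typing import Callable, List


class GathererProcess:
    """Hypothetical gatherer that resynchronizes after any discontinuity."""

    def __init__(self, initialization_crawl: Callable[[], None]) -> None:
        self._initialization_crawl = initialization_crawl
        self.events: List[str] = []
        # The gatherer performs an initialization crawl when it first runs.
        self._resynchronize("gatherer start-up")

    def _resynchronize(self, reason: str) -> None:
        self.events.append(reason)
        self._initialization_crawl()

    def on_initialization_message(self, source_id: str) -> None:
        # A listed notification source reports that it was restarted.
        self._resynchronize(f"initialization requested by {source_id}")


class NotificationSource:
    """Hypothetical source that requests a crawl as soon as it is instantiated."""

    def __init__(self, source_id: str, gatherer: GathererProcess) -> None:
        self.source_id = source_id
        # Changes may have been missed while this source was not running.
        gatherer.on_initialization_message(source_id)
```

In this arrangement the notification source keeps no record of what it missed; it only signals that a gap occurred.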
The initialization crawl performed by the gatherer process of the present invention is an incremental crawl. The initialization crawl is "seeded" by copying the addresses listed in the history map to the transaction log that the gatherer process uses to retrieve the documents. The gatherer process then selectively retrieves the documents in the transaction log (listed in the history map) and any documents that may be referenced in the documents retrieved. Documents are selectively retrieved if they either are not listed in the history map (that is, they were discovered during the current crawl), or are listed in the history map and have an associated timestamp that is later than the timestamp stored for that document in the history map. The timestamp stored in the history map indicates the timestamp associated with the document the last time the document was retrieved by the gatherer process and is updated every time the document is retrieved. If the respective timestamps match, the current electronic document is considered to be unchanged from the last time the document was retrieved and fed to the data store (index) and is therefore not retrieved during the initialization crawl. Preferably, the comparison of timestamps is performed by sending a request to a server to transfer the current electronic document if the timestamp associated with the current electronic document is more recent than a timestamp included in the request.
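The incremental crawl described above might be sketched as follows. This is an illustration under stated assumptions rather than the embodiment's code: the function names, the `fetch_if_newer` callable, and its return shape are hypothetical. The `conditional_headers` helper shows one way such a timestamp-bearing request could be expressed, using the standard HTTP `If-Modified-Since` header; the text above does not name a specific header, so that choice is an assumption.

```python
from collections import deque
from email.utils import formatdate
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical fetch contract: returns None when the document is unchanged,
# otherwise (content, timestamp, addresses referenced by the document).
FetchResult = Optional[Tuple[str, float, List[str]]]
Fetch = Callable[[str, Optional[float]], FetchResult]


def conditional_headers(known_timestamp: float) -> Dict[str, str]:
    """Build a header asking the server to send the document only if newer."""
    return {"If-Modified-Since": formatdate(known_timestamp, usegmt=True)}


def initialization_crawl(
    history_map: Dict[str, float],
    data_store: Dict[str, str],
    fetch_if_newer: Fetch,
) -> List[str]:
    """Seed the transaction log from the history map and retrieve only
    documents that are new or carry a later timestamp."""
    transaction_log = deque(history_map)  # seed with previously crawled addresses
    seen = set(transaction_log)
    retrieved = []
    while transaction_log:
        address = transaction_log.popleft()
        # known is None for documents discovered during the current crawl.
        known = history_map.get(address)
        result = fetch_if_newer(address, known)
        if result is None:
            continue  # timestamps match: treated as unchanged, not retrieved
        content, timestamp, links = result
        data_store[address] = content      # update the data store (index)
        history_map[address] = timestamp   # record the new timestamp
        retrieved.append(address)
        for link in links:                 # follow references in retrieved documents
            if link not in seen:
                seen.add(link)
                transaction_log.append(link)
    return retrieved
```

Delegating the timestamp comparison to the server means an unchanged document costs one small request and no document transfer.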
When a document copy is retrieved, the copy of the electronic document is used to update the data store. This synchronizes the information associated with the document in the data store to current information associated with the electronic document as it is stored on the computer network. By performing this synchronization whenever the notification source or the gatherer starts up, any changes to the electronic documents that may have been missed by the notification source while it was not operating, or not received by the gatherer while the gatherer was not operating, are accounted for.
As will be readily appreciated from the foregoing description, the present invention relieves the gatherer process of the need to make periodic crawls to discover if already retrieved electronic documents have changed as long as the synchronization enabled by the mechanism of the invention is maintained or can be reestablished. Instead, the gatherer process retrieves only those documents that it has been notified have changed and only needs to perform an initialization crawl when a notification source or the gatherer has ceased to operate for a period of time. The ability of the gatherer process to perform an initialization crawl means that the notification source does not need to account for events that took place during its discontinuity; it need only send a message to the gatherer to reinitialize the synchronization by performing an initialization crawl to pick up any changes that the notification source might have missed. This allows for simplified notification sources, reduces the requirements and complexity of the notification source, and leverages the existing ability of the gatherer process to perform enumeration of the computer network.