The present invention relates to the field of network information software and, in particular, to methods and systems for retrieving data from network sites.
In recent years, there has been a tremendous proliferation of computers connected to a global network known as the Internet. A "client" computer connected to the Internet can download digital information from "server" computers connected to the Internet. Client application software executing on client computers typically accepts commands from a user and obtains data and services by sending requests to server applications running on server computers connected to the Internet. A number of protocols are used to exchange commands and data between computers connected to the Internet. The protocols include the File Transfer Protocol (FTP), the Hyper Text Transfer Protocol (HTTP), the Simple Mail Transfer Protocol (SMTP), and the "Gopher" document protocol.
The HTTP protocol is used to access data on the World Wide Web, often referred to as "the Web." The World Wide Web is an information service on the Internet providing documents and links between documents. The World Wide Web is made up of numerous Web sites located around the world that maintain and distribute electronic documents. A Web site may use one or more Web server computers that store and distribute documents in one of a number of formats including the Hyper Text Markup Language (HTML). An HTML document contains text and metadata or commands providing formatting information. HTML documents also include embedded "links" that reference other data or documents located on any Web server computers. The referenced documents may represent text, graphics, or video in respective formats.
A Web browser is a client application or operating system utility that communicates with server computers via FTP, HTTP, and Gopher protocols. Web browsers receive electronic documents from the network and present them to a user. Internet Explorer, available from Microsoft Corporation, of Redmond, Washington, is an example of a popular Web browser application.
An intranet is a local area network containing Web servers and client computers operating in a manner similar to the World Wide Web described above. Typically, all of the computers on an intranet are contained within a company or organization.
Web crawlers are computer programs that retrieve numerous electronic documents from one or more Web sites. A Web crawler processes the received data, preparing the data to be subsequently processed by other programs. For example, a Web crawler may use the retrieved data to create an index of documents available over the Internet or an intranet. A "search engine" can later use the index to locate electronic documents that satisfy specified criteria.
A user that performs a document search provides search parameters to limit the number of documents retrieved. For example, a user may submit a search request that includes a list of one or more words, and the search engine locates electronic documents that contain a specified combination of the words. A user may repeat a search after a period of time. When a search is repeated, the user may prefer to avoid locating documents that have been located by prior searches.
It is desirable to have a mechanism by which a user can request a search engine to return only documents that have changed in some substantive way since a prior search. Preferably, such a mechanism will provide a Web crawler with a way to retrieve only documents that may have changed since a previous Web crawl and then to determine if an actual, substantive change has been made to each such document. The mechanism would also preferably provide a way to mark the data retrieved from the document and stored in an index with an identifier that could be used in a search of the index to indicate when the Web crawler last found a substantive change to the document. The present invention is directed to providing such a mechanism.
In accordance with this invention, a system and computer-based method of retrieving data from a computer network are provided. In an actual embodiment of the present invention, the method includes performing a Web crawl by retrieving a set of electronic documents and subsequently retrieving additional electronic documents based on addresses specified within each electronic document. In a later Web crawl, electronic documents that have been modified subsequent to the previous Web crawl and electronic documents that were not retrieved during the previous Web crawl are retrieved. Electronic documents that were deleted since the previous Web crawl are detected. Each Web crawl is assigned a unique current crawl number. A crawl number modified is associated with and stored with the data from each electronic document retrieved during the Web crawl. The crawl number modified is set equal to the current crawl number when the document is first retrieved, or when it has previously been retrieved and has been found by the mechanism of the invention to have been modified in some substantive manner. In a subsequent search request, a crawl number can be retained as a search parameter and compared against the crawl number modified that is stored with the document data to determine if a document has been modified subsequent to the crawl number specified in the search.
In accordance with other aspects of this invention, each electronic document has a corresponding document address specification that provides information for locating the electronic document. During a Web crawl, document address specifications are used to retrieve copies of the corresponding electronic documents. Information from each electronic document retrieved during a Web crawl is stored in an index and associated with the corresponding document address specification and with a crawl number modified. If the retrieved document contains hyperlinks that include document address specifications of linked documents, these linked documents are also selectively retrieved during the Web crawl and processed in the manner described above.
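The link-following behavior described above can be illustrated with a minimal sketch. It is not the claimed implementation: the function name crawl, the fetch callback, and the (text, links) document shape are all hypothetical, and an in-memory dictionary stands in for the network.

```python
from collections import deque


def crawl(seed_addresses, fetch):
    """Retrieve documents breadth-first, following the links found in each one.

    `fetch` is a hypothetical callback: given a document address it returns
    a (text, links) tuple, or None if the document cannot be retrieved.
    """
    seen = set(seed_addresses)      # addresses already queued, to avoid loops
    queue = deque(seed_addresses)
    retrieved = {}                  # address -> text of the retrieved copy
    while queue:
        address = queue.popleft()
        document = fetch(address)
        if document is None:
            continue                # unreachable or deleted document
        text, links = document
        retrieved[address] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return retrieved


# Usage with an in-memory stand-in for a Web site (illustrative data only).
site = {
    "a": ("Document A", ["b", "c"]),
    "b": ("Document B", ["a"]),
    "c": ("Document C", ["d"]),     # "d" does not exist and is skipped
}
pages = crawl(["a"], site.get)
```

A real crawler would also apply the selection rules that decide which linked documents are worth retrieving; this sketch follows every link it has not yet seen.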
In accordance with further aspects of this invention, performing a Web crawl includes assigning a unique current crawl number to the Web crawl, and determining whether a currently retrieved electronic document corresponding to each previously retrieved electronic document copy is substantively equivalent to the corresponding previously retrieved electronic document copy, in order to determine whether the electronic document has been modified since a previous crawl. If the current electronic document is not substantively equivalent to the previously retrieved electronic document copy, and therefore has been modified, the document's associated crawl number modified is set to the current crawl number and stored in the index with the data from the current electronic document.
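As a rough illustration of this bookkeeping, the sketch below keeps a crawl number modified alongside each indexed document and updates it only on first retrieval or substantive change. The names IndexEntry and record_document are hypothetical, and plain string equality is used as a placeholder for the substantive-equivalence test.

```python
from dataclasses import dataclass


@dataclass
class IndexEntry:
    content: str
    crawl_number_modified: int


def record_document(index, address, content, current_crawl_number,
                    is_equivalent=lambda old, new: old == new):
    """Store `content` under `address`; return True if it is new or modified.

    `is_equivalent` is a placeholder for the substantive-equivalence test;
    by default it is plain string equality, which is only an assumption.
    """
    entry = index.get(address)
    if entry is None or not is_equivalent(entry.content, content):
        # First retrieval or substantive change: stamp with the current crawl.
        index[address] = IndexEntry(content, current_crawl_number)
        return True
    # Unchanged: the stored crawl number modified is left untouched.
    return False


index = {}
record_document(index, "http://example.com/a", "version 1", 1)  # first seen
record_document(index, "http://example.com/a", "version 1", 2)  # unchanged
record_document(index, "http://example.com/a", "version 2", 3)  # modified
```

After these three calls the entry carries crawl number modified 3; crawl 2 left no trace because it found no substantive change.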
In accordance with still other aspects of this invention, a secure hash function is used to determine a hash value corresponding to each retrieved electronic document copy. The hash value is stored in the index and used in subsequent Web crawls to determine whether the corresponding electronic document is modified. The current electronic document is retrieved and used to obtain a new hash value, which is compared with the previously determined hash value corresponding to the associated document address specification that is stored in a history map. If the hash values are equal, the current electronic document is considered to be substantively equivalent to the previously retrieved electronic document copy. If the hash values differ, the current electronic document is considered to be modified and the current crawl number is associated with the newly retrieved electronic document as the crawl number modified. The crawl number modified indicates the crawl number of the last crawl in which the data in the document was found to have changed. The hash value is stored in the index with the associated data from the retrieved document. Preferably, hash functions are applied to data from electronic documents after selected data has been filtered out, so that filtered out data is not represented in the hash values, and is therefore not considered in comparisons. For instance, formatting information contained in the retrieved document could be filtered out before the hash value is computed.
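The hash comparison with filtering can be sketched as follows. The choice of SHA-256, the use of Python's html.parser to discard markup, and the names content_hash and has_changed are assumptions made for illustration; the text above only calls for a secure hash function applied after selected data has been filtered out.

```python
import hashlib
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects character data and ignores tags (the formatting information)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def content_hash(html):
    """Hash a document after filtering out markup and collapsing whitespace."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.parts).split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def has_changed(history_map, address, html):
    """Compare the new hash with the one in the history map, then update it.

    A document with no stored hash is treated as changed (first retrieval).
    """
    new_hash = content_hash(html)
    changed = history_map.get(address) != new_hash
    history_map[address] = new_hash
    return changed


history = {}
has_changed(history, "http://example.com/a", "<p>Hello <b>world</b></p>")
# Same text with different formatting: not a substantive change.
has_changed(history, "http://example.com/a", "<div>Hello   <i>world</i></div>")
```

This simple filter keeps the contents of script and style elements as text, so a production filter would likely need to be more selective about what counts as formatting.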
In accordance with further aspects of this invention, during an incremental crawl, prior to retrieving an electronic document copy, the time stamp of the current electronic document is compared with the previously stored time stamp of a previously retrieved electronic document corresponding to the current electronic document. If the respective time stamps match, the current electronic document is considered to be substantively equivalent to its corresponding previously retrieved electronic document copy, and is therefore not retrieved during the current incremental crawl. Preferably, the comparison of time stamps is performed by sending a request to a server to transfer the current electronic document if the time stamp associated with the current electronic document is more recent than a time stamp included in the request.
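In HTTP, a request of this kind is commonly expressed as a conditional GET carrying an If-Modified-Since header, to which the server answers 304 (Not Modified) when the document is unchanged. Mapping the described comparison onto that header is an assumption made for this sketch, as are the function names conditional_headers and fetch_if_modified.

```python
import urllib.error
import urllib.request
from email.utils import formatdate


def conditional_headers(previous_timestamp):
    """Build the header for a conditional request.

    `previous_timestamp` is the stored time stamp of the previously
    retrieved copy, in seconds since the epoch.
    """
    return {"If-Modified-Since": formatdate(previous_timestamp, usegmt=True)}


def fetch_if_modified(url, previous_timestamp):
    """Return the document body, or None if the server reports no change."""
    request = urllib.request.Request(
        url, headers=conditional_headers(previous_timestamp)
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Not Modified: skip this document in the incremental crawl.
            return None
        raise
```

With this approach the time stamp comparison happens on the server, so an unchanged document costs one small exchange instead of a full transfer.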
As will be readily appreciated from the foregoing description, a system and method formed in accordance with the invention for retrieving data from electronic documents on a computer network provide an efficient way of retrieving and storing information pertaining to electronic documents, wherein the retrieval of electronic documents that have previously been retrieved is minimized. The invention allows a Web crawler to perform crawls in less time and to perform more comprehensive crawls. Each retrieved document is assigned a crawl number modified, which is set to the current crawl number when the document is retrieved for the first time or is found to have been modified in some substantive way since it was last retrieved. This assignment advantageously reduces search and document retrieval time.
Storing the crawl number modified with the document data enables a user to perform a subsequent search using a crawl number as a search criterion. This allows a user to search only for documents that have substantively changed since a previous search. For instance, a user could run a first search requesting documents that meet a particular query. The intermediate agent that queries the search engine could retain the crawl number of the most recent crawl made by the Web crawler along with recording the search query. A second search performed at a later time could run the same query as the first search, but with the intermediate agent implicitly adding the retained crawl number as a search criterion. The resulting search will only return documents with an associated crawl number modified that is subsequent to the retained crawl number. Because the crawl number modified associated with a document only changes when a subsequent Web crawl finds that it has changed in a substantive way, the second search would only return documents that have actually changed since the first search. The present invention offers other advantages over solely relying on the timestamp of the document to search for new documents. For instance, a search that requests only documents with a timestamp subsequent to the date of a prior search would not return any new documents found by the Web crawler but having timestamps that are earlier than the date of the last search.
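The repeated-search scenario can be illustrated with a small sketch in which the retained crawl number acts as an extra filter on the query. The index layout (address mapped to a text and crawl number modified pair), the word-matching rule, and the function name search are hypothetical simplifications.

```python
def search(index, words, since_crawl_number=0):
    """Return addresses of documents that contain every word in `words` and
    whose crawl number modified is later than `since_crawl_number`.

    `index` maps address -> (text, crawl_number_modified); this layout is
    an assumption made for the example.
    """
    wanted = [w.lower() for w in words]
    results = []
    for address, (text, crawl_number_modified) in index.items():
        tokens = text.lower().split()
        if (crawl_number_modified > since_crawl_number
                and all(w in tokens for w in wanted)):
            results.append(address)
    return sorted(results)


index = {
    "a": ("web crawler index", 1),   # unchanged since crawl 1
    "b": ("web crawler hash", 3),    # new or modified in crawl 3
    "c": ("gopher protocol", 3),
}
first = search(index, ["web", "crawler"])                         # ["a", "b"]
# The agent retained crawl number 2 at the time of the first search.
second = search(index, ["web", "crawler"], since_crawl_number=2)  # ["b"]
```

Document "b" appears in the second search even if its own time stamp predates the first search, because the filter depends on when the crawler found it, not on the document's date.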