The present invention is directed to a method for more efficiently indexing documents. More particularly, the present invention is directed to a method for efficiently and effectively indexing documents that are by their nature partially dynamic, that is, documents that change over time, at least in part.
The use of the Internet as an information resource continues to grow. More and more information sites or servers are connected to the Internet, and information seekers conduct more and more searches in this unstructured database.
Within this arrangement a given server may serve a number of different sites. An example of a site which may be accessed by users is www.cnn.com. This site is associated with the Cable News Network. The site contains multiple pages. These pages are typically updated multiple times each day, as and when news events warrant.
It is already known to provide spiders, which on behalf of search engine servers go out into the network on a periodic basis and retrieve documents, each consisting of one or more pages, from one or more servers, and to provide indexers, which index the retrieved documents. A problem arises where a document changes much more rapidly than the spider accesses it to update the index. For instance, if the spider accesses a document only once a day but the document changes multiple times during the course of a day, then it is almost guaranteed that when the most recently retrieved and indexed version of the document is identified in a search operation, the match will be incorrect, since the document will have changed since it was last indexed. Thus, there is a need for a technique to more effectively index these dynamic documents.
In addition, indexing of dynamic documents typically occurs with respect to the entirety of a document. In some circumstances only portions of a document change rapidly, while other, still useful portions change little, if at all. Nonetheless, it can happen that if a document changes more frequently than a certain threshold, indexing will not be performed with respect to that document at all. Under those circumstances the indexer loses the benefit of retrieving and holding indexing information for those portions of a partially dynamic document that do not change frequently. It would therefore be beneficial to provide a method for maximizing the information to be gleaned from partially dynamic documents.
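By way of illustration only, the drawback of a whole-document threshold can be sketched in a few lines of Python. The sketch is hypothetical and is not taken from any particular indexer: the names `Section`, `index_whole_document`, `stable_terms` and `THRESHOLD`, the per-section change rates, and the assumption that a document changes as often as its fastest-changing portion are all introduced here solely for the example.

```python
from dataclasses import dataclass


@dataclass
class Section:
    text: str
    changes_per_day: float  # assumed, illustrative change rate


def index_whole_document(sections, max_changes_per_day):
    """Whole-document policy: skip the document entirely if it changes too often."""
    # Assumption: the document changes as often as its fastest-changing section.
    document_rate = max(s.changes_per_day for s in sections)
    if document_rate > max_changes_per_day:
        return set()  # nothing is indexed, stable sections included
    terms = set()
    for s in sections:
        terms.update(s.text.lower().split())
    return terms


def stable_terms(sections, max_changes_per_day):
    """Terms from sections that individually change no faster than the threshold."""
    terms = set()
    for s in sections:
        if s.changes_per_day <= max_changes_per_day:
            terms.update(s.text.lower().split())
    return terms


# Hypothetical page: a slowly changing portion and a rapidly changing portion.
page = [
    Section("Cable News Network world business sports", 0.01),
    Section("breaking story updated this hour", 12.0),
]
THRESHOLD = 1.0  # assumed maximum changes per day for a document to be indexed

indexed = index_whole_document(page, THRESHOLD)
lost = stable_terms(page, THRESHOLD) - indexed

print(sorted(indexed))  # terms actually indexed under the whole-document policy
print(sorted(lost))     # terms from stable portions that were never indexed
```

In this hypothetical case the rapidly changing portion pushes the whole page over the threshold, so nothing is indexed and every term from the slowly changing portion is lost to the index, which is the loss of information described above.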