The present invention is directed to a method for indexing documents more efficiently. More particularly, the present invention is directed to a method for efficiently and effectively indexing documents that are by their nature partially dynamic, that is, documents that change over time, at least in part.
The use of the Internet as an information resource continues to grow. More and more information sites or servers are connected to the Internet, and information seekers conduct more and more searches in this unstructured database.
Within this arrangement, a given server may serve a number of different sites. An example of a site which may be accessed by users is www.cnn.com. This site is associated with the Cable News Network. The site contains multiple pages. These pages are typically updated multiple times each day, as and when news events warrant.
It is already known to provide spiders, which on behalf of search engine servers go out into the network on a periodic basis and retrieve documents, consisting of one or more pages, from one or more servers, and to provide indexers, which index the retrieved documents. A problem arises where a document changes much more rapidly than the spider accesses the document to update the index. For instance, if the spider accesses a document only once a day, but the document changes multiple times during the course of a day, then a search that identifies the most recently retrieved and indexed version of the document will almost certainly return an incorrect match, since the document will have changed since it was last indexed. Thus, there is a need for a technique to index these dynamic documents more effectively.
In addition, indexing of dynamic documents typically occurs with respect to the entirety of a document. In some circumstances only portions of a document change rapidly, while other, still useful portions change little, if at all. Nonetheless, it can happen that if a document changes more frequently than a certain threshold, indexing will not be performed with respect to that document at all. Under those circumstances the indexer loses the benefit of retrieving and holding indexing information with respect to those portions of partially dynamic documents that do not change frequently. It would therefore be beneficial to provide a method for maximizing the information to be gleaned from partially dynamic documents.
The present invention is directed to a method for effectively indexing partially dynamic documents. In accordance with the method of the present invention, an indexer keeps track of the characteristics of a document as it performs its indexing operation. For example, an indexer may retain a first copy of a document obtained during a first indexing operation. Then, after a predetermined time interval, a spider may retrieve a second copy of the document. The indexer can then compare the two copies to determine the extent to which they differ. If the indexer determines that the differences are sufficiently significant, then the indexer recognizes that this dynamic document should be updated more frequently. As a result, the indexer reduces the predetermined time interval so as to retrieve a third copy of the document after a shorter time interval. This process continues, that is, the time interval is reduced so long as the differences between the compared copies exceed the significance threshold. Alternatively, if the comparison between the first and second copies indicates that there are no changes, or that the changes fall below some insignificance threshold, then the indexer may expand the time interval. By monitoring the amount of change between copies of the documents and then adjusting the time interval at which these documents are retrieved, the present invention more efficiently and effectively indexes partially dynamic documents.
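By way of illustration only, the interval adjustment described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the function names, the use of a character-level similarity ratio as the measure of difference, the halving and doubling factors, the two threshold values, and the interval bounds are all hypothetical choices made for the example.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds and bounds; the description does not fix any values.
SIGNIFICANT = 0.20               # change above this fraction shortens the interval
INSIGNIFICANT = 0.02             # change below this fraction lengthens the interval
MIN_INTERVAL_S = 300.0           # never retrieve more often than every 5 minutes
MAX_INTERVAL_S = 7 * 24 * 3600.0  # never wait longer than one week


def change_fraction(first_copy: str, second_copy: str) -> float:
    """Return 0.0 for identical copies and 1.0 for copies with nothing in common."""
    matcher = SequenceMatcher(None, first_copy, second_copy, autojunk=False)
    return 1.0 - matcher.ratio()


def next_interval(interval_s: float, first_copy: str, second_copy: str) -> float:
    """Choose the retrieval interval to use before fetching the next copy."""
    change = change_fraction(first_copy, second_copy)
    if change > SIGNIFICANT:
        # Significant differences: retrieve the document sooner next time.
        return max(MIN_INTERVAL_S, interval_s / 2)
    if change < INSIGNIFICANT:
        # No change, or insignificant change: wait longer next time.
        return min(MAX_INTERVAL_S, interval_s * 2)
    # Moderate change: keep the current interval.
    return interval_s


if __name__ == "__main__":
    one_day = 24 * 3600.0
    print(next_interval(one_day, "morning headlines", "evening sports scores"))
    print(next_interval(one_day, "static masthead", "static masthead"))
```

In this sketch the interval is halved or doubled on each comparison, so a document that keeps changing significantly converges toward the minimum interval, while a stable document drifts toward the maximum.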
In accordance with another aspect of the present invention, the indexer not only characterizes the significance of the differences between copies of the document in question but also notes the extent to which the copies are similar to one another. The indexer can then use this similarity information to index only those portions of the document that have remained substantially constant over multiple copies, while ignoring the dynamic or changing portions. This indexing improvement permits the indexer to glean information from documents which may change frequently, but whose unchanging portions still provide significant information to potential users.
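As a rough illustration of this aspect, the sketch below keeps only the portions of a document that recur across every retrieved copy and derives index terms from those portions alone. It rests on several assumptions not found in the description: a "portion" is taken to be a non-empty line, "substantially constant" is simplified to exact equality across all copies, and the function names and sample text are hypothetical.

```python
import re


def split_blocks(document: str) -> list[str]:
    """Treat each non-empty line as one portion of the document (an assumption)."""
    return [line.strip() for line in document.splitlines() if line.strip()]


def stable_blocks(copies: list[str]) -> list[str]:
    """Return the portions present in every copy, in the order of the latest copy."""
    if not copies:
        return []
    common = set.intersection(*(set(split_blocks(copy)) for copy in copies))
    result, seen = [], set()
    for block in split_blocks(copies[-1]):
        if block in common and block not in seen:
            result.append(block)
            seen.add(block)
    return result


def index_terms(copies: list[str]) -> set[str]:
    """Collect lowercase terms from the stable portions only."""
    terms: set[str] = set()
    for block in stable_blocks(copies):
        terms.update(re.findall(r"[a-z0-9]+", block.lower()))
    return terms


if __name__ == "__main__":
    first = "Example News\nBreaking: storm hits coast\nContact the news desk"
    second = "Example News\nBreaking: election results\nContact the news desk"
    print(stable_blocks([first, second]))
    print(sorted(index_terms([first, second])))
```

A practical indexer would likely tolerate small edits within a portion rather than require exact equality; the exact-match rule is used here only to keep the example short.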
The present invention also includes combining these two concepts, adapting the time interval for indexing and adapting the selection of material to be indexed, to further enhance indexing efficiency.
The present invention can also take into account the general usefulness of a document to others in determining how frequently to index the document and how much of the document should be indexed. The present invention thus provides an improvement over the indexing capabilities known in the prior art.