Regardless of the search technology that is employed, most conventional search systems follow the same basic procedure for indexing and searching a database in a digital library. First, the data to be searched must be input to the search system for indexing. Next, attributes or contents or both are extracted from the objects and processed to create an index. An index contains data that is used by the search system to process queries and identify relevant objects. After the index is created, queries may be submitted to the search system. The query represents information needed by the user and is expressed using a query language and syntax defined by the search system. The search system processes the query using the index data for the database and a suitable similarity ranking algorithm. From this, the system returns a list of topically relevant objects, often referred to as a "hit-list." The user may then select relevant objects from the hit-list for viewing and processing.
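The index-then-query procedure described above can be illustrated with a minimal sketch. The function names (`tokenize`, `build_index`, `search`) and the ranking rule (summed term frequency) are assumptions chosen for illustration only; an actual search system would typically use a more elaborate index and similarity ranking algorithm.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase whitespace-delimited terms (illustrative only)."""
    return text.lower().split()


def build_index(documents):
    """Build an inverted index mapping each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


def search(index, query):
    """Return a hit-list of doc ids, ranked by summed frequency of query terms."""
    scores = Counter()
    for term in tokenize(query):
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    # Highest score first; ties are broken by doc id for a stable ordering.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ranked]


if __name__ == "__main__":
    docs = {
        "a": "web search index",
        "b": "search search engine",
        "c": "cooking recipes",
    }
    print(search(build_index(docs), "search engine"))
```

In this sketch only the index is consulted at query time, so the objects themselves need not reside with the search system.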
A user may also use objects on the hit-list as navigational starting points. Navigation is the process of moving from one hypermedia object to another hypermedia object by traversing a hyperlink pointer between the objects. This operation is typically facilitated by a user interface that displays hypermedia objects, highlights the hyperlinks in those objects, and provides a simple mechanism for traversing a hyperlink and displaying the referent object. One such user interface is a Web browser. By navigating from one object to another, a user may find other objects of interest.
In a network environment, the components of a text search system may be distributed across multiple computers. A network environment includes, at a minimum, two or more computers connected by a local or wide area network (e.g., Ethernet, the telephone network, and the Internet). A user accesses the hypermedia object database using a client application on the user's computer. The client application communicates with a search server (e.g., a hypermedia object database search system) on either the same computer (e.g., the client) or another computer (e.g., one or more servers) on the network. To process queries, the search server needs to access just the database index, which may be located on the same computer as the search server or yet another computer on the network. The actual objects in the database may be located on any computer on the network. These types of systems and search processes are all well known in the computing and database arts.
A Web environment, such as the World Wide Web (WWW) on the Internet, is a network environment where Web servers and browsers are used. Once all of the documents available in the collection have been gathered and indexed, the index can then be used, as described above, to search for documents in the collection. Again, the index may be located independently of the objects, the client, and even the search server. A hit-list, generated as the result of searching the index, typically identifies the locations and titles of the relevant documents in the collection, and the user may then retrieve those documents directly with the user's Web browser.
One of the continuing problems in information retrieval is that, in the Web environment, most searches return a large number of near-duplicate documents. A number of methods have been proposed for recognizing and eliminating such duplicates.
For example, Eric W. Brown and John M. Prager in U.S. Pat. No. 5,913,208 note that documents having identical metadata such as size, date, and base filename are likely to be copies kept on different directories or on different servers, and can effectively be reduced to one single occurrence.
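The metadata comparison described above can be sketched as follows. The record layout (`path`, `size`, and `date` fields) and the function names are hypothetical and are not taken from the cited patent; the sketch shows only the general idea of collapsing entries whose size, date, and base filename all agree.

```python
import posixpath


def metadata_key(record):
    """Key a record by size, date, and base filename (directory and host ignored)."""
    return (record["size"], record["date"], posixpath.basename(record["path"]))


def collapse_by_metadata(records):
    """Keep the first record seen for each distinct metadata key."""
    seen = {}
    for record in records:
        seen.setdefault(metadata_key(record), record)
    return list(seen.values())


if __name__ == "__main__":
    hits = [
        {"path": "server1/docs/report.txt", "size": 2048, "date": "1999-05-01"},
        {"path": "server2/mirror/report.txt", "size": 2048, "date": "1999-05-01"},
        {"path": "server1/docs/notes.txt", "size": 512, "date": "1999-06-12"},
    ]
    for hit in collapse_by_metadata(hits):
        print(hit["path"])
```

Such a comparison examines no document content, so it can recognize only exact copies whose metadata has been preserved.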
Another system was described by Andrei Z. Broder, "Identifying and Filtering Near-duplicate Documents," Combinatorial Pattern Matching, 11th Annual Symposium, Montreal, Canada, June 2000, in which regions of each document, called "shingles," are each treated as a sequence of tokens and then reduced to a numerical representation. These are then converted to "fingerprints" using a method originally described by M. O. Rabin, "Fingerprinting by random polynomials," Center for Research in Computing Technology, Harvard University, Report TR-15-81, 1981.
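The shingling approach may be illustrated by the following sketch, which rests on several stated assumptions. The shingle width of four tokens is arbitrary; a truncated SHA-1 digest stands in for the Rabin fingerprint, which is not implemented here; and comparing the resulting fingerprint sets by the ratio of shared to total fingerprints is an assumption of this sketch rather than a detail given above.

```python
import hashlib


def shingles(text, width=4):
    """Return every run of `width` consecutive tokens in the text."""
    tokens = text.lower().split()
    if len(tokens) < width:
        # A short document yields a single shingle covering all of its tokens.
        return [tuple(tokens)] if tokens else []
    return [tuple(tokens[i:i + width]) for i in range(len(tokens) - width + 1)]


def fingerprint(shingle):
    """Reduce a shingle to a 64-bit integer (SHA-1 stand-in, not a Rabin fingerprint)."""
    digest = hashlib.sha1(" ".join(shingle).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fingerprint_set(text, width=4):
    """Return the set of fingerprints for all shingles of a document."""
    return {fingerprint(s) for s in shingles(text, width)}


def resemblance(doc_a, doc_b, width=4):
    """Ratio of shared fingerprints to all distinct fingerprints of both documents."""
    fa, fb = fingerprint_set(doc_a, width), fingerprint_set(doc_b, width)
    if not fa and not fb:
        return 1.0
    return len(fa & fb) / len(fa | fb)


if __name__ == "__main__":
    print(resemblance("a b c d e f", "a b c d e g"))
```

Under these assumptions, a value near 1.0 suggests near-duplicate documents and a value near 0.0 suggests unrelated documents.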
At an even more simplistic level, an algorithm has been described for detecting plagiarism in which one simply searches for matches of six or more successive words between two documents.
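The word-run comparison described above can be sketched as follows. The function names and the lowercase, whitespace-based tokenization are assumptions made for illustration; the description above specifies only the search for six or more successive matching words.

```python
def word_ngrams(text, n=6):
    """Return the set of all runs of n successive words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_runs(doc_a, doc_b, n=6):
    """Return the n-word runs that appear in both documents."""
    return word_ngrams(doc_a, n) & word_ngrams(doc_b, n)


def looks_copied(doc_a, doc_b, n=6):
    """True if the documents share at least one run of n successive words."""
    return bool(shared_runs(doc_a, doc_b, n))


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog"
    suspect = "yesterday the quick brown fox jumps over a fence"
    print(looks_copied(original, suspect))
```

Any match longer than six words necessarily contains a six-word match, so testing six-word runs suffices to detect all longer matches as well.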
As should be apparent, a need exists to provide an accurate and efficient algorithm and system for determining a degree of likeness between electronically represented documents.