This invention relates to the field of computerized information search and retrieval systems. More particularly, this invention relates to a method and apparatus for enabling the user to efficiently locate and retrieve similar or identical passages occurring within a document database.
Documents are increasingly being represented as digital bits of data and stored in electronic databases. These documents often appear as electronic versions of newspapers, magazines, journals, encyclopedias, books, and other printed materials. Such electronic "texts" can be comprised of miscellaneous strings of characters, words, sentences, paragraphs, or documents of indeterminate or varied lengths and may include a wide variety of data classifications, such as alpha-numerics, symbols, graphics, or bit sequences of any sort. Passages from these electronic texts can be accessed through the use of computers and further republished with astonishing ease and expediency.
Authors and publishers place considerable proprietary value on the textual passages they generate (e.g., newspaper and magazine articles). However, the ease with which textual passages can be duplicated in electronic storage media presents the problem that such passages can be copied and/or incorporated into larger documents without proper attribution or remuneration to the original author. This duplication can occur either without modification to the original passage or with only minor revisions such that original authorship cannot reasonably be disputed.
To guard against the unauthorized republication of such passages, authors and publishers desire an ability to search for their original work in a document database (such as the internet, LEXIS®NEXIS®, DIALOG®, and the like) for the purpose of locating specific instances where unauthorized republication has occurred. Similarly, publishers have a compelling need to ensure that all manuscripts that have been submitted for publication are, in their entirety, original works of authorship. Academic institutions, too, may wish to verify student theses and dissertations to confirm that they do not contain instances of plagiarism before academic credit for the writing can be awarded.
Also, authors and researchers often have a need to locate the source of a given passage but frequently do not know the title, author, date of publication, or other identifying feature of the original work. Unless the user has an exact quotation, it can be very difficult to find the source of the passage in order to give proper recognition to the original author. By enabling the author or researcher to efficiently compare the passages of a given text with documents published elsewhere, the process of finding the original work is significantly enhanced.
These examples highlight the need for an ability to efficiently locate and retrieve similar or identical passages appearing in other texts contained in electronic storage media. To locate and retrieve these passages under conventional document retrieval techniques, users may attempt to utilize a "keyword" or query term search. Under this method, every document existing in the database being searched that contains the keyword or query term selected by the user can be retrieved. This, however, is a very ineffective search technique for comparing passages because the user can easily become overwhelmed with enormous numbers of retrieved documents, most of which will have no relation to the user's particular inquiry.
Another method for locating and retrieving similar or identical passages may be through the use of a Boolean search. A Boolean search involves searching for documents containing more than one keyword. This is typically accomplished by joining keywords with conjunctions, such as "AND" and/or "OR". If two or more keywords are joined by an AND, only those texts that contain all the keywords will be identified. If two or more keywords are joined by an OR, all texts that contain at least one of the joined keywords will be identified.
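The AND and OR behavior described above can be illustrated with a minimal Python sketch. The function names (`tokenize`, `boolean_and`, `boolean_or`) and the sample documents are hypothetical and chosen only for illustration; they do not correspond to any particular retrieval system.

```python
def tokenize(text):
    """Lowercase the text and return its distinct whitespace-separated words."""
    return set(text.lower().split())


def boolean_and(docs, keywords):
    """Return ids of documents that contain every keyword."""
    wanted = [k.lower() for k in keywords]
    return [doc_id for doc_id, text in docs.items()
            if all(k in tokenize(text) for k in wanted)]


def boolean_or(docs, keywords):
    """Return ids of documents that contain at least one keyword."""
    wanted = [k.lower() for k in keywords]
    return [doc_id for doc_id, text in docs.items()
            if any(k in tokenize(text) for k in wanted)]


# Hypothetical three-document database.
docs = {
    "a": "the quick brown fox",
    "b": "the lazy dog",
    "c": "quick dog tricks",
}

and_hits = boolean_and(docs, ["quick", "dog"])  # only "c" has both words
or_hits = boolean_or(docs, ["quick", "dog"])    # every document has at least one
```

Note that both functions reduce each document to a set of words, so neither result carries any information about the order in which the keywords appear.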
Unfortunately, keyword and Boolean search and retrieval techniques possess many weaknesses. One disadvantage associated with these methods is that the user must anticipate which specific keywords will identify and distinguish relevant texts. If the user fails to select the appropriate keywords or performs a Boolean search that is too restrictive, highly relevant texts might not be identified and thus will be overlooked. The user may not perceive the effects of a high false-negative rate and could become wrongly convinced that the search was successful despite likely missing the very best documents.
A similar disadvantage with keyword and Boolean searches is that a poorly designed query can potentially result in the identification of too many documents that satisfy the user's search criteria. This can occur if a selected keyword is too common and/or the user heedlessly employs the conjunction OR to join multiple keywords in a Boolean search. If too many documents are retrieved, the user must expend much time and energy to tediously review each document and extricate the truly relevant documents from the vast collection of those identified as potential matches. Hence, a user frequently must select different keywords (and combinations thereof) in a costly and time-consuming iterative process to either broaden or narrow the search request.
More significantly, although these techniques may inform the user about the presence or absence of specific terms in a given text, they do not provide any insight regarding the actual sequence in which those terms appear in that text. As such, these search and retrieval techniques are not effective for finding strict sequences of information in a given set of documents. When a user is considering such matters as unauthorized republication or plagiarism, the information sought to be extracted from the database goes beyond the mere co-presence of terms or the appearance of a few terms (e.g., noun phrases) in the same order.
More recent text retrieval methods such as vector-space approaches afford more freedom to the user through the implementation of advanced search techniques such as query-term frequencies and similar statistical analyses. However, the principal focus of such techniques is to retrieve documents that most likely epitomize the main concepts associated with the user's search query; as in keyword and Boolean searches, little or no effort is made to actually compare sequential information embodied in specific textual passages. As such, vector-space retrieval techniques are, by themselves, relatively ineffective methods for locating and retrieving similar or identical passages occurring within a database of documents.
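The loss of sequential information in a term-frequency comparison can be seen in a small Python sketch. It assumes the simplest possible vector-space model (raw word counts compared by cosine similarity); practical systems typically apply additional weighting, and the names `term_vector` and `cosine` are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Map each lowercased word to the number of times it occurs."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


original = "the dog bit the man"
reordered = "the man bit the dog"

# Both passages yield the same counts, so the similarity is maximal
# (1.0 up to floating-point rounding) although the word order differs.
similarity = cosine(term_vector(original), term_vector(reordered))
```

Because the two passages produce identical vectors, a comparison of this kind cannot distinguish a verbatim copy from a rearrangement of the same words.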
One technique that might be utilized to compare sequential information among two or more documents is to perform a sequential string search on all of the documents appearing in the database being searched. A sequential string search examines each document word-by-word to determine whether a string of words matching the string of words in the query exists. Typically, however, users do not know where the starting and ending points of matching strings will occur in the documents being searched. Consequently, users are forced to scrupulously examine every word of every document in the entire database to determine whether a matching string exists. This can be an extremely slow and inefficient operation, particularly when the database being searched is large and when the known passage being matched against the database is only a few words long.
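A word-by-word sequential string search of the kind described above might be sketched in Python as follows. This is a naive illustration under the assumption that documents and query are compared as lowercased, whitespace-separated words; the names `find_word_sequence` and `search_database` are hypothetical.

```python
def find_word_sequence(document_words, query_words):
    """Return every start index at which query_words occurs contiguously."""
    n, m = len(document_words), len(query_words)
    if m == 0:
        return []
    hits = []
    # Every possible starting position must be tried, because the location
    # of a match within the document is not known in advance.
    for start in range(n - m + 1):
        if document_words[start:start + m] == query_words:
            hits.append(start)
    return hits


def search_database(docs, passage):
    """Scan every document in full; map matching document ids to hit positions."""
    query = passage.lower().split()
    results = {}
    for doc_id, text in docs.items():
        hits = find_word_sequence(text.lower().split(), query)
        if hits:
            results[doc_id] = hits
    return results


# Hypothetical two-document database.
docs = {
    "a": "it was the best of times it was the worst of times",
    "b": "nothing relevant appears here",
}
matches = search_database(docs, "it was the")
```

The work grows with the total number of words in the database multiplied by the query length, since no document can be excluded without being read in full.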
It is an object of the present invention to provide a text location and retrieval system.
It is another object of the invention to provide a text location and retrieval system that allows the texts of different documents to be compared for the purpose of locating similar or identical passages.
It is still another object of the invention to provide a text location and retrieval system that compares the texts of different documents in a minimal amount of time.
It is a further object of the invention to provide a text location and retrieval system that enables the user to determine whether a known document (or portions thereof) has been republished elsewhere.
The present invention provides a method and apparatus for locating and retrieving similar or identical passages among different documents. Toward this end, this invention uses discourse structures along with content attributes to form encoded "marker sequences" that collectively give a characterizing "signature" to a known textual passage. These marker sequences substantially reduce the total amount of information in the passage while still permitting the encodings to be evaluated against a database of similarly encoded (and therefore similarly reduced) documents to identify candidate documents that contain similar or identical passages.
This computer-implemented method and apparatus for retrieving similar and identical passages from database documents incorporates the steps of inputting a known passage into a processing device, converting the known passage into a plurality of first marker sequence encodings, converting the database documents into a plurality of second marker sequence encodings, and evaluating the first marker sequence encodings against the second marker sequence encodings to identify candidate documents. The known passage can further be compared with the candidate documents using a sequential string search of either (1) the first marker sequence encodings against the second marker sequence encodings, or (2) each word contained in the known passage against each word contained in the candidate documents.
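The overall flow of these steps (encode the known passage, encode the database documents, evaluate the encodings to select candidates) can be sketched in Python. The encoding used here is purely a hypothetical stand-in and is not the marker encoding of the invention: each sentence is reduced to its word count plus the initial letters of its first and last words, and consecutive markers are grouped into sequences of length `n`. All function names are likewise assumptions made for illustration.

```python
import re


def markers(text):
    """Hypothetical encoding: one short marker per sentence."""
    out = []
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = sentence.split()
        if words:
            # Word count plus initials of the first and last words.
            out.append(f"{len(words)}{words[0][0]}{words[-1][0]}")
    return out


def marker_sequences(text, n=2):
    """Return the set of runs of n consecutive sentence markers."""
    m = markers(text)
    return {tuple(m[i:i + n]) for i in range(len(m) - n + 1)}


def candidate_documents(passage, docs, n=2):
    """Return ids of documents sharing at least one marker sequence with the passage."""
    first = marker_sequences(passage, n)
    return [doc_id for doc_id, text in docs.items()
            if first & marker_sequences(text, n)]


# Hypothetical known passage and database.
passage = "The cat sat. It purred loudly. Then it slept."
docs = {
    "x": "Intro text here today. The cat sat. It purred loudly. The end.",
    "y": "Completely unrelated words appear here. Nothing matches at all.",
}
candidates = candidate_documents(passage, docs)
```

Only the documents returned as candidates would then need the slower word-by-word comparison, which is the point of evaluating reduced encodings first. A stand-in encoding this coarse would produce many false candidates on real text; it serves only to show the shape of the pipeline.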
These and other aspects and advantages of the present invention will become better understood with reference to the following description, drawings, and appended claims.