Documents are increasingly being represented as digital bits of data and stored in electronic databases. These documents often appear as electronic versions of newspapers, magazines, journals, encyclopedias, books, and other printed materials. Such electronic "texts" can be comprised of miscellaneous strings of characters, words, sentences, paragraphs, or documents of indeterminate or varied lengths and may include a wide variety of data classifications, such as alpha-numerics, symbols, graphics, or bit sequences of any sort. Passages from these electronic texts can be accessed through the use of computers and further republished with astonishing ease and expediency.
Authors and publishers place considerable proprietary value on the textual passages they generate (e.g., newspaper and magazine articles). However, the ease with which textual passages can be duplicated in electronic storage media presents the problem that such passages can be copied and/or incorporated into larger documents without proper attribution or remuneration to the original author. This duplication can occur either without modification to the original passage or with only minor revisions such that original authorship cannot reasonably be disputed.
To guard against the unauthorized republication of such passages, authors and publishers desire an ability to search for their original work in a document database--such as the internet, LEXIS® NEXIS®, DIALOG®, and the like--for the purpose of locating specific instances where unauthorized republication has occurred. Similarly, publishers have a compelling need to ensure that all manuscripts that have been submitted for publication are, in their entirety, original works of authorship. Academic institutions, too, may wish to verify student theses and dissertations to confirm that they do not contain instances of plagiarism before academic credit for the writing can be awarded.
Also, authors and researchers often have a need to locate the source of a given passage but frequently do not know the title, author, date of publication, or other identifying feature of the original work. Unless the user has an exact quotation, it can be very difficult to find the source of the passage in order to give proper recognition to the original author. By enabling the author or researcher to efficiently compare the passages of a given text with documents published elsewhere, the process of finding the original work is significantly enhanced.
These examples highlight the need for an ability to efficiently locate and retrieve similar or identical passages appearing in other texts contained in electronic storage media. To locate and retrieve these passages under conventional document retrieval techniques, users may attempt to utilize a "keyword" or query term search. Under this method, every document existing in the database being searched that contains the keyword or query term selected by the user can be retrieved. This, however, is a very ineffective search technique for comparing passages because the user can easily become overwhelmed with enormous numbers of retrieved documents, most of which will have no relation to the user's particular inquiry.
Another method for locating and retrieving similar or identical passages may be through the use of a Boolean search. A Boolean search involves searching for documents containing more than one keyword. This is typically accomplished by joining keywords with conjunctions, such as "AND" and/or "OR". If two or more keywords are joined by an AND, only those texts that contain all the keywords will be identified. If two or more keywords are joined by an OR, all texts that contain at least one of the joined keywords will be identified.
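By way of illustration only, the AND and OR behavior described above may be sketched as follows. This is a minimal sketch in Python and does not reproduce any particular retrieval system; the function names `tokenize` and `boolean_search` and the sample documents are hypothetical and were chosen solely for this example.

```python
def tokenize(text):
    """Lowercase a text and return its set of words, with edge punctuation stripped."""
    words = {word.strip('.,;:!?"()').lower() for word in text.split()}
    words.discard("")
    return words


def boolean_search(documents, keywords, operator="AND"):
    """Return the names of documents that satisfy a simple Boolean query.

    documents: dict mapping a (hypothetical) document name to its text.
    keywords:  list of query terms.
    operator:  "AND" requires every keyword; "OR" requires at least one.
    """
    wanted = {keyword.lower() for keyword in keywords}
    matches = []
    for name, text in documents.items():
        words = tokenize(text)
        if operator == "AND":
            hit = wanted <= words          # every keyword is present
        elif operator == "OR":
            hit = bool(wanted & words)     # at least one keyword is present
        else:
            raise ValueError("operator must be 'AND' or 'OR'")
        if hit:
            matches.append(name)
    return matches


# Hypothetical sample data, for illustration only.
documents = {
    "doc_a": "The senator praised the new budget.",
    "doc_b": "The budget was criticized by analysts.",
    "doc_c": "Analysts met the senator on Tuesday.",
    "doc_d": "Rain is expected over the weekend.",
}

and_hits = boolean_search(documents, ["senator", "budget"], "AND")
or_hits = boolean_search(documents, ["senator", "budget"], "OR")
print(and_hits)  # only doc_a contains both keywords
print(or_hits)   # doc_a, doc_b, and doc_c each contain at least one keyword
```

In this sketch each document is reduced to an unordered set of words, so the query can test only for the presence of terms and not for the order in which they appear.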
Unfortunately, keyword and Boolean search and retrieval techniques possess many weaknesses. One disadvantage associated with these methods is that the user must anticipate which specific keywords will identify and distinguish relevant texts. If the user fails to select the appropriate keywords or performs a Boolean search that is too restrictive, highly relevant texts might not be identified and thus will be overlooked. The user may not perceive the effects of a high false-negative rate and could become wrongly convinced that the search was successful despite likely missing the very best documents.
A similar disadvantage with keyword and Boolean searches is that a poorly designed query can potentially result in the identification of too many documents that satisfy the user's search criteria. This can occur if a selected keyword is too common and/or the user heedlessly employs the conjunction OR to join multiple keywords in a Boolean search. If too many documents are retrieved, the user must expend much time and energy to tediously review each document and extricate the truly relevant documents from the vast collection of those identified as potential matches. Hence, a user frequently must select different keywords (and combinations thereof) in a costly and time-consuming iterative process to either broaden or narrow the search request.
More significantly, although these techniques may inform the user about the presence or absence of specific terms in a given text, they do not provide any insight regarding the actual sequence in which those terms appear in that text. As such, these search and retrieval techniques are not effective for finding strict sequences of information in a given set of documents. When a user is considering such matters as unauthorized republication or plagiarism, the information sought to be extracted from the database goes beyond the mere co-presence of terms or the appearance of a few terms (e.g., noun phrases) in the same order.
More recent text retrieval methods such as vector-space approaches afford more freedom to the user through the implementation of advanced search techniques such as query-term frequencies and similar statistical analyses. However, the principal focus of such techniques is to retrieve documents that most likely epitomize the main concepts associated with the user's search query; as in keyword and Boolean searches, little or no effort is made to actually compare sequential information embodied in specific textual passages. As such, vector-space retrieval techniques are, by themselves, relatively ineffective methods for locating and retrieving similar or identical passages occurring within a database of documents.
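The insensitivity of such approaches to word order may be illustrated with a minimal sketch, assuming a plain term-frequency vector and cosine similarity. Actual vector-space systems typically apply additional weighting, so the code below is a simplification rather than a description of any specific system; the names `term_vector` and `cosine_similarity` and the sample sentences are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a text as a bag of words: term -> number of occurrences."""
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two term-frequency vectors."""
    shared_terms = vec_a.keys() & vec_b.keys()
    dot = sum(vec_a[term] * vec_b[term] for term in shared_terms)
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical passages containing the same words in a different order.
original = "the dog bit the man"
reordered = "the man bit the dog"

similarity = cosine_similarity(term_vector(original), term_vector(reordered))
print(similarity)  # approximately 1.0, although the two sentences differ
```

Because both passages yield the same term counts, the measure treats them as essentially identical, which shows why a term-frequency comparison alone cannot distinguish a copied sequence from a mere rearrangement of the same vocabulary.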
One technique that might be utilized to compare sequential information among two or more documents is to perform a sequential string search on all of the documents appearing in the database being searched. A sequential string search examines each document word-by-word to determine whether a string of words matching the string of words in the query exists. Typically, however, users do not know where the starting and ending points of matching strings will occur in the documents being searched.
Consequently, users are forced to scrupulously examine every word of every document in the entire database to determine whether a matching string exists. This can be an extremely slow and inefficient operation, particularly when the database being searched is large and when the known passage being matched against the database is only a few words long.
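The sequential string search described above may be sketched as follows. This is a naive word-by-word scan written in Python for illustration only; the function names `find_word_sequence` and `sequential_search` and the sample documents are hypothetical, and tokenization is simplified to splitting on whitespace.

```python
def find_word_sequence(document_words, query_words):
    """Return every starting word index at which query_words occurs in document_words.

    Because the start of a match is not known in advance, every
    possible starting position in the document is tried.
    """
    doc_len, query_len = len(document_words), len(query_words)
    positions = []
    if query_len == 0:
        return positions
    for start in range(doc_len - query_len + 1):
        if document_words[start:start + query_len] == query_words:
            positions.append(start)
    return positions


def sequential_search(documents, query):
    """Scan every document in the collection for the query word sequence."""
    query_words = query.lower().split()
    results = {}
    for name, text in documents.items():
        positions = find_word_sequence(text.lower().split(), query_words)
        if positions:
            results[name] = positions
    return results


# Hypothetical sample data, for illustration only.
documents = {
    "doc_a": "it was the best of times it was the worst of times",
    "doc_b": "the times were the worst",
}

results = sequential_search(documents, "was the worst")
print(results)  # doc_a matches at word index 7; doc_b has no match
```

In the worst case this scan performs on the order of (document length) x (query length) word comparisons for every document in the collection, which is consistent with the inefficiency noted above for large databases.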