Detecting and determining the existence of text plagiarism is complex and difficult. This complexity and difficulty increase in direct proportion to the number of available text documents. This is the age of electronic commerce, in which so-called “e-books”, the Internet, HTML, e-mail, cyber-classrooms and textbooks, electronic publishing, electronic faxing, scanning, Portable Document Format (PDF) documents, Web pages, on-line newspapers, optical character recognition (OCR), “cut-and-paste”, pay-per-chapter electronic publishing, etc., are becoming commonplace. In this age, text, copies of text, copies of copies of text, etc., fly across the world in a matter of seconds.
In this age, it is thoughtlessly commonplace to electronically copy text and do so instantaneously with a click of a button. It is exceedingly easy to duplicate wholesale (or significant) portions of text documents. This task requires no more technical expertise than the ability to press a button or press CTRL-V (to complete the “cut-and-paste” operation).
Plagiarism
However, just because something is easy to do does not make it right. Although it is easy for a person to copy an author's work and pass it off as his own, that does not make such action right. Such action is commonly called “cheating” or “plagiarism.” Thus, a person engaging in such action is a “cheater” or a “plagiarist.” Since most contemporary works are copyrighted (either automatically or upon registration), a plagiarist is also infringing such copyrights and is subject to civil and possibly criminal penalties.
Why would a plagiarist take action that is socially unacceptable, deceitful, and likely illegal? Because it is easy for the plagiarist to do and he is unlikely to be caught.
A plagiarist realizes that authorities must compare the pilfered words in his work with oceans of words, phrases, quotes, chapters, books, and other works. These oceans are vast and deep. They include text found in all of the libraries, bookstores, web sites, manuals, textbooks, e-mails, etc., of the whole world.
Catching a plagiarist is a daunting task indeed. Typically, if an investigative authority does not have a lead on where to look, the task is nearly impossible. However, one tool that makes the investigation easier is an electronic database (or index) of text that has been recorded electronically.
To avoid capture, a plagiarist may simply change a few token words, the punctuation, the pagination, the text order, and/or the format of the text documents, or insert new text. Meanwhile, the true authors and publishers of the substantive content of the plagiarized work are robbed of well-deserved credit and/or royalties.
Conventional Efforts to Detect Plagiarism
Much effort has been directed towards protecting images, audio, and video by embedding a hidden watermark and/or generating a mathematical representation of such content. Much of this effort is geared towards detecting identifiers within the content even after the signals have been modified (intentionally or unintentionally). Such identifiers may be inserted into the content or be inherent in the content.
Generally, these conventional techniques may insert an imperceptible change in multimedia (such as audio or video). Alternatively, these techniques determine an inherent characteristic of a work. These conventional techniques rely on the foundation that the code/inherent characteristic cannot be detected without access to secret knowledge (such as a cryptographic key) and is unalterable without noticeably altering the content.
However, these conventional techniques have not been directed toward protecting text because they do not apply to text. They do not apply because, when used on text, they generally either require a perceptible change to the original content or are easily thwarted.
For example, the concept of embedding a watermark into an image or audio signal does not apply to text because embedding a watermark would significantly alter the content (unless, of course, the author inserts it). That alteration would be clearly noticeable. A mathematical representation of text is easily thwarted by changing a few token words, the punctuation, the pagination, the text order, and/or the format of the text documents, or by inserting new text.
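The fragility of an exact mathematical representation can be illustrated with a minimal sketch. It assumes, purely for illustration, that the representation is a SHA-256 digest of the text; the passage above does not name a specific representation, and the `digest` helper and the sample sentences are hypothetical.

```python
import hashlib


def digest(text: str) -> str:
    """Return a SHA-256 hex digest standing in for a 'mathematical representation' (assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical sample text and a copy in which one token word was changed.
original = "It is exceedingly easy to duplicate portions of text documents."
altered = "It is extremely easy to duplicate portions of text documents."

# An exact digest only matches byte-identical text, so a one-word edit defeats it.
digests_match = digest(original) == digest(altered)
print("digests match:", digests_match)
```

Under this assumption, a single substituted word is enough to make the two representations unrelated, even though a human reader would regard the two sentences as the same content.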
Side-by-side Text Comparison Approach. A side-by-side comparison of suspect text and possibly original text is an existing technique for detecting a copy of an original text. However, it can be easily thwarted by reordering text, adding text, and changing inessential text. If a comparison is done manually, a person may overlook such obfuscation tactics and see through to the similarity (which may amount to plagiarism). However, a comparison of electronic documents by a computer is not so forgiving.
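The effect of reordering on a mechanical side-by-side comparison can be sketched as follows. This is an illustrative assumption about how such a comparison might be coded (a position-by-position equality check over sentences), not a description of any particular existing tool; the function name and the sample sentences are hypothetical.

```python
def positional_matches(a: list[str], b: list[str]) -> int:
    """Count sentences that are identical at the same position in both documents."""
    return sum(1 for x, y in zip(a, b) if x == y)


# Hypothetical original document, split into sentences.
original = ["The oceans are vast.", "Copying is easy.", "Detection is hard."]

# The same sentences, merely reordered by a would-be plagiarist.
suspect = ["Detection is hard.", "The oceans are vast.", "Copying is easy."]

side_by_side = positional_matches(original, suspect)

# An order-insensitive check (an added contrast, not part of the approach described above).
shared = len(set(original) & set(suspect))

print("positional matches:", side_by_side)
print("shared sentences:", shared)
```

In this sketch the position-by-position comparison finds no matching sentences, although every sentence of the suspect text appears in the original, which is the kind of similarity a human reader would notice at once.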
With the emergence of so-called e-books, the problem of protecting text is becoming more important. The term “e-book” refers to the electronic distribution of electronic text; it is an alternative commercial publication technique.
Although such e-book mechanisms include cryptographic locks, such locks can be picked. It would be helpful to be able to determine whether a subject body of text is substantially similar to an original text, but no conventional technique for doing so is available.
Content Categorization
Like detecting plagiarism, categorizing the content of a text-based work often requires a subjective comparison with existing works. Works of a similar nature are grouped into the same category. Text-based works may be classified into any number of categories, such as mystery novels, math textbooks, non-fiction books, self-help books, commercial web pages, poetry, and other such works.
Typically, such categorization is determined by manual (i.e., human) subjective analysis of a work so that it may be grouped into an existing category. No technique exists for automatically (i.e., without substantial human involvement) analyzing and categorizing a text-based work.