The threat of so-called “pirating” of digitally-formatted works has been a significant obstacle to the adoption and widespread use of the Internet for distribution of media such as books, musical works, and motion pictures. Although such network distribution would at first glance seem ideal for these media, which are easily represented in electronic format, there has been no easy way to prevent widespread copying of such works once they are introduced on the public network. In many cases, one person will buy a legitimate copy and then distribute further copies to friends and others without any further payments to the publisher. This threatens the financial well-being of the publishers, and makes them very reluctant to introduce their works on the Internet.
One approach to solving this problem is to provide content protection mechanisms. For example, songs might be distributed encrypted, with the decryption key hidden from the user.
Encryption, however, does not completely solve the problem. Rather, it merely makes the original content more difficult to recover. Even in the face of encryption, a user might discover the decryption key and distribute the original work, unencrypted. Even more simply, a song might be captured after it is decrypted and converted to analog, resulting in only a small reduction in quality. Similarly, a digitally-formatted book might be viewed and simultaneously retyped to create a new, unencrypted version of the book.
With video and audio, some progress has been made in the use of so-called “watermarking,” in which a known pattern of digital “noise” is introduced to the sequential samples of a digital data stream. The amplitude of this noise is designed to be quite small, so that it does not degrade the audio or video quality in any perceivable way.
Through the use of watermarks, publishers can verify their ownership of given works. In addition, different watermarks can be used with different copies of the same work, thereby allowing the publisher to trace a pirated work back to its original source.
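The watermarking idea described above can be illustrated with a minimal sketch. It is not taken from any particular watermarking system: the per-copy integer key, the +1/-1 noise pattern, the amplitude value, and the function names are all assumptions made for illustration, and the detector assumes the publisher still holds the unmarked original.

```python
import random

def noise_pattern(key, length):
    """Reproducible +1/-1 pattern derived from a per-copy key (hypothetical scheme)."""
    rng = random.Random(key)
    return [rng.choice((-1, 1)) for _ in range(length)]

def embed(samples, key, amplitude=2.0):
    """Add low-amplitude keyed noise to each sample of the stream."""
    pattern = noise_pattern(key, len(samples))
    return [s + amplitude * p for s, p in zip(samples, pattern)]

def correlation(suspect, original, key):
    """Correlate the difference (suspect - original) with the pattern for this key."""
    pattern = noise_pattern(key, len(original))
    residual = [s - o for s, o in zip(suspect, original)]
    return sum(r * p for r, p in zip(residual, pattern)) / len(original)

def trace(suspect, original, candidate_keys):
    """Return the candidate key whose pattern best matches the suspect copy."""
    return max(candidate_keys, key=lambda k: correlation(suspect, original, k))
```

In this sketch, a publisher would call `embed` once per sold copy with a distinct key, and later call `trace` on a suspect copy to find which key, and therefore which purchaser, it most likely came from.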
Although watermarking can be effective with audio and video, it is not easily adaptable to text. This is because text generally becomes unreadable in the presence of even the smallest noise in the data representing the text—a 1-bit noise element changes a given letter to a completely different letter. Although there is some redundancy in formatted text—for instance, in the formatting itself—such redundancy can be easily removed and reinserted, meaning that it is not useful for holding watermarks. Thus, watermarking has not been used successfully in conjunction with textual works.
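The fragility of text under bit-level noise can be seen in a small example. This is only an illustration, assuming plain ASCII text; the helper name `flip_bit` is hypothetical.

```python
def flip_bit(text, char_index, bit):
    """Flip one bit (0 through 6) of one ASCII character and return the new text."""
    data = bytearray(text.encode("ascii"))
    data[char_index] ^= 1 << bit
    return data.decode("ascii")

# 'c' is 0x63; flipping its lowest bit gives 0x62, which is 'b'.
print(flip_bit("cat", 0, 0))  # prints "bat"
```

A single flipped bit therefore yields a different, fully legible word rather than an imperceptible distortion, which is why noise-based marks do not carry over directly from audio samples to character data.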
Furthermore, the extent to which watermarking, even in audio and video, can be overcome by simply playing the original work in analog format and re-recording the work from the analog presentation is not well understood. Such an attack may have the potential to erase or otherwise degrade the watermark.
Another method of detecting copy violations involves actually searching the Internet for documents containing significant portions of protected works. This can be facilitated by the use of so-called “sketches” of textual matter, described in an article entitled “Syntactic Clustering of the Web,” by Andrei Broder, Steve Glassman, Mark Manasse, and Geoffrey Zweig, in Proceedings of the Sixth International World Wide Web Conference, April, 1997, pages 391–404. Using this scheme, a sketch is prepared of each work that is to be protected. A sketch is simply a list of hash values, wherein each hash value is created based on a different textual string of the base text. Each such string preferably encompasses a number of words, such as a sentence, paragraph, or some arbitrary number of characters. In the embodiment described in the article, a document is broken into a number of overlapping text segments or substrings, and a hash is calculated for each segment. The twenty smallest hash values are then chosen, and stored to create a sketch. Sketches of documents found on the Internet are then compared with the sketches of the works to be protected to determine whether some are substantially the same. Documents and works are considered the same if more than a given number of their twenty hash values match.
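The sketch scheme summarized above can be approximated in a few lines. This is a simplified sketch under stated assumptions, not the implementation from the cited article: SHA-1 is used here as a stand-in hash function, the segment width of four words and the match threshold of fifteen are hypothetical values, and all function names are illustrative.

```python
import hashlib

def segments(text, width=4):
    """Break a document into overlapping segments of `width` consecutive words."""
    words = text.lower().split()
    if len(words) < width:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + width]) for i in range(len(words) - width + 1)]

def hash_value(segment):
    """Map a text segment to a 64-bit integer (SHA-1 is an assumed stand-in)."""
    digest = hashlib.sha1(segment.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def sketch(text, size=20, width=4):
    """Keep the `size` smallest distinct hash values of the document's segments."""
    return sorted({hash_value(s) for s in segments(text, width)})[:size]

def matching_values(sketch_a, sketch_b):
    """Count hash values that appear in both sketches."""
    return len(set(sketch_a) & set(sketch_b))

def substantially_same(sketch_a, sketch_b, threshold=15):
    """Treat two documents as the same if more than `threshold` values match."""
    return matching_values(sketch_a, sketch_b) > threshold
```

Because only twenty integers are stored per document, comparing a protected work against a candidate document reduces to a small set intersection rather than a full text comparison.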
Although “sketches” such as those described above allow a more efficient comparison of documents, the described method still requires either that potentially violating works be known ahead of time or that the publisher undertake costly Internet searching. Furthermore, this method cannot find illegal copies that are not visible in an Internet search, such as copies that are e-mailed rather than distributed on publicly accessible Internet sites.
The scheme described below addresses some of the shortcomings of these prior methods, in a system that is effective and easy to implement.