The present invention relates to a method for embedding additional information, including text data, i.e., so-called electronic watermark information, in an electronic document, a method for preventing the destruction of such embedded information, a method for preventing the re-use of such embedded information, and a system therefor.
As a large amount of information can be distributed across the Internet, or by using CD-ROMs, businesses that provide services for the conduct of electronic searches and for the distribution of documents containing digital data have become important. To ensure the safe development of such businesses, techniques that can provide for the management of copyrighted material contained in digital documents that are distributed and that can protect the rights of owners are indispensable. Such techniques are also required by companies that wish to protect secret material contained in digital documents, and to find and trace routes along which secrets may have been leaked.
The techniques applied for managing copyrighted electronic data can be roughly broken down into the two techniques of access control, for which encryption and authentication are employed, and electronic watermarking. The aim of the first technique is to ensure that access to the contents of selected digital material is limited to those users who pay for the privilege, or to users whose employment of the material is controlled by a manager. The latter technique provides a function by which the secondary outflow of decoded data contained in digital documents can be prevented, or can be traced. These two techniques must be combined in order to provide for the rigorous management of copyrighted material.
Among the various types of media, there is a very large demand for the use of the electronic watermark technique for text data that are distributed in volume. However, since in pure text data there is little redundancy in the expression of information, it is very difficult to embed information that supplements the original contents, i.e., electronic watermark information. In "Proposal for the digital watermarking of PostScript and PDF documents," Ryujiro Shibuya, Yuichi Kaji and Tadao Kasa, SCIS98-9.2.E (prior art 1), Japanese Unexamined Patent Publication No. Hei 7-222000 (prior art 2) and Japanese Unexamined Patent Publication No. Hei 6-324625 (prior art 3), a technique is proposed whereby watermark information is embedded in the document description, to include appearance and layout, while the focus is on the fact that a page description form, such as PS (PostScript) or PDF (Portable Document Format), tends to be employed for the actual distribution of text data. In the above prior art, slight changes in line spacing and word spacing and in fonts are employed to embed information in documents.
However, it is difficult to use the above described conventional technique to manage copyrights, or to specify a route along which secrets may have been leaked, unless the following two conditions are satisfied.
1. Detection of a watermark in data contained in multiple documents can be performed only by a user who possesses a common detection key.
2. The technique is sufficiently robust that the watermark withstands format conversions during ordinary distribution processing and attempts at destruction by an unauthorized user.
However, prior art 1 does not teach a specific detection method that can satisfy condition 1. And the method described in prior arts 2 and 3 requires a comparison with the original document data, except for a method for manipulating the base line of character lines. Since information for watermark detection must be recorded and managed for each document in which a watermark has been embedded, in a large system this method is difficult to use. None of the above methods supports the protection system for which a key is used (a system according to which only a key owner is permitted to detect a watermark).
As for condition 2, prior art 2 reports only a study of the re-scanning of printed data, and no consideration has been given to making a watermark in page description data sufficiently robust to withstand destruction. In practice, the specifications of many page description formats are open to the public, so a watermark embedded in such data may be destroyed. For example, a watermark embedded in a line space by manipulating the base line can be easily destroyed by slightly adjusting the positioning of the individual lines so that a constant width is maintained. In addition, the pure text data alone, in which no watermark information is embedded, may be extracted from page description data and employed.
In Japanese Unexamined Patent Publication No. Hei 8-348426 (prior art 4), a method is proposed for embedding a watermark using the statistical property of two sequences of locations. Although the technique described in prior art 4 is not an invention related to the electronic watermarking of text, this technique satisfies condition 1, and as far as condition 2 is concerned, it enables sufficiently robust embedding when the locations are changed at random. However, it is difficult to adapt this technique to page description. Unlike the embedding of a watermark in an image, when this technique is employed for page description it is not obvious how the sequence of locations should be designated. An object whose location is to be adjusted must be uniquely identified when embedding a watermark. The page description constitutes a set of page description objects (characters or character strings), including positional information, and does not include information for identifying and ordering the individual elements. While, for an image, pixels and small domains can be specified by using X and Y coordinates, in page description an object whose location can be adjusted is not always present in a specific domain designated by coordinates once the document or the page changes. Since in page description the order in which objects are positioned on a page image does not affect the appearance of the image, the order in which the objects appear in a file format is of no help when an object is being specified. Indeed, the order in which objects appear in a file may be changed as the result of format conversion or of an attempt by a third party to destroy the watermark (an attack).
Furthermore, none of the above prior art examples resolves the problem whereby only the pure text data are extracted from page description data. Since the specifications of the page description formats distributed across the Internet are open to the public, one need only write a suitable program to extract the text data mechanically. In addition, the display software for page description frequently supports the delivery of data to another program using Cut and Paste. In this case, an ordinary user can extract text. The PDF display software employs a password to control access permission and to prohibit the use of Cut and Paste. In the current system, however, if printing is permitted, only a PDF->PS->PDF conversion (information for managing a password is omitted through the conversion into PS) need be performed to remove protection. For some applications, therefore, text may be extracted from page description data and illegally traded.
It is, therefore, one object of the present invention to provide a method and a system for embedding information in document data that include text written in a page description language.
It is one more object of the present invention to provide a method and a system for detecting embedded information in document data that include text written in a page description language.
It is another object of the present invention to provide a method for embedding an electronic watermark in document data that include text written in a page description language, whereby a common detection key is employed to detect electronic watermarks in multiple documents, and a system therefor.
It is an additional object of the present invention to provide a method for embedding information in document data that is sufficiently robust to withstand a format conversion during common data distribution processing and an attack mounted by an unauthorized user, and a system therefor.
It is a further object of the present invention to provide a method for embedding information in document data whereby an object whose feature is to be manipulated can be uniquely identified, and a system therefor.
It is one further object of the present invention to provide a method and a system for preventing the extraction of text from page description data.
It is yet one more object of the present invention to provide a method and a system for embedding in document data, as a watermark, information that represents a copyright.
It is yet another object of the present invention to provide a method and a system for embedding information in document data, and for preventing, through mechanical processing, the removal from a document of an electronic watermark.
To achieve the above objects, first, an analysis is made of the layout of the document data in which information is to be embedded. Then, based on the analysis of the layout, a sequence of locations is generated whereat the information is to be embedded. A page description of the text at a designated location is changed in accordance with the embedded information. As a result, the information is embedded in document data that include text written in a page description language. The sequence of locations is generated by producing a string of sequential pseudo-random numbers.
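The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: a page is modeled as a list of hypothetical `(text, x, y)` tuples, the "layout analysis" is reduced to sorting objects top to bottom and left to right, and one bit is embedded by nudging an object's x position by a hypothetical `delta`. The function names are invented for this example.

```python
import random

def layout_order(objects):
    """Order text objects by layout position, ignoring their order in the file.

    Each object is a (text, x, y) tuple. y is assumed to grow upward,
    so sorting by -y reads the page from top to bottom.
    """
    return sorted(objects, key=lambda o: (-o[2], o[1]))

def location_sequence(num_objects, key, count):
    """Generate `count` distinct layout indices from a keyed pseudo-random stream."""
    rng = random.Random(key)  # the integer key acts as the shared secret
    return rng.sample(range(num_objects), count)

def embed_bits(objects, key, bits, delta=0.2):
    """Shift the x position of each selected object by +delta (bit 1) or -delta (bit 0)."""
    out = layout_order(objects)
    for idx, bit in zip(location_sequence(len(out), key, len(bits)), bits):
        text, x, y = out[idx]
        out[idx] = (text, x + (delta if bit else -delta), y)
    return out
```

Because the locations are indexed in layout order rather than file order, reordering the objects inside the file does not change which objects are selected, provided `delta` is small compared with the spacing between objects.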
When, for example, a statistical method (prior art 4) is employed to embed electronic watermark information in a page description language, such as PDF, two sequences of locations are designated based on a description of the layout structure of a document. Using the layout description, descriptive data can be provided by which a pair of designated locations exhibits a strong correlation across multiple documents, each of which has a different layout, and the reliability of the embedding process can be increased.
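The general idea of a statistical method that uses two sequences of locations can be illustrated as below. The details of prior art 4 are not given here, so this sketch substitutes a generic "raise one set, lower the other" scheme as an assumption: word spacings at the locations in sequence A are increased by a hypothetical `delta`, those in sequence B are decreased, and the detector compares the two means. All names and parameters are illustrative.

```python
import random
from statistics import mean

def two_sequences(n, key, pairs):
    """Derive two disjoint location sequences A and B from a keyed PRNG."""
    rng = random.Random(key)
    picks = rng.sample(range(n), 2 * pairs)
    return picks[:pairs], picks[pairs:]

def embed(spacings, key, pairs, delta):
    """Increase spacing at A locations and decrease it at B locations."""
    a, b = two_sequences(len(spacings), key, pairs)
    out = list(spacings)
    for i in a:
        out[i] += delta
    for j in b:
        out[j] -= delta
    return out

def statistic(spacings, key, pairs):
    """Difference of mean spacing between A and B.

    For a document marked with this key the value is close to 2 * delta;
    for an unmarked document with similar spacings it is close to zero.
    """
    a, b = two_sequences(len(spacings), key, pairs)
    return mean(spacings[i] for i in a) - mean(spacings[j] for j in b)
```

A detector would compare `statistic(...)` against a threshold chosen from the expected spread of spacings in unmarked documents; only a holder of the key can regenerate A and B and compute it.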
To detect information embedded in document data, first the layout of the document data in which information is embedded is analyzed. Then, a sequence of locations whereat the information is embedded is generated based on the analyzed layout. The embedded information is obtained from a page description entered for text at a determined location. The sequence of locations is generated by employing a string of sequential pseudo-random numbers used for embedding information.
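As a rough sketch of detection without reference to the original document, the example below regenerates the location sequence from the same key and reads each bit from the page description values at those locations. The encoding is an assumption made for illustration: each bit is carried by a pair of word spacings, and the bit is 1 when the first spacing of the pair is the larger. The names `pair_sequence`, `embed`, and `detect` are hypothetical.

```python
import random

def pair_sequence(n, key, nbits):
    """Regenerate the keyed sequence of location pairs, one pair per bit."""
    rng = random.Random(key)
    picks = rng.sample(range(n), 2 * nbits)
    return list(zip(picks[0::2], picks[1::2]))

def embed(spacings, key, bits, delta=0.1):
    """Encode each bit in the order of two spacings while keeping their sum unchanged."""
    out = list(spacings)
    for (p, q), bit in zip(pair_sequence(len(out), key, len(bits)), bits):
        mid = (out[p] + out[q]) / 2
        hi, lo = mid + delta, mid - delta
        out[p], out[q] = (hi, lo) if bit else (lo, hi)
    return out

def detect(spacings, key, nbits):
    """Read the bits back using only the marked data and the key."""
    return [1 if spacings[p] > spacings[q] else 0
            for p, q in pair_sequence(len(spacings), key, nbits)]
```

Because the embedder and the detector seed the generator with the same key, they visit the same locations in the same order, which is what allows one detection key to serve many documents.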
To embed non-alterable information in document data, first a string of characters in which information is to be embedded is extracted from the text. Then, the extracted string of characters is broken down into smaller units. A page description that represents either a relative distance traveled from a reference point of the string of characters, or a relative distance traveled from a reference point of the previous character, is changed in accordance with the information to be embedded. Here, breaking down the original string of characters means either extracting the individual characters or forming smaller strings.
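The following sketch illustrates the second variant, in which each character is positioned relative to the reference point of the previous character. It assumes a hypothetical table `advances` of glyph advance widths and a hypothetical adjustment `delta`; a real page description would express these offsets with its own positioning operators, which are not modeled here.

```python
def split_with_offsets(text, advances, bits, delta=0.05):
    """Break a string into single characters, each with an offset from the previous one.

    The offset is the previous glyph's advance width, adjusted by +delta
    for a 1 bit or -delta for a 0 bit. The first character has offset 0.
    """
    units = []
    prev_advance = 0.0
    for i, ch in enumerate(text):
        adjust = 0.0
        if 0 < i <= len(bits):
            adjust = delta if bits[i - 1] else -delta
        units.append((ch, prev_advance + adjust))
        prev_advance = advances[ch]
    return units

def recover_bits(units, advances, nbits):
    """Read bits back by comparing each offset with the nominal advance width."""
    bits = []
    for (prev_ch, _), (_, dx) in zip(units, units[1:]):
        bits.append(1 if dx > advances[prev_ch] else 0)
    return bits[:nbits]
```

Since every unit now depends on an explicit offset for its position, a tool that discards or resets the offsets also discards the information needed to lay out the line.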
According to another aspect of the embedding of non-alterable information in document data, first, a layout of the data in a document in which information is to be embedded is analyzed. Then, one or more characters are selected from the analyzed layout. A font representing the selected character is created, and the page description is changed so that the font can be used to replace the selected character.
With these methods, page description data are constructed so that an unnatural change in the appearance of an object occurs when the object is mechanically relocated. An unauthorized user is thus prevented from manipulating an object in page description data by one of the following measures: the division of a character object set that constitutes one line, and storage of the divided object segments at separate locations in a PDF format; the embedding of dummy information by using a method whereby the appearance of an object is changed if the embedded information is deleted; or the use of a single font to express two characters.
In order to prevent the extraction of embedded information from document data that include text written in a page description language, first, a character is selected from the text. Then, a graphic primitive is created that represents the selected character, and page description data is changed so that the graphic primitive can be used to replace the selected character.
One part, or all, of the coded text in the page description data is changed so that it can be written using a special character code system, and is replaced by using a graphic primitive for display and printing. So long as a dedicated browser that understands a page description language is employed, in addition to other characters, a character replaced using the graphic primitive can be displayed or printed. However, since that character is not coded as text, another application cannot extract it from the page description data.
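The effect of replacing a character with a graphic primitive can be illustrated with a toy page model. In this sketch, which is an assumption and not an actual page description format, each page object is a `(kind, payload, x, y)` tuple where `kind` is either `"text"` or `"path"`, and `outlines` is a hypothetical table that maps a character to the outline drawn in its place.

```python
def replace_with_primitive(objects, index, outlines):
    """Replace the text object at `index` with a path object that draws the same glyph."""
    kind, ch, x, y = objects[index]
    if kind != "text":
        raise ValueError("only text objects can be replaced")
    out = list(objects)
    out[index] = ("path", outlines[ch], x, y)
    return out

def extract_text(objects):
    """Mimic a generic extraction tool, which sees only objects coded as text."""
    return "".join(payload for kind, payload, x, y in objects if kind == "text")
```

For a page holding the characters S, E, and C, replacing E leaves a renderer with the same three shapes to draw, while `extract_text` returns only the characters that are still coded as text.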