Protecting sensitive or valuable information has always been a key requirement of modern civilizations. The rapidly growing multimedia market and the widespread use of digital technologies, networks and computers have recently revealed unprecedented security threats, leading to an urgent need for securing documents. Among the major security issues initially pointed out, the most important are: the ease with which exact copies of digital content can be made without authorization; the effectiveness with which high-quality counterfeits can be made with ordinary document editing tools; and the ease with which the true originator or owner of a document can be faked for fraudulent purposes. However, despite the general diffusion of modern technologies and multimedia, textual content and hardcopy documents remain the most common and widely used information carriers in many everyday scenarios. Moreover, while textual or printed documents face security issues similar to those of electronic media, counterfeiters and criminal individuals or organizations benefit from progress in digital imaging technologies, as well as from the lack of technical expertise of many consumers and other actors in society. Digital watermarking was therefore considered in the 1990s as a possible approach to the above problems, at first mostly for digital multimedia content such as sound, images and video, and targeting copyright protection, authentication and integrity control.
Digital Watermarking of Visual Content
Numerous digital watermarking techniques have been developed for electronic raw or bitmap visual content (mostly still images, but also video); they process such content as continuous-tone grayscale or color visual information. These schemes mostly target copyright protection applications [1] [2] [3] [4] [5], as well as tamper proofing and authentication with localization capabilities [6] [7] [8] [9]. Other approaches specifically address printed media, usually by interacting with the image halftoning process used by most common printers, a technique called halftone or dither modulation [10] [11]. A highly robust multi-resolution self-reference watermarking scheme for images has also been designed [12]; it includes recovery from linear [13] and non-linear [14] geometrical distortions, as well as image authentication and tamper proofing with localization [15]. This technology is also covered by several patents [16] [17] [18].
Text Document Watermarking
However, protecting physical documents containing textual information has clearly become an issue of high importance, since printed material has proven to be a direct accessory to many criminal and terrorist acts; examples of such documents are identity authentication and transaction documents, which today are easy to forge or tamper with using modern technologies. The visual content watermarking schemes mentioned above can resist printing and rescanning, but at the price of a very low data-hiding rate. Such watermarks also result in a significant degradation of visual quality when the protected material is a computer-generated “artificial” image, a category that comprises synthetic images, flowcharts and industrial drawings, as well as text and, more generally, vector graphics.
Raw Image Document Watermarking
An early proposal for text document watermarking was to take one of the raw visual watermarking schemes above and apply it to a document converted to a high-resolution image, in order to overcome the visual quality problem mentioned above [3] [5]. In this context we can mention the work of Bhattacharjya and Ancin [19] [20], where selected pixels are modified in text and character areas in order to hide the information; the Rhoads patent, which proposes to embed a watermark in the picture elements of a document [21]; and the Koltai patent, which encodes the message by modifying the printed dither patterns, observable with a magnifying lens [22]. However, the use of high resolution makes it necessary to handle large amounts of data, which requires more computation and memory as well as high-quality printers and scanners; these conditions may be costly and thus difficult to meet in real-world applications.
Hard-Copy Robust Text-Based Document Watermarking
Approaches that are more text-specific than pixel-wise or point-wise modulation have therefore been developed; they hide information by changing characteristics that survive the printing/rescanning process. One popular class of techniques works by slightly modifying position- or geometry-related features of printed characters, lines or text, without altering the content of the text itself. These characteristics can be: relative positions or sizes of characters or lines; character fonts; spacing between characters, lines, or words; or margin alignment [23] [24] [25] [26] [27] [28] [29] [30]. For example, Alattar and Alattar use word/line spacing modulation [23]; Brassil et al. encode information by shifting the positions of groups of words or lines [24] [25] [26] [27]. We can also mention the work of Huang and Yan, which represents hidden information as sine waves in the average inter-word distance [31]. Techniques like spread spectrum and error correcting codes are often used to achieve some robustness against document reformatting. Kim et al. [32] try to achieve robustness by classifying words according to certain features, grouping adjacent words into segments which are also classified, and encoding the information by modifying some statistics of the inter-word spaces within these segments. A second major class of text watermarking schemes consists of semantic-based watermarking, which modifies the content of the text itself by replacing words or sentences with semantic equivalents or synonyms. A team at Purdue University developed a scheme which embeds information in the syntax or grammatical structure of natural language [33].
Concerning these two classes of techniques, we can remark that position- or geometry-based techniques are not always suitable for generation in common document editing tools. They are intrinsically sensitive to document reformatting (especially in the electronic version before printing), and to rescanning and rearranging/retyping. Redundancy-based techniques like error correcting codes (ECC) or spread-spectrum methods are used to achieve some robustness against these kinds of attacks. In contrast, text substitution techniques like the approach of [33] are robust against reformatting, but they are language-dependent, need natural language processing and large dictionaries, and are not acceptable in the many scenarios where the exact document content must be preserved. Finally, most text watermarking schemes achieve only low data-hiding rates, of one bit per word, per line, or per sentence.
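To make the position-based class more concrete, the following minimal Python sketch illustrates line-shift coding at one bit per carrier line. It is only an illustration under stated assumptions and not the implementation of any of the cited schemes: the function names `embed_line_shift` and `extract_line_shift`, the parameter `delta`, the use of unshifted neighbouring lines as references, and the assumption that the original baselines are uniformly spaced are all hypothetical choices made here.

```python
from typing import List


def embed_line_shift(baselines: List[float], bits: List[int],
                     delta: float = 1.0) -> List[float]:
    """Hide one bit per carrier line by shifting its baseline vertically.

    Assumptions (hypothetical, for illustration only):
    - `baselines` are the vertical positions of uniformly spaced text lines,
      with y growing downwards;
    - lines at odd indices carry data, lines at even indices stay in place
      and serve as references for the decoder;
    - bit 1 shifts the line down by `delta`, bit 0 shifts it up by `delta`.
    """
    marked = list(baselines)
    carriers = range(1, len(baselines) - 1, 2)
    if len(bits) > len(carriers):
        raise ValueError("not enough lines to carry all bits")
    for bit, i in zip(bits, carriers):
        marked[i] += delta if bit else -delta
    return marked


def extract_line_shift(baselines: List[float], n_bits: int) -> List[int]:
    """Recover bits by comparing the gaps above and below each carrier line."""
    carriers = list(range(1, len(baselines) - 1, 2))[:n_bits]
    bits = []
    for i in carriers:
        gap_above = baselines[i] - baselines[i - 1]
        gap_below = baselines[i + 1] - baselines[i]
        # A line shifted down is farther from the line above it.
        bits.append(1 if gap_above > gap_below else 0)
    return bits
```

Because the decoder only compares neighbouring gaps, this sketch tolerates a uniform rescaling or translation of the page, but any reformatting that changes the line spacing itself destroys the mark. It also shows the low capacity noted above: nine lines of text carry only four bits.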
Soft-Copy Document Watermarking
Soft-copy watermarking schemes work by modulating features of an electronically encoded document that are neither displayed nor printed. This can consist of inserting characters with the same color as the background, extra spaces and backspaces, any other invisible attributes, or even additional text tagged as invisible. Examples of soft-copy text document watermarking are given by the patents of Carro [34] and of Turner and Manikas [35]. We can also mention the patent of Hirayama et al. [36], which proposes to modify features of electronic documents such as the pitch between designated characters.
Although high data-hiding rates can be achieved (since an arbitrary number of invisible features can in principle be added within the electronic format), this approach is format-specific, is usually not robust to format conversion, and thus is not applicable to printed documents. It is very similar to the invisible Internet “bugs” that are sometimes hidden in web pages for advertising purposes and for tracking users' browsing habits. It can also be likened to adding headers to electronic documents, which do not survive digital-to-analog conversion.
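The fragility of soft-copy marks can be illustrated with a small Python sketch. It assumes one particular invisible feature, namely a single trailing whitespace character per line of a plain-text document (space for bit 0, tab for bit 1); this choice, as well as the function names `embed_trailing_ws` and `extract_trailing_ws`, is hypothetical and is not taken from the cited patents.

```python
from typing import List, Optional


def embed_trailing_ws(text: str, bits: List[int]) -> str:
    """Append one invisible trailing character per line: ' ' for 0, tab for 1.

    Hypothetical encoding chosen for illustration; the visible content of
    each line is left unchanged.
    """
    lines = text.split("\n")
    if len(bits) > len(lines):
        raise ValueError("not enough lines to carry all bits")
    out = []
    for i, line in enumerate(lines):
        line = line.rstrip(" \t")  # remove any pre-existing trailing blanks
        if i < len(bits):
            line += "\t" if bits[i] else " "
        out.append(line)
    return "\n".join(out)


def extract_trailing_ws(text: str, n_bits: int) -> List[Optional[int]]:
    """Read the mark back; None means the bit was lost."""
    bits: List[Optional[int]] = []
    for line in text.split("\n")[:n_bits]:
        if line.endswith("\t"):
            bits.append(1)
        elif line.endswith(" "):
            bits.append(0)
        else:
            bits.append(None)
    return bits
```

A trivial clean-up step such as stripping trailing whitespace from every line, which many editors and format converters perform, makes `extract_trailing_ws` return only `None` values, and printing the document obviously removes the mark as well.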
Other Techniques for Printed Document Protection
In this review of existing document watermarking schemes we can also mention more sophisticated techniques, like the work of Mikkilineni et al. [37], which hides extrinsic information by interacting with the physical printing process and uses printer-dependent physical defects as an intrinsic signature, principally for forensic purposes. Noting that text documents, taken as images, are essentially dual-tone and present high-frequency components, Liu et al. [38] explore the combination of spatial-domain watermarking (shifting words, lines, etc.) with frequency-domain detection: the document needs to be converted into a high-resolution image, but at the decoder side only. Last but not least, the patent of Jordan, Meylan and Kutter [39] describes a method that hides information in a printed document by placing sparse, imperceptible tiny dots on the sheet of paper according to a key-dependent pseudo-random spatial arrangement. The watermark, which actually consists of a set of points, is not embedded into the original electronic version of the document, and is typically added in a second pass to an already printed copy. This approach offers a good compromise between data-hiding rate and robustness. However, the need for two passes considerably constrains its practical usability, and the creation of a “sparse barcode” image can hardly be incorporated and stored directly in document editing tools.
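The general idea of a key-dependent pseudo-random dot arrangement can be sketched as follows in Python. This is not the method of the patent [39], whose encoding details are not described here: the grid size, the function names `embed_dots` and `extract_dots`, and the rule that each bit selects one of two key-chosen cells are assumptions introduced purely for illustration.

```python
import hashlib
import random
from typing import List, Optional, Set, Tuple

Cell = Tuple[int, int]  # (row, column) of a candidate dot position


def _cell_pairs(key: str, n_bits: int, rows: int, cols: int) -> List[Tuple[Cell, Cell]]:
    """Derive, from the secret key, one pair of distinct grid cells per bit."""
    if 2 * n_bits > rows * cols:
        raise ValueError("grid too small for the payload")
    seed = int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest(), "big")
    rng = random.Random(seed)  # same key -> same pseudo-random arrangement
    flat = rng.sample(range(rows * cols), 2 * n_bits)
    cells = [divmod(index, cols) for index in flat]
    return list(zip(cells[0::2], cells[1::2]))


def embed_dots(bits: List[int], key: str,
               rows: int = 200, cols: int = 150) -> Set[Cell]:
    """Return the set of cells where a tiny dot would be printed.

    Assumed rule: for each bit, the dot goes in the first cell of the
    pair for 0 and in the second cell for 1.
    """
    pairs = _cell_pairs(key, len(bits), rows, cols)
    return {pair[bit] for bit, pair in zip(bits, pairs)}


def extract_dots(dots: Set[Cell], n_bits: int, key: str,
                 rows: int = 200, cols: int = 150) -> List[Optional[int]]:
    """Recover the bits from detected dot positions; None marks an unreadable bit."""
    bits: List[Optional[int]] = []
    for zero_cell, one_cell in _cell_pairs(key, n_bits, rows, cols):
        has_zero, has_one = zero_cell in dots, one_cell in dots
        bits.append(None if has_zero == has_one else int(has_one))
    return bits
```

In this sketch the dots are sparse (one per bit on a grid of 30,000 cells), and without the key a reader does not know which cells to inspect. A practical system would additionally need registration of the scanned page and some redundancy against missing or spurious dots, neither of which is modelled here.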