Most companies and government agencies have a pressing need to protect sensitive information from falling into the wrong hands. Encryption, access restriction, and locking documents behind firewalls are some common techniques for protecting sensitive information. Encryption is an effective way to prevent an unauthorized person from viewing the content of a sensitive document. Nonetheless, once the document is decrypted for viewing using the secret key, an ill-intentioned authorized person may save, copy, print, or transmit the unencrypted document anywhere without any major difficulty. Restricting access to a document to only a few individuals works well when those individuals are trustworthy. Unfortunately, it is not uncommon to find secret documents circulating outside their trusted rings or even in the public media. In such cases, determining who the untrustworthy person is has often proven to be neither an easy nor a pleasant task. Firewalls are an effective means of keeping casual outsiders from accessing an organization's network. They also make it difficult for a savvy computer hacker to break in. Unfortunately, they cannot prevent an insider from copying a sensitive document onto a floppy disk or emailing it to an outsider using a third party internet service provider to avoid tracking.
Obviously, a comprehensive solution for ensuring the security of sensitive documents cannot rely on a single technique. Instead, it must employ all the aforementioned techniques and further utilize a way to fingerprint the document. Fingerprinting here refers to the embedding of an indiscernible identifier in the document, in order to identify the owner or the recipient of the document. In this case, it does not mean cryptographically hashing the document into a signature that can be used to uniquely identify the document, although this form of hashing can form part of the data embedded in the document and can be used to detect alteration of the text. The embedded identifier can be detected and decoded from a fingerprinted document whenever and wherever the document is encountered, even after D/A and A/D conversion (print and scan). This can be achieved through the use of digital watermarking techniques.
Image watermarking techniques can be applied to a text document, but depending on the scheme, they can introduce a noticeable white noise in the text document. This noise is due to the binary (black and white) nature of the text document and its large white background. Moreover, image watermarking techniques are applicable to image bitmaps, which implies that the text document must first be rasterized into an image before being embedded with an image watermark.
Techniques for embedding data in text have been described in the literature. Examples include:
1. Q. Mei, E. K. Wong, and N. Memon, “Data Hiding in Binary Text Documents,” Proceedings of the SPIE, Security and Watermarking of Multimedia Contents III, vol. 4314, San Jose, Calif., January 2001, pp. 369-375.
2. J. T. Brassil, S. Low, N. F. Maxemchuk, and L. O'Gorman, “Electronic Marking and Identification Techniques to Discourage Document Copying,” IEEE Journal on Selected Areas in Communications, vol. 13, no. 8, October 1995, pp. 1495-1504.
3. J. T. Brassil, S. Low, and N. F. Maxemchuk, “Copyright Protection for the Electronic Distribution of Text Documents,” Proceedings of the IEEE, vol. 87, no. 7, July 1999, pp. 1181-1196.
4. N. F. Maxemchuk and S. H. Low, “Performance Comparison of Two Text Marking Methods,” IEEE Journal on Selected Areas in Communications (JSAC), vol. 16, no. 4, May 1998, pp. 561-572.
5. N. F. Maxemchuk, “Electronic Document Distribution,” AT&T Technical Journal, September 1994, pp. 73-80.
6. N. F. Maxemchuk and S. Low, “Marking Text Documents,” Proceedings of the IEEE International Conference on Image Processing, Washington, D.C., Oct. 26-29, 1997, pp. 13-16.
7. S. H. Low, N. F. Maxemchuk, and A. M. Lapone, “Document Identification for Copyright Protection Using Centroid Detection,” IEEE Transactions on Communications, vol. 46, no. 3, March 1998, pp. 372-381.
8. S. H. Low and N. F. Maxemchuk, “Capacity of Text Marking Channel,” IEEE Signal Processing Letters, vol. 7, no. 12, December 2000, pp. 345-347.
9. Ding Huang and Hong Yan, “Interword Distance Changes Represented by Sine Waves for Watermarking Text Images,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 12, December 2001, pp. 1237-1245.
This disclosure describes a method for embedding hidden data in text documents. It also describes methods for automatically extracting the hidden information from electronic text documents and from scanned images of printed text documents. The embedding method forms a message, performs error correction coding on that message (e.g., repetition, BCH, turbo, block, or convolutional coding), spreads the error correction coded message over a carrier signal (e.g., a pseudorandom carrier signal), and then adjusts spaces or distances between words and lines based on the elements of the spread signal. The spreading may be achieved by selecting corresponding sets of carrier signals, XORing the message with sets of carrier signals, convolving the message with carrier signals, multiplying the message with carrier signals, etc.
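The embedding steps above can be illustrated with a minimal sketch. It assumes repetition coding as the error correction code, XOR with a seeded pseudorandom binary carrier as the spreading function, and a fixed adjustment `delta` applied to each inter-word space. The function names, the seed, and the units of `delta` are hypothetical choices made for illustration and are not taken from the disclosure.

```python
import random

def repetition_encode(bits, reps=3):
    """Error correction coding by simple repetition (one of the listed options)."""
    return [b for b in bits for _ in range(reps)]

def spread(coded_bits, chips_per_bit, seed):
    """Spread each coded bit over several chips by XORing it with a
    pseudorandom binary carrier generated from a shared seed (assumed key)."""
    rng = random.Random(seed)
    chips = []
    for b in coded_bits:
        for _ in range(chips_per_bit):
            chips.append(b ^ rng.randint(0, 1))
    return chips

def adjust_spaces(spaces, chips, delta=1.0):
    """Widen a space by delta for chip 1, narrow it by delta for chip 0.
    Spaces beyond the end of the chip sequence are left unchanged."""
    if len(chips) > len(spaces):
        raise ValueError("document has too few spaces for the spread message")
    marked = list(spaces)
    for i, chip in enumerate(chips):
        marked[i] += delta if chip else -delta
    return marked

# Usage: embed a 2-bit message into 30 nominal spaces of width 10.0
coded = repetition_encode([1, 0], reps=3)
chips = spread(coded, chips_per_bit=4, seed=7)
marked = adjust_spaces([10.0] * 30, chips, delta=1.0)
```

A practical embedder would likely redistribute the adjustments so that each line keeps its total length; this sketch ignores that constraint.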
The spaces may be adjusted up or down to encode ones and zeros. The spaces can be quantized into quantization bins associated with the symbol to be encoded at the particular location in the document.
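As a rough illustration of the quantization approach, the sketch below assumes bins centered on integer multiples of a step size, with even multiples representing a zero and odd multiples representing a one. The step size, the parity convention, and the function names are assumptions made for this example only.

```python
def quantize_space(space, bit, step=2.0):
    """Move a space to the center of the nearest bin whose index parity
    equals the bit (even index -> 0, odd index -> 1)."""
    k = round((space - bit * step) / (2 * step))
    return (2 * k + bit) * step

def read_bit(space, step=2.0):
    """Recover the bit from the parity of the bin the measured space falls in."""
    return int(round(space / step)) % 2

# Usage: embed into a 10.3-unit space, then read back after small distortion
zero_space = quantize_space(10.3, 0)   # 12.0, nearest even-index bin center
one_space = quantize_space(10.3, 1)    # 10.0, nearest odd-index bin center
assert read_bit(zero_space + 0.9) == 0
assert read_bit(one_space + 0.7) == 1
```

Under these assumptions the bit survives any measurement error smaller than half the step size, which suggests the trade-off: a larger step is more robust but changes the spacing more visibly.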
The detector operates either on the electronic version of the text document, or on an image scanned from a printed version of the document. The latter poses more challenges because of the distortions introduced by the printing and scanning processes, including noise and geometric distortion.
The detector measures the distances of the spaces between words and lines and maps these distances into symbol estimates. It then performs de-spreading and error correction decoding to retrieve the embedded message. One version of the detector predicts the modification of the space at an embedding location by computing an average space distance from spaces in the document and subtracting this average distance from the actual distance to predict whether the distance was adjusted up or down (corresponding to a one or zero, for example). Other forms of prediction filters may be used as well, such as a median filter, Laplacian filter, or some other filter based on a comparison of the space distance at the embedding location with neighboring space distances. Another version of the detector maps the measured distances into quantization bins, and determines the symbol value depending on which bin the measured distance corresponds to.
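The average-based version of the detector can be sketched as follows. This example assumes the carrier is a known sequence of +1/-1 values and that spreading was done by multiplying each bipolar message symbol with the carrier; de-spreading then correlates the mean-subtracted spaces with the same carrier. The function names and sample values are hypothetical, and error correction decoding is omitted.

```python
def estimate_chips(spaces):
    """Predict the embedded adjustment at each location by subtracting the
    average space distance: positive means widened, negative means narrowed."""
    mean = sum(spaces) / len(spaces)
    return [s - mean for s in spaces]

def despread(soft_chips, carrier, chips_per_bit):
    """Correlate each group of chip estimates with the +1/-1 carrier and
    take the sign of the sum as the recovered bit."""
    bits = []
    for start in range(0, len(carrier), chips_per_bit):
        acc = 0.0
        for j in range(start, start + chips_per_bit):
            acc += soft_chips[j] * carrier[j]
        bits.append(1 if acc > 0 else 0)
    return bits

# Usage: spaces measured from a document marked with bits [1, 0, 1],
# 4 chips per bit, nominal space 10.0, adjustment of +/-0.5
carrier = [1, -1, -1, 1, -1, 1, 1, -1, 1, 1, -1, -1]
spaces = [10.5, 9.5, 9.5, 10.5, 10.5, 9.5, 9.5, 10.5, 10.5, 10.5, 9.5, 9.5]
bits = despread(estimate_chips(spaces), carrier, chips_per_bit=4)
```

Summing over several chips before deciding each bit is what lets the detector tolerate measurement noise on individual spaces.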
In applications where the document is distorted, such as reading from a scanned image of a printed document, the detector performs synchronization to determine geometric orientation of the document relative to its original state at the time of embedding. In one implementation, it uses the direction and/or length of the text lines as a reference to compute the orientation. It also identifies the top, bottom, left and right margins. In particular, in one implementation, the detector computes horizontal and vertical signatures. The signatures are one dimensional projections of the document image, summing pixels in a particular direction.
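A minimal sketch of the projection signatures follows, assuming a binary image stored as a list of rows in which 1 marks an ink pixel. Which projection is called "horizontal" and which "vertical" is not specified here, so the sketch uses the neutral names `row_profile` and `col_profile`; the helper `find_runs` and its threshold are likewise hypothetical.

```python
def signatures(image):
    """Return two one-dimensional projections of a binary image:
    ink count per pixel row and ink count per pixel column."""
    row_profile = [sum(row) for row in image]
    col_profile = [sum(col) for col in zip(*image)]
    return row_profile, col_profile

def find_runs(profile, threshold=0):
    """Return (start, end) index pairs of consecutive entries above threshold.
    On the row profile these correspond to text lines; the first and last
    run boundaries give the margins."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(profile) - 1))
    return runs

# Usage: a tiny 5x6 "page" with two text lines
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 1, 0],
]
rows, cols = signatures(page)
text_lines = find_runs(rows)        # rows 1-2 and row 4
left, right = find_runs(cols)[0]    # columns 1 through 4 contain ink
```

For a skewed scan, one plausible use of these profiles is to recompute them at several candidate rotation angles and keep the angle that gives the sharpest line peaks, though that search is outside this sketch.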
The digital watermarking method can be used for a number of applications:
1. carrying a forensic tracking code to track the document to its printer, creator, or other source device or person;
2. carrying an index to a database with more information or programmatic actions to be associated with or performed on the document;
3. carrying a hash or other authentication data used to cross check other information derived from the document to detect and locate changes made to the document;
4. carrying one or more instructions that control machine or programmatic processing on the document, including copy control (e.g., controlling number of allowable copies, who may view the document, who may receive the document, who may transfer the document, etc.), document routing (e.g., via an email system, fax system, or other document transfer system).
More document management applications are described in U.S. patent application Ser. No. 10/639,598, filed Aug. 11, 2003, which is herein incorporated by reference.
Particular embedding methods described in this document do not require the original document for detecting the watermark, and they work with documents that contain left-aligned, centered, right-aligned, or justified paragraphs as well as regular or irregular line spacing. Justified paragraphs are common in electronic documents. To force the end of the last word of each line to lie precisely on the right margin, popular word processor programs automatically and systematically spread the words in each line. Irregular line spacing happens naturally as a result of in-line insertion of mathematical symbols and other objects. To accommodate the tallest object in each line, these word processors automatically adjust the spacing between the lines as necessary.
The invention provides methods for embedding auxiliary data into electronic text documents, and related methods for automatically reading the auxiliary data from electronic or printed text documents.
One aspect of the invention is a method of embedding an auxiliary message in an original electronic text document to form a watermarked text document. The method applies a spreading function to message symbols to spread the symbols over a carrier, which forms a modulated carrier. It maps elements of the modulated carrier to corresponding inter-word spaces in the electronic text document, and applies an embedding function to modify the corresponding inter-word spaces according to elements of the modulated carrier signal such that the modified inter-word spaces hide the modulated carrier signal in the watermarked text document. The message symbols are automatically decodable from the watermarked document without the original electronic text document.
Another aspect of the invention is a method of decoding an auxiliary message in a watermarked text document. This method automatically measures inter-word spaces in the watermarked text document. It estimates elements of a modulated carrier signal embedded in the inter-word spaces to form an estimated modulated carrier signal, and applies a de-spreading function to the estimated modulated carrier signal to extract message symbols.
Further features will become apparent with reference to the following detailed description and accompanying drawings.