1. Field of the Invention
This invention relates to a method and apparatus for embedding data in text regions of a document in a visually imperceptible way, and to a method and apparatus for extracting such embedded data. The data embedding is particularly well suited to be implemented during copying of the document. The invention also relates to programs of instructions for implementing various aspects of the embedding and extracting processes.
2. Description of the Related Art
Since facilities for reproducing documents are widely available, it has become important in many situations to be able to track document reproduction. A way that has commonly been suggested is for the copier to somehow embed information that is not readily perceptible visually but can nonetheless be recovered by machine optical scanning. One proposed approach is to add a number of low-amplitude perturbations to the original image and then correlate those perturbations with images of suspected copies. If the correlations are as expected, then the suspected document is very probably a copy. However, this approach tends to introduce an element of judgment, since it is based on varying degrees of correlation. Also, it does not lend itself well to embedding actual messages, such as copier serial numbers.
Another approach is to employ half-toning patterns. If the dither matrices employed to generate a half-toned output differ in different segments of an image, information can be gleaned from the dither-matrix selections in successive regions. But this approach is limited to documents generated by half-toning, and it works best for those produced through the use of so-called clustered-dot dither matrices, which are not always preferred.
Both of these approaches are best suited to documents, such as photographs, that consist mainly of continuous-tone images. In contrast, the vast majority of reproduced documents consist mainly of text, so workers in this field have proposed other techniques, which take advantage of such documents' textual nature. For example, one technique embeds information by making slight variations in inter-character spacing. Such approaches lend themselves to embedding of significant amounts of information with essentially no effect on document appearance. However, such approaches are not well suited for use by photocopiers, which do not receive the related word processor output and thus may not be able to identify actual text characters reliably.
Thus, what is needed is a data embedding technique that exhibits advantages of text-based approaches in a way that is more flexible and robust than traditional approaches.
Therefore, it is an object of the present invention to provide a technique for identifying blocks comprised mainly of pixels that meet certain criteria typical of text-character parts and embedding the intended message by selectively labeling text pixels in blocks thus identified with a particular color.
It is another object of this invention to provide a technique for extracting a message embedded using the above embedding technique.
According to one aspect of this invention, a method for embedding a message in a text-containing document is provided. The method comprises the steps of obtaining a pixel representation of the document; identifying text pixels of the document; determining each text line of the document; partitioning each determined text line into a plurality of blocks; identifying each block as valid if that block contains at least a predetermined percentage of text pixels; and embedding a binary element in each valid block by labeling text pixels within that block with a first color or a second color to embed the message in the document.
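By way of illustration only, the block-validation and labeling steps of the embedding method may be sketched as follows for a single text line. This is a minimal sketch under stated assumptions, not the claimed implementation: the threshold values, the two labeling colors, the use of blocks spanning the full line height, and the names `embed_in_line`, `is_text`, `TEXT_THRESHOLD`, `MIN_TEXT_FRACTION`, `COLOR_ZERO`, and `COLOR_ONE` are all hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), 0-255 per channel

# All constants below are illustrative assumptions, not values from the text.
TEXT_THRESHOLD = 128             # mean intensity below this counts as text
MIN_TEXT_FRACTION = 0.5          # stands in for the "predetermined percentage"
COLOR_ZERO: Pixel = (0, 0, 40)   # assumed first color, embeds binary 0
COLOR_ONE: Pixel = (40, 0, 0)    # assumed second color, embeds binary 1


def is_text(p: Pixel) -> bool:
    """Treat dark pixels as text pixels (simple intensity threshold)."""
    return (p[0] + p[1] + p[2]) / 3 < TEXT_THRESHOLD


def embed_in_line(line: List[List[Pixel]], bits: List[int], block_w: int) -> int:
    """Embed bits into one text line in place; return how many bits were embedded."""
    height, width = len(line), len(line[0])
    used = 0
    # Partition the line into side-by-side blocks of block_w columns each.
    for x0 in range(0, width - block_w + 1, block_w):
        if used == len(bits):
            break
        coords = [(y, x) for y in range(height) for x in range(x0, x0 + block_w)]
        text = [(y, x) for (y, x) in coords if is_text(line[y][x])]
        if len(text) / len(coords) < MIN_TEXT_FRACTION:
            continue  # too few text pixels: block is not valid, skip it
        color = COLOR_ONE if bits[used] else COLOR_ZERO
        for y, x in text:
            line[y][x] = color  # only text pixels are relabeled
        used += 1
    return used
```

Because both assumed colors are very dark, relabeled pixels still pass the same text threshold, which is consistent with the requirement that the labeling be visually imperceptible while remaining machine-detectable.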
The message is one or more characters in length, with each character being represented by one or more binary elements (e.g., a bit stream). In a preferred embodiment, each character of the message is comprised of a first binary element sequence, and each such first binary element is, in turn, comprised of a second binary element sequence. Preferably, the bit stream of only one character is embedded in each text line but that character's bit stream is embedded multiple times in that text line. Depending on the number of lines of text in the document and the number of characters in the message, one or more of the characters may be embedded in more than one text line.
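The one-character-per-line arrangement may be sketched as follows. The 8-bit character code, the cyclic assignment of characters to text lines, and the names `char_to_bits`, `bits_for_line`, and `BITS_PER_CHAR` are assumptions made for illustration; the nested second binary element sequence of the preferred embodiment is not modeled here.

```python
from typing import List

BITS_PER_CHAR = 8  # assumption: each character is an 8-bit code


def char_to_bits(ch: str) -> List[int]:
    """Return the character code as bits, most significant bit first."""
    code = ord(ch)
    return [(code >> i) & 1 for i in range(BITS_PER_CHAR - 1, -1, -1)]


def bits_for_line(message: str, line_index: int, valid_blocks: int) -> List[int]:
    """Bits for one text line: a single character, repeated as often as fits."""
    # Assumed rule: characters are assigned to lines cyclically, so a
    # document with more lines than characters repeats characters.
    ch = message[line_index % len(message)]
    repeats = valid_blocks // BITS_PER_CHAR  # whole copies only
    return char_to_bits(ch) * repeats
```

For example, under these assumptions a two-character message in a three-line document places the first character in the first and third lines, and a line with 20 valid blocks carries two complete copies of its character.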
Preferably, each valid block of a particular text line has a predetermined embedding order, e.g., a column-wise raster order.
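One possible reading of a column-wise raster order is sketched below: blocks are visited one column at a time from left to right, and from top to bottom within each column. This interpretation, the `(row, col)` block representation, and the name `column_wise_order` are assumptions; the same ordering would have to be applied at extraction for the binary elements to be recovered in sequence.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # (row, col) position of a block within a text line


def column_wise_order(blocks: List[Block]) -> List[Block]:
    """Sort blocks column by column (left to right), top to bottom within a column."""
    return sorted(blocks, key=lambda b: (b[1], b[0]))
```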
Another aspect of the invention involves a method for extracting a message embedded in text of a document. The method comprises the steps of obtaining a pixel representation of the document; forming a first representation of the document in which pixels are classified to locate blocks of pixels in which data is embedded; forming a second representation of the document to extract text lines and identify text pixels; comparing the second representation with the first representation to identify clusters of first and second colored pixels in each text line to determine the location of embedded binary elements of the message; sorting the identified first and second colored clusters in each text line in accordance with a predetermined embedding order; converting the sorted first and second colored clusters in each text line into a sequence of binary elements; and decoding the sequence of binary elements in each text line to determine an embedded character of the message.
Preferably, the sequence of binary elements in each text line is comprised of a plurality of subsets of binary elements, each of which is representative of a character of the message, and the step of decoding preferably comprises performing majority voting in each text line to determine the character of the message embedded in that text line.
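The majority-voting step may be sketched as follows, assuming each subset is an 8-bit character code and that the vote is taken over whole decoded subsets rather than over individual bit positions. Both choices, and the names `decode_line` and `BITS_PER_CHAR`, are assumptions for illustration.

```python
from collections import Counter
from typing import List, Optional

BITS_PER_CHAR = 8  # assumption: each subset is one 8-bit character code


def decode_line(bits: List[int]) -> Optional[str]:
    """Decode one text line by majority vote over its repeated character copies."""
    usable = len(bits) - len(bits) % BITS_PER_CHAR  # drop a trailing partial copy
    votes: Counter = Counter()
    for i in range(0, usable, BITS_PER_CHAR):
        code = 0
        for b in bits[i:i + BITS_PER_CHAR]:
            code = (code << 1) | b  # most significant bit first
        votes[chr(code)] += 1
    if not votes:
        return None  # no complete copy was recovered from this line
    return votes.most_common(1)[0][0]
```

With three copies in a line, a single corrupted copy is outvoted by the two intact ones. Ties are not treated specially in this sketch; the candidate encountered first in the line is returned.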
The step of forming the first representation preferably comprises filtering and sharpening the pixel representation of the document. In a preferred embodiment, where the pixel representation comprises multiple color components to define corresponding multiple color planes, the filtering is applied on each color plane. The step of forming the first representation also preferably comprises classifying each of the pixels of the document as a first colored pixel, a second colored pixel, or neither. The step of forming the second representation preferably comprises thresholding the pixel representation of the document to identify text pixels.
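The per-plane filtering, the three-way pixel classification, and the thresholding may be sketched as follows. The 3x3 mean filter, the channel-dominance rule used to recognize the two colors, the numeric constants, and the names `smooth`, `classify`, and `is_text` are all assumptions; the sharpening step is omitted for brevity.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), 0-255 per channel
Image = List[List[Pixel]]

TEXT_THRESHOLD = 128  # assumed intensity threshold for text pixels
COLOR_MARGIN = 20     # assumed minimum channel dominance for a colored pixel


def smooth(image: Image) -> Image:
    """3x3 mean filter applied independently to each color plane."""
    h, w = len(image), len(image[0])
    out: Image = []
    for y in range(h):
        row = []
        for x in range(w):
            ys = range(max(0, y - 1), min(h, y + 2))
            xs = range(max(0, x - 1), min(w, x + 2))
            n = len(ys) * len(xs)  # neighborhood shrinks at image borders
            row.append(tuple(
                sum(image[j][i][c] for j in ys for i in xs) // n
                for c in range(3)
            ))
        out.append(row)
    return out


def classify(p: Pixel) -> str:
    """Label a pixel 'first', 'second', or 'neither' (first representation)."""
    r, g, b = p
    if b - max(r, g) >= COLOR_MARGIN:
        return "first"   # assumed first color: blue-dominant
    if r - max(g, b) >= COLOR_MARGIN:
        return "second"  # assumed second color: red-dominant
    return "neither"


def is_text(p: Pixel) -> bool:
    """Threshold a pixel to decide whether it is text (second representation)."""
    return sum(p) / 3 < TEXT_THRESHOLD
```

Under these assumptions, the first representation would be obtained by applying `classify` to every pixel of `smooth(image)`, and the second by applying `is_text` to every pixel of the scanned image.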
In other aspects of the invention, apparatuses are provided for embedding a message in a text-containing document and for extracting a message so embedded. Each such apparatus is comprised of various circuits to perform message embedding or extracting operations.
In accordance with further aspects of the invention, each of the above-described methods, or steps thereof, may be embodied in a program of instructions (e.g., software) which may be stored on, or conveyed to, a computer or other processor-controlled device for execution. Alternatively, each such method may be implemented using hardware or a combination of software and hardware.
Although the data embedding approach of the present invention depends on the existence of regions meeting certain criteria, it is not dependent on reliably knowing those regions actually do contain character parts. It can therefore be employed advantageously by photocopiers. Moreover, since the color variations in which the message is embedded occur in regions that are parts of text characters, the variations can be significant from a machine point of view, but do not affect the document's appearance to a human reader.
Other objects and attainments together with a fuller understanding of the invention will become apparent and appreciated by referring to the following description and claims taken in conjunction with the accompanying drawings.