The patent application relates to methods and apparatus for embedding data, such as in a document, and recovering the embedded data therefrom.
Digital data hiding has become a hot topic in the signal processing research community. Various processes have been proposed to hide data in images, video files, audio files, and the like. However, relatively little existing art is dedicated to hiding data in text and/or documents. One reason may be that changes to pixel values of text documents generally result in visible artifacts, referred to here as salt-and-pepper noise.
It is noted that digital representations of documents may represent one pixel by a single bit, although, of course, this is merely an example. In one such arrangement, “1” may represent white, and “0” may represent black. Because a one-bit pixel offers no intermediate values in which a small change might be concealed, hiding data in text documents may benefit from approaches different from those used to embed data in image, video, and/or audio files, for example.
One existing approach employs a block-based method to hide bits in document images. In this approach, pixel patterns are predefined as “flippable” or “non-flippable”. For flippable patterns, certain pixels may be flipped without resulting in a significant amount of perceptible artifacts. Data is embedded by flipping flippable pixels so that the total number of black pixels in a block is even or odd, according to whether a bit ‘1’ or a bit ‘0’ is to be embedded in the block.
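The parity rule described above may be illustrated by the following minimal Python sketch. It is offered only as an illustration under stated assumptions, not as the implementation of the existing approach: the names `embed_bit_in_block` and `extract_bit_from_block` are hypothetical, the list of flippable positions is assumed to be supplied by some separate pattern classifier, and the mapping of an odd black-pixel count to bit 1 is an assumed convention that the description above does not fix.

```python
from typing import List, Tuple

# A block is a small grid of one-bit pixels: 1 = white, 0 = black.
Block = List[List[int]]


def count_black(block: Block) -> int:
    """Count the black (0) pixels in a block."""
    return sum(row.count(0) for row in block)


def embed_bit_in_block(block: Block, bit: int,
                       flippable: List[Tuple[int, int]]) -> Block:
    """Return a copy of the block whose black-pixel parity encodes the bit.

    Assumed convention: odd count encodes 1, even count encodes 0.
    `flippable` lists (row, column) positions judged safe to flip;
    how they are chosen is outside the scope of this sketch.
    """
    marked = [row[:] for row in block]
    if count_black(marked) % 2 != bit:
        if not flippable:
            raise ValueError("block has no flippable pixel")
        r, c = flippable[0]
        marked[r][c] ^= 1  # flipping one pixel changes the parity
    return marked


def extract_bit_from_block(block: Block) -> int:
    """Recover the bit from the parity of the black-pixel count."""
    return count_black(block) % 2
```

In this sketch, at most one pixel per block is changed, and extraction does not need the unmarked block. A single pixel flipped by noise in a marked block would, however, invert the recovered bit.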
Another existing approach employs “boundary modifications,” in which 100 pairs of five-pixel-long boundary patterns are defined. Each pair contains two different patterns, an ‘A’ pattern and a ‘D’ pattern, either of which may be exchanged for the other. By switching between the two patterns of a pair, a bit may be embedded into a five-pixel-long boundary.
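The pairing idea may be sketched as follows. This is a simplified illustration only: the two five-element patterns shown are hypothetical placeholders rather than any of the 100 pairs referred to above, a boundary segment is modeled as a flat tuple of one-bit pixels, and the assignment of the ‘A’ pattern to bit 0 and the ‘D’ pattern to bit 1 is an assumption.

```python
from typing import Tuple

Pattern = Tuple[int, ...]

# Hypothetical pair, for illustration only (1 = white, 0 = black).
PATTERN_A: Pattern = (1, 1, 0, 1, 1)
PATTERN_D: Pattern = (1, 1, 1, 1, 1)


def embed_bit_in_boundary(segment: Pattern, bit: int) -> Pattern:
    """Replace a matching segment with the pattern that encodes the bit.

    Assumed convention: 'A' encodes 0 and 'D' encodes 1.
    """
    if segment not in (PATTERN_A, PATTERN_D):
        raise ValueError("segment does not match either pattern of the pair")
    return PATTERN_D if bit else PATTERN_A


def extract_bit_from_boundary(segment: Pattern) -> int:
    """Read the bit back by checking which pattern is present."""
    if segment == PATTERN_A:
        return 0
    if segment == PATTERN_D:
        return 1
    raise ValueError("segment does not match either pattern of the pair")
```

Under these assumptions, extraction depends on the segment matching one of the two patterns exactly, so any distortion that alters a boundary pixel may prevent the bit from being recovered.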
The foregoing methods may be applied to arbitrary text documents and may generally embed between several hundred bits and several thousand bits depending, for example, on the size of the document in which the data is being embedded. However, for both existing methods, distortion inflicted on a “marked document,” here, a document into which data has been embedded, may make correct retrieval of the hidden information more difficult. Distortions to documents may result from a variety of sources including, but not limited to, photocopying, printing, and scanning, to name a few.
Three methods for hiding data in text documents have been proposed: line-shift coding, word-shift coding, and feature coding. Line-shift coding and word-shift coding are resilient against the effects of printing, copying, and scanning to some extent. One drawback, however, is that hidden data extraction is difficult without the original document, and the original document may not be available in many cases. In one existing approach, a baseline method for line-shift coding that does not require the original document is disclosed. However, the data embedding capacity of that method is quite limited, at about one bit for every two lines.