Text documents are stored in computers in digital form and are distributed by printing, scanning, copying, and so on. In fact, many paper documents (such as contracts and bills) are far more valuable than multimedia such as audio, video, and images. As devices such as computers, printers, and scanners have become widespread, copying and duplicating documents has become relatively easy. As a result, securing important text documents has become an urgent requirement.
On the one hand, it is difficult to trace the source of a duplicated text document that carries no protection. The copy machine, an indispensable piece of office automation equipment, is a particularly notable example. Modern copy machines produce high-quality copies and offer advanced functions; some have intelligent editing capabilities and can communicate with other peer devices. Some advanced copy machines can even copy banknotes. Such machines reproduce all kinds of documents with high quality, which greatly reduces the workload of transcription and improves efficiency. However, this capability creates a security problem for important documents: classified documents can easily be copied during transmission and thereby lose their confidentiality. The copy machine thus becomes a convenient tool for leaking or stealing secrets. In recent years, most of the classified documents intercepted by customs have been copies whose source cannot be determined, so the offenders cannot be convicted. If important information, such as the name of the person who printed the document, the name of the printer, the printing time, and the physical address of the computer, could be detected from the paper document, it would be easy to trace the source of the illegal transmission.
On the other hand, when a document is printed, some additional information is needed that should not appear in the text of the document but should be available for re-entry when necessary, for example important private information on a bank bill such as the credit line, deposit amount, and home address of a bank customer. Thus, a certain amount of information must be hidden in the document in advance, before it is printed. The information should be invisible to the human eye but conveniently obtainable with relevant scanning devices or specific reading tools when necessary. As a result, a large amount of repetitive input is avoided, and considerable manpower, material resources, and time are saved.
The above two problems are identical in essence: a text document is used as a carrier in which a certain amount of watermark information is hidden. When such a document is illegally duplicated and distributed with serious consequences, the hidden information can be used to trace the source of the crime. If the document is deliberately tampered with, the hidden information can be used as evidence in a prosecution for the infringement.
To this end, a technology has been developed that uses a digital image as the carrier of a digital watermark to hide information. First, an image in the document (such as a head portrait, a company logo, a background pattern, and the like) is selected. A pre-entered information string is embedded in the image by a special process, and the image is then output through a high-accuracy printer or printing device. When an original document is printed, scanned, or duplicated, unpredictable random noise is added to it; this noise cannot be accurately described by a mathematical model and depends on the inherent performance of devices such as printers, scanners, and copy machines. Moreover, in a printed image the watermark is often distorted by rotation, binarization, over-migration, obvious warping, geometric transformation, and so on. Although the scanned image can be processed to some extent, the error rate of information identification remains high. In particular, the detection result for a text document that has been duplicated multiple times is unacceptable. Others have attempted to design the watermark according to the inherent features of the text document, embedding it by changing the highly formatted file layout (such as letter shift or line shift) or the file format. However, this method also has serious defects. First, it requires a special and quite complicated process in the upstream text editing and layout software. Second, to avoid affecting the appearance of a normal document, the letter shift or line shift must not be too large. The scanned image therefore still suffers seriously from noise, and the watermark is again generally hard to detect. Third, since the number of lines in a document is usually fixed, the amount of information that can be hidden is relatively small. Finally, for a multi-page text, the method is even more complicated.
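The line-shift idea and its capacity limit can be illustrated with a minimal sketch. The text does not specify any particular encoding, so everything here is an assumption made for illustration: the function names, the choice to let every second interior line carry one bit while its neighbors stay fixed as references, and the one-pixel shift.

```python
def embed_line_shift(baselines, bits, shift=1):
    """Hypothetical line-shift embedding.

    baselines: vertical positions of the text lines, increasing down the page.
    Every second interior line carries one bit: it is moved down by `shift`
    for a 1 and up by `shift` for a 0. The lines in between are not moved
    and serve as references.
    """
    shifted = list(baselines)
    carriers = range(1, len(baselines) - 1, 2)
    for idx, bit in zip(carriers, bits):
        shifted[idx] += shift if bit else -shift
    return shifted


def extract_line_shift(positions):
    """Recover the bits by comparing each carrier line's gap to its neighbors."""
    bits = []
    for idx in range(1, len(positions) - 1, 2):
        gap_above = positions[idx] - positions[idx - 1]
        gap_below = positions[idx + 1] - positions[idx]
        bits.append(1 if gap_above > gap_below else 0)
    return bits


def capacity(num_lines):
    """Number of bits this scheme can hide in a page with num_lines lines."""
    return len(range(1, num_lines - 1, 2))
```

Under these assumptions a 40-line page holds only 19 bits, which reflects the small capacity noted above. Because the decoder compares gaps that differ by only twice the shift, scanning noise of similar magnitude is enough to flip a bit.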
In addition, the image-based method depends on the contents of the document. If the document contains no image, the watermark has no carrier. Even if a carrier exists, the image has to be copied out of the document and specially processed to hide the watermark in it, and it can be printed only after the document has been laid out again. Besides, this method requires a high-resolution device for printing. Therefore, it is not suitable for printing documents in an office.
In addition, a method for detecting the watermark, together with the operation of the corresponding device, is as follows: scanning a document to be detected and an original document to obtain corresponding images; pre-processing the images to compensate for factors such as attenuation, migration, zooming, obscuration, and the like, and in particular to eliminate salt-and-pepper noise and skew; and extracting the information hidden in the image based on the embedding method. This detection method has relatively strict requirements and involves a large amount of pre-processing, the accuracy of which directly influences the detection result. Moreover, the original image is needed, which complicates the detection process. Generally, because the conditions are so strict, the identification rate is low, especially for zoomed or duplicated documents.
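As an illustration of one pre-processing step mentioned above, the following sketch removes salt-and-pepper noise with a 3x3 median filter. The text does not name a specific filter; the median filter is a common choice for this kind of noise and is used here only as an assumption, and the function name and the list-of-lists image representation are hypothetical.

```python
from statistics import median


def median_filter_3x3(image):
    """Return a filtered copy of a grayscale image given as a 2-D list.

    Each interior pixel is replaced by the median of its 3x3 neighborhood,
    which discards isolated black or white specks. Border pixels are copied
    unchanged to keep the sketch short.
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            window = [
                image[y + dy][x + dx]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            out[y][x] = median(window)
    return out
```

A filter of this kind also alters fine detail, so a watermark embedded as very small marks or shifts can be damaged by the same step that removes the noise. This is one reason the accuracy of the pre-processing bears so directly on the detection result.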