Along with the growth of computer information technology, transmitting electronic files over networks has become increasingly common. However, electronic files are easy to copy, spread, or even tamper with. Therefore, encryption/decryption techniques from cryptography are utilized, in software or hardware, to encrypt/decrypt electronic files in order to ensure their security in transmission. Yet encryption/decryption technology cannot protect a decrypted electronic file from being copied and spread. In 1995, information hiding techniques were developed to solve this problem. For example, hidden information indicating certain attributes of an electronic file, e.g., copyright information, is inserted into the electronic file so that the file can be protected and traced when it is copied and spread. Among all information hiding techniques, the digital watermark has been a focus of research in recent years.
In the digital watermark technique, hidden tags are embedded into digital multimedia data through signal processing; the tags are usually invisible or inaudible and can only be extracted with a dedicated detector or reader. The digital watermark technique is an important branch of research on information hiding.
The information embedded into a piece of digital work shall possess the following features to be a digital watermark:

Imperceptibility: the embedded digital watermark should leave the digital work without obvious degradation and should be hard to perceive.

Security of the hiding position: the watermark information should be embedded directly into the file data, rather than into a header or wrapper, and should not be lost when the file is converted between data formats.

Robustness: the digital watermark shall remain intact or well distinguishable after multiple unintentional or intentional signal processing procedures. Possible signal processing includes channel noise, filtering, digital/analog and analog/digital conversion, re-sampling, cropping, dodging, scaling, and lossy compression encoding.
Tradeoffs exist between the embedded data quantity and the robustness of a digital watermark. An ideal watermark algorithm would hide a large amount of data and still resist a variety of channel noise and signal distortion; in practice, however, the two goals cannot be achieved at the same time. This does not hinder the application of the digital watermark technique, since typical applications focus on only one of the two goals. When the main purpose of an application is imperceptible communication, the data quantity is of primary importance; because such a watermark is extremely hard to perceive, it is unlikely to be attacked or manipulated by others, so robustness is less important. On the other hand, when data security is of primary importance, the robustness of the digital watermark is critical, because confidential data are constantly exposed to theft and manipulation, and the requirement for hidden data quantity becomes secondary.
Typical digital watermark algorithms in the prior art convert both the information to be embedded and the target data into images. Some of the typical digital watermark algorithms are as follows.
The Least Significant Bit (LSB) algorithm, the digital watermark algorithm introduced by L. F. Turner and R. G. van Schyndel, is a typical information hiding algorithm in the spatial domain. According to the algorithm, random signals are generated from a specified secret key through an m-sequence generator, arrayed into 2D watermark signals in accordance with certain rules, and inserted into the lowest bits of the corresponding pixels in an original image. Since the watermark signals are hidden in the lowest bits as very weak signals superposed on the pixels, the watermark is hardly visible or audible. An LSB watermark can be detected by performing operations on the image under detection and a watermarked image and making a statistical decision. Early digital watermark tools, e.g., Stego Dos, White Noise Storm, and S-Tools, are all LSB-based. The LSB algorithm allows a large amount of hidden information; however, the hidden information can be removed easily and thus fails the robustness requirement for a digital watermark. The LSB algorithm is therefore seldom used by modern digital watermark software. Nevertheless, as a method for hiding a large amount of data, the LSB algorithm remains important in covert communication.
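The insertion into the lowest bits can be sketched as follows; this is a minimal illustration on a short list of grayscale pixel values, with all names and data chosen for the example rather than taken from the cited algorithms:

```python
# Minimal LSB sketch: write watermark bits into the least significant bit
# of each pixel; the change per pixel is at most one gray level.

def lsb_embed(pixels, bits):
    """Replace the lowest bit of each pixel with a watermark bit."""
    return [(p & ~1) | b for p, b in zip(pixels, bits)]

def lsb_extract(pixels, n):
    """Read back the lowest bit of the first n pixels."""
    return [p & 1 for p in pixels[:n]]

pixels = [200, 13, 255, 64, 128, 77]
bits = [1, 0, 1, 1, 0, 0]
marked = lsb_embed(pixels, bits)
assert lsb_extract(marked, len(bits)) == bits
# Imperceptibility: every pixel changes by at most one gray level.
assert all(abs(a - b) <= 1 for a, b in zip(pixels, marked))
```

The sketch also shows why LSB fails the robustness requirement: simply clearing the lowest bit of every pixel erases the watermark without visible damage to the image.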
The patchwork algorithm, a digital watermark algorithm introduced by Walter Bender et al. at the MIT Media Lab, is mainly used for anti-counterfeiting of printed bills. A patchwork digital watermark is hidden in a statistical characteristic of a specific image. The patchwork algorithm shows excellent robustness and effectively resists cropping, grayscale correction, lossy compression, etc. Its disadvantages are that it allows only a small amount of data, is sensitive to affine transforms, and is vulnerable to multiple-copy averaging.
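A sketch of the patchwork idea follows, under assumed parameters (pair count, brightness step delta, pixel range) chosen for illustration: a secret key selects pixel pairs; embedding brightens one pixel of each pair and darkens the other, which shifts a pairwise-difference statistic away from its natural value near zero.

```python
import random

# Patchwork sketch: the key regenerates the pixel pairs, so detection
# needs only the key, not the original image. Data are illustrative.

def key_pairs(key, n_pixels, n_pairs):
    """Pseudorandomly select disjoint pixel pairs from the key."""
    rng = random.Random(key)
    idx = rng.sample(range(n_pixels), 2 * n_pairs)
    return list(zip(idx[::2], idx[1::2]))

def embed(pixels, pairs, delta=5):
    """Brighten the first pixel of each pair, darken the second."""
    out = list(pixels)
    for a, b in pairs:
        out[a] = min(255, out[a] + delta)
        out[b] = max(0, out[b] - delta)
    return out

def statistic(pixels, pairs):
    """Mean pairwise difference: near 0 unmarked, near 2*delta marked."""
    return sum(pixels[a] - pixels[b] for a, b in pairs) / len(pairs)

rng = random.Random(1)
pixels = [rng.randint(50, 200) for _ in range(4000)]
pairs = key_pairs(key=7, n_pixels=4000, n_pairs=500)
marked = embed(pixels, pairs)
# Embedding shifts the statistic by exactly 2*delta (no clipping occurs
# here because the sample pixels stay within [45, 205]):
assert abs(statistic(marked, pairs) - statistic(pixels, pairs) - 10) < 1e-9
```

Because each pair contributes only a small brightness change, the mark is imperceptible, yet the shift of the statistic survives global operations such as grayscale correction that affect both pixels of a pair equally.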
Texture block coding, which hides watermark data in the random texture patterns of an image, covers watermark information by utilizing the similarity between texture patterns. The algorithm resists filtering, compression, and distortion, but it needs human operators for the process.
The digital watermark algorithm in the Discrete Cosine Transform (DCT) domain, the most-studied digital watermark algorithm, shows outstanding performance concerning robustness and imperceptibility. The core idea of the algorithm is to superpose watermark information on the intermediate-low frequency coefficients in the DCT domain of an image. The algorithm chooses the intermediate-low frequency coefficients because the human visual system is most sensitive to intermediate and low frequencies: a hacker attempting to destroy the watermark will inevitably degrade the image quality to a great extent, while normal image processing procedures usually leave the data in the intermediate-low frequencies intact. The core of compression algorithms such as JPEG and MPEG includes quantization in the DCT domain; hence, skillful integration of the watermarking and the quantization enables the watermark to resist lossy compression. In addition, a comparatively accurate mathematical model has been developed for the statistical distribution of the DCT domain coefficients, from which the information quantity of a watermark can be estimated theoretically.
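The core idea, superposing watermark values on selected DCT coefficients, can be sketched on a one-dimensional block as follows. The block data, coefficient indices, and strength alpha are illustrative (real schemes work on 8x8 image blocks), and the detector here is non-blind, i.e., it compares against the original:

```python
import math

# DCT-domain watermark sketch: add +/-alpha to chosen intermediate-low
# frequency coefficients, then invert the transform.

def dct(x):
    """Unnormalized 1-D DCT-II."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for n in range(N)) for k in range(N)]

def idct(C):
    """Exact inverse of dct() above (scaled DCT-III)."""
    N = len(C)
    return [(C[0] + 2 * sum(C[k] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                            for k in range(1, N))) / N
            for n in range(N)]

def embed(block, bits, coeff_idx, alpha=8.0):
    """Superpose watermark bits on the selected DCT coefficients."""
    C = dct(block)
    for i, b in zip(coeff_idx, bits):
        C[i] += alpha if b else -alpha
    return idct(C)

def detect(marked, original, coeff_idx):
    """Non-blind detection: the sign of the coefficient change is the bit."""
    Cm, Co = dct(marked), dct(original)
    return [1 if Cm[i] > Co[i] else 0 for i in coeff_idx]

block = [52, 55, 61, 66, 70, 61, 64, 73]   # one sample pixel row
bits = [1, 0, 1]
idx = [2, 3, 4]                            # intermediate-low frequencies
assert detect(embed(block, bits, idx), block, idx) == bits
```

Because the transform is linear, the perturbation of each chosen coefficient is exactly plus or minus alpha, while in the spatial domain the change is spread thinly over all pixels of the block.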
The direct sequence spread spectrum watermark algorithm is an application of spread spectrum communication technology in the digital watermark technique. Unlike conventional narrowband modulation, the spread spectrum technique distributes information across a very wide frequency band after spread spectrum coding modulation, which makes the information pseudorandom. The receiver de-spreads with the corresponding spread spectrum codes to retrieve the original information. The spread spectrum technique effectively resists interference and is highly secure; thus it is widely used in military applications. In fact, the spread spectrum technique can be regarded as a form of radio steganography. From the perspective of human perception rather than information theory, the spread spectrum technique is secure because the information to be transmitted is disguised as channel noise and is thus hard to distinguish. The spread spectrum watermark algorithm, similarly, processes the watermark information through spread spectrum modulation and superposes the modulated information on the original data. In the frequency band, the watermark information is spread across the whole spectrum and cannot be removed with normal filters. To crack the watermark, a large amount of noise must be added in all frequency bands, which undoubtedly damages the quality of the original data to a great extent.
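The modulation and de-spreading can be sketched as follows, with illustrative names and parameters: one watermark bit is spread by a key-dependent sequence of ±1 chips, added weakly to the host signal, and recovered by correlating the received signal with the same chip sequence.

```python
import random

# Direct sequence spread spectrum sketch: the chip sequence is regenerated
# from the secret key, so detection is blind (no original signal needed).

def chips(key, n):
    """Key-dependent pseudorandom +/-1 chip sequence."""
    rng = random.Random(key)
    return [rng.choice((-1, 1)) for _ in range(n)]

def embed(host, bit, key, alpha=5.0):
    """Superpose the spread bit on the host samples with strength alpha."""
    s = 1 if bit else -1
    return [h + alpha * s * c for h, c in zip(host, chips(key, len(host)))]

def detect(received, key):
    """De-spread by correlation; the sign of the correlation is the bit."""
    c = chips(key, len(received))
    corr = sum(r * ci for r, ci in zip(received, c)) / len(received)
    return 1 if corr > 0 else 0

rng = random.Random(0)
host = [rng.uniform(-10.0, 10.0) for _ in range(2000)]
assert detect(embed(host, 1, key=42), key=42) == 1
assert detect(embed(host, 0, key=42), key=42) == 0
```

The host signal contributes only a small noise term to the correlation (it shrinks as the number of samples grows), while the watermark term stays at ±alpha; this is why the mark survives filtering that would remove any narrowband signal.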
There are other transform domain digital watermark algorithms. Digital watermark algorithms in the transform domain are not limited to algorithms in the DCT domain or Fourier transform. All types of signal transform are acceptable as long as the transform hides watermark information well. In recent years, many researchers have tried wavelet transform or other time/frequency analysis to hide digital watermark information in a time/scale domain or time/frequency domain, and have yielded satisfactory results.
The major criteria used for evaluating a digital watermark algorithm include the following items.
Immunity to interference (robustness): The digital watermark technique requires robustness, i.e., a digital watermark should be able to resist attacks from a third party as well as normal and standard data processing and transforms. This means that even when a hacker knows that important information is hidden in the transmitted data, the hacker cannot extract that information or destroy the watermark without seriously damaging the host data. A robustness test includes an active attack process that tests a digital watermark for its dependence on data synchronization, its ability to resist various kinds of linear and nonlinear filtering, and its ability to resist other attacks such as geometrical transforms.
Embedded information quantity: An algorithm should be able to embed enough specific identification information into a limited amount of original data.
Imperceptibility of the information (interference to the original information): Tradeoffs exist between the information quantity and the imperceptibility of a digital watermark. Increasing the information quantity of the watermark will inevitably degrade the quality of the work into which the watermark information is embedded. An imperceptibility test evaluates the information quantity and the perceptibility provided by a digital watermark algorithm and determines the exact relation between the watermark information quantity and the data degradation. Indexes from signal processing, e.g., Signal to Noise Ratio (SNR) and peak SNR, as well as physiological models of the human visual and auditory systems, should be used for evaluating the quality of multimedia data including graphic and audio data; otherwise, the evaluation lacks scientific accuracy. This is one of the basic rules for both digital watermark algorithms and data compression techniques.
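The peak SNR index mentioned above can be sketched for 8-bit data (peak value 255) as follows; the sample values are illustrative:

```python
import math

# PSNR sketch: mean squared error between original and distorted data,
# expressed in decibels relative to the peak signal value.

def psnr(original, distorted, peak=255.0):
    """Peak Signal to Noise Ratio in dB; infinite for identical data."""
    mse = sum((a - b) ** 2 for a, b in zip(original, distorted)) / len(original)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(peak * peak / mse)

clean = [100, 120, 140, 160]
noisy = [101, 119, 141, 159]          # every sample off by 1 -> MSE = 1
assert abs(psnr(clean, noisy) - 48.13) < 0.01
```

A higher PSNR after embedding indicates a less perceptible watermark, which is why such indexes, supplemented by perceptual models, quantify the data-quantity/imperceptibility tradeoff.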
Security: Security testing evaluates the time needed to crack a digital watermark algorithm and the complexity of the cracking process, which are the main indexes of watermark security.
In the typical digital watermark techniques described above, the identification information, i.e., watermark information, is usually embedded through image processing; this suits applications that embed identification information into media files such as images, video, and audio. These techniques regard the files as general streaming media or 2D media, and do not distinguish character information from other information. Frequency domain or time domain transforms are usually adopted to process the images, embedding watermarks by transforming parts of the image information to which human eyes are insensitive, e.g., high-frequency information. The techniques are similar to data compression algorithms such as the JPEG algorithm. Yet conventional digital watermark algorithms take none of the features of specific document types, e.g., electronic documents, into consideration and hence perform poorly in certain fields with regard to immunity to interference. For example, in the transmission of electronic official documents, which are basically binary images without grayscale, conventional digital watermark algorithms create the following two problems:
Quality degradation of the outputted document: Binary images are very sensitive to frequency domain transforms, and electronic official documents, which require high definition, are not suited to full-image transforms.
Watermark information loss: Printed electronic official documents are most likely to be spread by duplication, while digital watermarks based on image detail transforms are very sensitive to the interference introduced by the duplication and scanning processes; watermark information suffers great loss after these processes, and even more loss from other interference generated in the spreading process, such as pollution, cropping, and soaking, which may render the watermarks in the official documents undetectable.
Embodiments of the present invention provide a method for detecting embedded hidden information so that the hidden information can be extracted even when the file with embedded hidden information has been interfered with and transformed several times, e.g., duplicated or photographed with a digital camera.
A method is provided for detecting hidden information, wherein a document for detection is formed by embedding hidden information in an original document through layout transformation of characters in the original document according to a predetermined embedding rule, and the method comprises:
determining layout transformation for each character in the document for detection compared with the original document;
obtaining a code sequence embedded in the document for detection based on the layout transformation of each character in the document for detection and the predetermined embedding rule; and
decoding the code sequence to get the hidden information embedded in the document for detection.
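The three steps can be sketched as follows under a purely hypothetical embedding rule, chosen only for illustration: a character shifted upward by one layout unit encodes 1, an unshifted character encodes 0, and every eight code bits form one hidden byte. All function names, positions, and noise values here are assumptions, not the method's actual rule:

```python
# Hypothetical sketch of the three detection steps; the embedding rule
# assumed here is illustrative only.

def layout_shifts(original_pos, detected_pos):
    """Step 1: determine each character's layout transformation, rounding
    away small positional noise from duplication/scanning/photographing."""
    return [round(d - o) for o, d in zip(original_pos, detected_pos)]

def code_sequence(shifts):
    """Step 2: map layout transformations to the embedded code sequence
    (assumed rule: upward shift -> 1, no shift -> 0)."""
    return [1 if s > 0 else 0 for s in shifts]

def decode(code):
    """Step 3: decode the code sequence (assumed: 8 bits per hidden byte)."""
    text = []
    for i in range(0, len(code) - 7, 8):
        byte = 0
        for bit in code[i:i + 8]:
            byte = (byte << 1) | bit
        text.append(chr(byte))
    return "".join(text)

# A document carrying the byte 'K' (0b01001011) in 8 characters, measured
# with small positional noise as after scanning or photographing:
original = [0.0] * 8
detected = [0.1, 1.2, 0.0, -0.1, 0.9, 0.2, 1.1, 0.8]
assert decode(code_sequence(layout_shifts(original, detected))) == "K"
```

The rounding in step 1 illustrates why layout-based hiding tolerates interference: a measurement error smaller than half a layout unit does not change the recovered code sequence at all.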
The method for embedding and detecting hidden information in an electronic document shows excellent performance in resisting interference and can tolerate common interference including duplication, scanning, rubbing, soaking, blotting, cropping, and photographing with a digital camera.
Although characters are used as examples to illustrate the applications of the above methods in the rest of the text, such methods are also applicable by one of ordinary skill in the art to words, letters, strokes, and other lexical elements present in an electronic document.