In the current environment of computer networks, characterized by exponential growth in the circulation of electronic text documents such as e-mails over unsecured media (e.g., the Internet), a key issue is authentication. The recipient of an electronic text document cannot always be certain of its origin, that is, that the sender is not masquerading as someone else. It is also necessary to verify that the document has not been modified, accidentally or maliciously, during transmission.
Accordingly, methods have been proposed to perform such authentication. The standard solution, which fits well with electronic text documents, consists in adding integrity information in the form of a Message Authentication Code (MAC) to soft-copy text documents. A MAC is a digest computed over the text with a one-way hash function and made dependent on a key, e.g., a secret key known only to the sending system and the receiving system, so that the latter can check, first, that the received document originated from a party sharing the secret key and, second, that the document has not been altered. For example, the Secure Hash Algorithm (SHA), specified by the National Institute of Standards and Technology (NIST) in FIPS PUB 180-1, “Secure Hash Standard”, US Dept. of Commerce, May 93, produces a 160-bit hash. It may be combined with a key, e.g., through the mechanism referred to as HMAC (Keyed-Hashing for Message Authentication), the subject of RFC (Request for Comments) 2104 of the IETF (Internet Engineering Task Force). HMAC is devised so that it can be used with any iterative cryptographic hash function, including SHA. Therefore, a MAC can be appended to the soft copy of a text document so that the whole can be checked by the recipient.
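The appended-MAC scheme described above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function names, the choice of a newline as separator between text and MAC, and the hexadecimal encoding of the digest are all assumptions made for the example. It uses HMAC with SHA-1 (160-bit output) from the Python standard library.

```python
import hashlib
import hmac

# SHA-1 yields a 160-bit digest, i.e. 40 hexadecimal characters.
MAC_HEX_LEN = hashlib.sha1().digest_size * 2


def append_mac(text: str, key: bytes) -> str:
    """Return the text followed by a newline and its HMAC-SHA1 in hex.

    The newline separator and hex encoding are assumptions of this sketch.
    """
    tag = hmac.new(key, text.encode("utf-8"), hashlib.sha1).hexdigest()
    return text + "\n" + tag


def verify_mac(document: str, key: bytes) -> bool:
    """Split off the trailing MAC and recompute it over the remaining text."""
    text, sep, tag = document.rpartition("\n")
    if not sep or len(tag) != MAC_HEX_LEN:
        # No MAC present, e.g. it was stripped on the way to the recipient.
        return False
    expected = hmac.new(key, text.encode("utf-8"), hashlib.sha1).hexdigest()
    # Constant-time comparison of the received and recomputed MACs.
    return hmac.compare_digest(expected.encode("utf-8"), tag.encode("utf-8"))
```

A recipient holding the same key calls `verify_mac` on the received document; the check fails if the text was altered, if a different key was used, or if the appended MAC is missing.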
This method, which adds checking information to a file, has the drawback of separating the text from its checking information. That information can therefore easily be isolated and removed, either intentionally, in an attempt to cheat, or accidentally, simply because intermediate equipment or communication protocols in charge of forwarding electronic documents are not designed to handle this extra piece of information.
The checking information should therefore rather be encoded transparently into the body of the text document itself (i.e., in a manner that does not affect text format or readability whatsoever), so that it remains intact across the various manipulations the document is exposed to on its way to the destination, still enabling the end recipient to authenticate the document.
Another approach to authentication, which applies mainly to soft-copy images (and thus may also be used on the image of a hard-copy text document), consists in hiding data in their digital representation, thereby meeting the above requirement that checking information should be merged into the document itself. Data hiding, a form of steganography that embeds data into digital media for the purposes of identification, annotation, tamper-proofing and copyright protection, has received considerable attention, mainly because of the copyrights attached to digital multimedia materials, which can easily be copied and distributed everywhere through the Internet and networks in general. A good review of data hiding techniques is given in ‘Techniques for data hiding’ by W. Bender et al., published in the IBM Systems Journal, Vol. 35, Nos 3&4, 1996. One illustration of how data hiding may be carried out is the replacement of the least significant luminance bit of image data with the embedded data. This technique, which meets the requirement of being unnoticeable (i.e., the restored image is far from being altered to a point where the change would become noticeable), may serve various purposes similar to authentication, including watermarking, aimed at placing an indelible mark on an image, and tamper-proofing, to detect image alterations, especially through the embedding of a MAC into the soft-copy image. However, treating a text as an image would be a very costly and inadequate solution in terms of the storage and bandwidth necessary to transmit it. Thus, specially adapted methods have been proposed for encoding and hiding data in soft-copy textual documents.
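The least-significant-bit replacement mentioned above can be illustrated with a short sketch. It assumes, for the purpose of the example only, that the image is available as a flat list of 8-bit luminance values and that one bit is embedded per sample, starting from the first sample; the function names are hypothetical.

```python
def embed_bits(luminance, bits):
    """Replace the least significant bit of each luminance sample with one data bit.

    `luminance` is a sequence of 8-bit values (0..255); `bits` is a sequence of 0/1.
    Samples beyond len(bits) are returned unchanged.
    """
    if len(bits) > len(luminance):
        raise ValueError("not enough luminance samples to carry the data")
    marked = list(luminance)
    for i, bit in enumerate(bits):
        # Clear the lowest bit, then set it to the data bit.
        marked[i] = (marked[i] & 0xFE) | (bit & 1)
    return marked


def extract_bits(luminance, count):
    """Read back the first `count` embedded bits from the lowest bit of each sample."""
    return [value & 1 for value in luminance[:count]]
```

Each sample changes by at most one luminance level, which is why the alteration is visually unnoticeable; the embedded bits could, for instance, be the bits of a MAC computed over the image.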
As described in the above article by Bender, text encoding and data hiding methods are either open space methods, which handle white spaces (blanks or spaces); syntactic methods, which utilize punctuation and contractions; semantic methods, which encode data by manipulating the words themselves; or steganographic methods, which encode data by modifying graphical attributes, such as those known as line-shift coding, word-shift coding or feature coding, and which operate by introducing small controlled variations in the spaces between lines, between words, or in the bitmap images of the characters of a text.
Open space methods, which are based on the manipulation of white spaces and, more specifically, of the inter-word blank characters inserted by the originator of a text document, have been considered the simplest and most convenient way of marking a text that is to be authenticated without the addition of a separate MAC, since the information necessary for the check is then embedded, and in effect hidden, in the text itself, where the casual reader is unlikely to notice it. These methods rest on the idea of encoding and hiding information in a text by inserting blanks or, in a broader sense, by modifying the “number of blanks” on subsets (cryptographically selected or not) of the intervals of the original input text.
However, inserting or deleting blanks to encode information in a text has the main drawback of modifying and distorting the format of the original input text. Moreover, acting on the number of blanks in the intervals of a text to encode binary information usually requires assigning one inter-word interval to each encoded bit. Thus, depending on the amount of information to encode, large texts may be required in order to apply those methods.
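A minimal sketch of an open space method makes both drawbacks concrete. The encoding chosen here (one blank for bit 0, two blanks for bit 1, applied to the first intervals of the text in order) is only one possible convention and is an assumption of the example, as are the function names; the sketch also assumes the input text separates words with single blanks only.

```python
import re


def embed_bits_in_spaces(text, bits):
    """Encode one bit per inter-word interval: one blank for 0, two blanks for 1."""
    words = text.split()
    intervals = max(len(words) - 1, 0)
    if len(bits) > intervals:
        # One interval carries a single bit, so capacity is bounded by text length.
        raise ValueError("text has too few inter-word intervals for the data")
    if not words:
        return ""
    pieces = []
    for i, word in enumerate(words[:-1]):
        pieces.append(word)
        extra = 1 if i < len(bits) and bits[i] else 0
        pieces.append(" " * (1 + extra))
    pieces.append(words[-1])
    return "".join(pieces)


def extract_bits_from_spaces(text, count):
    """Recover the first `count` bits from the widths of the inter-word gaps."""
    gaps = re.findall(r" +", text.strip())
    return [1 if len(gap) > 1 else 0 for gap in gaps[:count]]
```

A text of n words offers only n - 1 intervals, so under this convention a 160-bit MAC would need at least 161 words, and every embedded 1 bit visibly widens a gap in the marked text.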
In conclusion, an analysis of all the above-referenced methods for encoding and hiding data in soft-copy texts shows that they share a common characteristic: they encode information by modifying, in one way or another, some visible feature of the original text (e.g., by modifying the number of inter-word spaces, by changing or moving punctuation symbols, by shifting the positions of words or lines, by modifying the form of text fonts, by using alternative words, etc.). Thus, all those encoding and data hiding methods modify the format or the visual appearance of the original input text and are therefore potentially noticeable when the text is edited.