1. Field of the Invention
The present invention relates to watermarking of natural language digital text and more specifically to a system and method for watermarking natural language digital text while retaining a meaning of the original natural language digital text.
2. Introduction
The ability to search and access immense amounts of digital text online has become commonplace. As a result of this ability, owners or authors of the digital text have lost control with respect to how the digital text is distributed or used. A way to restore control to authors or owners over distribution and use of digital text is needed.
In audio or image watermarking, an input signal s(t) is processed to insert a watermark w(t) via a function ŝ(t)=F(s(t), w(t), k), where k is a secret key. The watermarked signal ŝ(t) is such that w(t) becomes either visible/audible or retrievable by applying a function G(ŝ(t), k). The function F( ) is designed such that the modified signal is perceptually equivalent to the original signal.
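As an illustration of the F( ) and G( ) pair described above, the following is a minimal sketch that assumes integer-valued samples and a simple least-significant-bit scheme. The function names `embed` and `extract`, the use of the key as a pseudo-random seed, and the extra `n_bits` argument are assumptions made for this example only. They are not taken from any particular watermarking system.

```python
import random


def embed(signal, watermark_bits, key):
    """Hypothetical F(s, w, k): store each watermark bit in the least
    significant bit of a sample whose position is chosen from the key."""
    if len(watermark_bits) > len(signal):
        raise ValueError("watermark is longer than the signal")
    rng = random.Random(key)  # the key seeds the position selection
    positions = rng.sample(range(len(signal)), len(watermark_bits))
    marked = list(signal)
    for pos, bit in zip(positions, watermark_bits):
        marked[pos] = (marked[pos] & ~1) | bit  # overwrite the LSB only
    return marked


def extract(marked, n_bits, key):
    """Hypothetical G(s_hat, k): regenerate the same positions from the
    key and read the bits back. n_bits is an assumed extra parameter."""
    rng = random.Random(key)
    positions = rng.sample(range(len(marked)), n_bits)
    return [marked[pos] & 1 for pos in positions]


signal = list(range(100, 164))      # stand-in for audio or pixel samples
bits = [1, 0, 1, 1, 0, 0, 1, 0]
marked = embed(signal, bits, key=1234)
recovered = extract(marked, len(bits), key=1234)
```

Each sample changes by at most one quantization step, which stands in here for the requirement that the modified signal be perceptually equivalent to the original. Without the key, the carrying positions are not known to a reader of the marked signal.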
Natural language watermarking poses two research challenges in contrast to audio and image watermarking. First, experimental work on developing models for auditory and visual perception has been successful, whereas automatic semantic text analysis and evaluation is not well developed. Recent progress in machine translation research has led to a first step in addressing adequacy of machine translated text, while other text features such as coherence and fluency are still being studied. Second, the number of bits available to carry a watermark in natural language digital text is smaller than the number available in audio or image watermarking. For example, entropy is less than 2 bits (character level 2) for the English language, whereas it is less than 5 bits for standard images. Attacks such as text cropping can further decrease the bits available for storing the watermark.
The combinatorial nature of natural language creates another challenge for a watermark embedding process. Natural language has a combinatorial syntax and semantics. Operations on natural language constituents (e.g., phrases, sentences, paragraphs) are sensitive to a syntactic/formal structure of representations defined by this combinatorial syntax.
Use of semantics and syntax of text for insertion of a watermark was previously proposed. In that work, binary encodings of words were used to embed information into the text by performing lexical substitution in synonym sets.
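The idea of lexical substitution in synonym sets can be sketched as follows. This is a toy illustration, not the previously proposed method itself: the synonym sets, the one-bit-per-word encoding (the index of a word inside its two-word set), and the function names `embed_bits` and `extract_bits` are all assumptions chosen for the example.

```python
# Hypothetical two-word synonym sets; a word's index in its set is its bit.
SYNSETS = [["big", "large"], ["quick", "fast"], ["begin", "start"]]
INDEX = {word: (s, i)
         for s, synset in enumerate(SYNSETS)
         for i, word in enumerate(synset)}


def embed_bits(words, bits):
    """Replace each substitutable word with the synonym whose index
    equals the next watermark bit; leave all other words unchanged."""
    remaining = list(bits)
    out = []
    for word in words:
        if word in INDEX and remaining:
            synset_id, _ = INDEX[word]
            out.append(SYNSETS[synset_id][remaining.pop(0)])
        else:
            out.append(word)
    if remaining:
        raise ValueError("text has too few substitutable words")
    return out


def extract_bits(words, n_bits):
    """Read the bit encoded by each substitutable word, in text order."""
    return [INDEX[w][1] for w in words if w in INDEX][:n_bits]


cover = "the big dog will start a quick run".split()
marked = embed_bits(cover, [1, 0, 1])
# marked is "the large dog will begin a fast run" as a word list
recovered = extract_bits(marked, 3)
```

Because the bits live directly in the surface words, an attacker who substitutes synonyms again destroys them, which is the kind of weakness that motivates embedding in an intermediate representation instead.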
In later work, two algorithms were proposed for embedding information in a tree structure of the text. The watermark was not directly embedded in the text, as is done in lexical substitution, but was instead embedded into a parsed representation of sentences. The utilization of an intermediate representation makes the algorithms more robust to attacks than systems based on lexical substitution. One of the proposed algorithms modifies syntactic parse trees of cover text sentences for embedding, while the second algorithm uses semantic tree representations. Selection of sentences that will carry the watermark information depends only on the tree structure. Once the sentences that will carry watermark bits are selected, the bits are stored by applying either syntactic or semantic transformations. In that work, semantic transformations were designed to preserve the meaning of the overall text, but did not necessarily preserve the meaning of every individual sentence.