The present invention relates to digital video and, more particularly, to a means of incorporating therein and extracting hidden information for authenticity verification.
The widespread use of digital media for recording information has brought with it a need to be able to verify the authenticity of such records. It is well known that digital media are more susceptible to alteration and manipulation than any previously known medium.
Verification is particularly needed in courts of law, where such records may be tendered as evidence. A mechanism is therefore required to authenticate and verify information and to detect fabrication of, or tampering with evidence. Media tampering refers to any manipulation of media that modifies its content, e.g. image blurring or cropping, and frame eliminating or reordering.
The present invention is concerned with recorded video from a variety of systems, such as security CCTV.
An example of such a system is the NICE-Vision® video recording system (NICE Systems Ltd., Ra'anana, Israel), which performs compression of analog video channels and digitally saves the compressed data (in accordance with the H.263+ standard) on disks that can be accessed and played back, as required.
Digital Watermarks
A watermark is an identifying piece of information (an author's signature, a company logo, etc) embedded into a medium (image, audio, video, etc).
Most prior art deals with digital watermarking, the incorporation of robust identifying information in a digital message or file that enables identification of the source of that message or file. A digital watermark is intended to maintain its identifiability, regardless of subsequent processing of the message data, and to be robust enough to survive at least to the point where the message, itself, becomes unusable. Digital watermarks are normally intended for copyright protection, whereby it is difficult for an attacker to remove or destroy the watermark without damaging the audio-visual content, even if the existence of the watermark or the watermarking method is known.
This is not the same as protection against media content modification, for which the requirements are different, and may even be contrary. Thus, it is desirable that any tampering with content alter the digital signature and thereby betray the tampering. Nevertheless, the art of digital watermarking can contribute useful concepts and techniques, such as finding suitable locations for hiding information.
Most approaches to media authentication are based on building a content-based digital signature, often called fragile watermarking. A requirement of fragile watermarking is that it be sensitive to alteration of the media. The problem is what to embed and to find suitable places to embed the watermark while maintaining low complexity and near-zero artifacts.
Various techniques used in watermarking for digital images and video are discussed by Raymond B. Wolfgang, Christine I. Podilchuk, and Edward J. Delp in Perceptual watermarks for digital images and video (Proceedings of the IEEE, vol. 87, no. 7, July 1999). This article reviews recent developments in digital watermarking of images and video, where the watermarking schemes are designed to exploit properties of the human visual system to provide a transparent watermark. It is noted therein that watermarks inserted into the high (spatial) frequency parts of a picture are most vulnerable to attack, whereas watermarks in low-frequency areas are perceptually significant and sensitive to alterations. The article indicates important issues that must be taken into account when watermarking video sequences, such as frame shuffling, dependency between adjacent frames, etc.
Frank Hartung and Bernd Girod discuss embedding of digital watermarking in MPEG-2 encoded video in the bit-stream domain (Digital watermarking of MPEG-2 coded video in the bit-stream domain, in Proc. Int. Conference on Acoustics, Speech, and Signal Processing vol. 4, pp 2621-2624, Munich, April 1997, which is incorporated by reference for all purposes as if fully set forth herein). Given an MPEG-2 bit-stream, the variable-length code (VLC) words representing Discrete Cosine Transform (DCT) coefficients are replaced by VLC code words that contain the watermark. The complexity is thereby much lower than the complexity of decoding, watermarking in the pixel domain, and re-encoding.
Thorbjorn Vynne and Frederic Jordan discuss embedding of a digital signature in a digital video stream for watermarking purposes (Embedding a digital signature in a video sequence, U.S. Pat. No. 5,960,081, which is incorporated by reference for all purposes as if fully set forth herein), by embedding into the x- and y-coordinates of motion vectors. The method includes hybrid selection criteria to avoid objectionable visible artifacts and a method of avoiding problems that arise when fewer than 16 suitable picture blocks and/or vectors are available in a frame to embed the 32 bits of the signature. The system described was implemented on a CRAY T3D massively parallel supercomputer, where a near-real-time (5 frames per second) embedding of the signature was obtainable.
Overview of Video Compression
Video compression reduces the amount of data needed to represent a video sequence so as to enable faster and cheaper transmission through communication links as well as more efficient storage.
Video compression techniques achieve compression by taking advantage of statistical redundancies in video data, including:
        Psycho-visual redundancy: reduced by color component interleaving;
        Inter-frame temporal redundancy: reduced by motion compensation;
        Intra-frame spatial redundancy: reduced by DCT transform and predictive coding; and
        Coding redundancy: reduced by entropy coding.
Some specific techniques for reducing redundancy are discussed below.
H.263+ Video Coding Standard
International standards for video compression include block-based compression standards such as MPEG-2 and H.263+, the standard used in the present invention. Generally, a specific standard can be applied using various algorithms. These compression standards are part of a wider grouping of transform-based compression standards. Other standards include the other MPEG-family embodiments as well as H.261 and other H.263-family embodiments.
The TMN-8 H.263+ video codec (University of British Columbia, Canada) is the preferred video compression method used in the present invention. This should not be taken to restrict the scope of the current invention.
ITU-T H.263+ (H.263+ in brief) is a low-bit-rate, video-coding standard used in applications, like video telephony and video conferencing, to provide adequate picture quality where communications channels limit transmission rates.
The description presented explicitly here suffices to provide an enabling disclosure of the present invention. Additional information about H.263+ may be found in: G. Cote, B. Erol, M. Gallant, and F. Kossentini, H.263+: Video coding at low bit rates, IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, no. 7, November 1998, and in ITU-T H.263 Recommendation, Video coding for low bit rate communication, Geneva, March 1996, both of which are incorporated by reference for all purposes as if fully set forth herein.
Visual information contained in a picture frame is represented at any point in the spatial domain by one luminance component, Y, and two chrominance components, Cb and Cr. The luminance component of a picture is sampled at a specific resolution, specified by H.263+, while the chrominance components are relatively down-sampled by a factor of two in both horizontal and vertical directions. FIG. 1 depicts the spatial relationship of luminance and chrominance components (each chrominance dot represents two values, Cb and Cr) in H.263+. It is seen that chrominance components are interleaved with the luminance components. Using one common Cb sample and one common Cr sample for every four Y samples, in this way, reduces psycho-visual redundancy.
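The saving obtained by sharing one Cb and one Cr sample among four Y samples can be illustrated with a minimal sketch. The averaging filter and the function names below are assumptions made for illustration only; the text above does not state how the chrominance samples are derived, and even frame dimensions are assumed.

```python
def subsample_chroma(plane):
    """Halve a chroma plane in both directions by averaging each 2x2 group.

    `plane` is a list of rows of integer samples with even width and height.
    Averaging is an illustrative choice, not a filter taken from the text.
    """
    height, width = len(plane), len(plane[0])
    return [
        [
            (plane[y][x] + plane[y][x + 1]
             + plane[y + 1][x] + plane[y + 1][x + 1]) // 4
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]


def samples_per_frame(width, height):
    """Return (luminance samples, total Cb + Cr samples) for one frame."""
    luma = width * height
    chroma = 2 * (width // 2) * (height // 2)  # one Cb and one Cr plane
    return luma, chroma


if __name__ == "__main__":
    print(subsample_chroma([[10, 20], [30, 40]]))
    print(samples_per_frame(352, 288))
```

For any even frame size, the two chrominance planes together hold half as many samples as the luminance plane, so the frame needs 1.5 samples per pixel rather than 3.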
Pixels of a digital video frame may be conveniently grouped into segments containing a plurality of pixels. Tracking segments between frames can considerably reduce calculation when members of a segment move together, so that all that is needed is to define a segment and a single motion vector that shows how the segment has moved between successive frames. An Inter segment is a segment, the location whereof is predicted from a previous frame; an Intra segment is a segment that is not so predicted.
In H.263+, each frame of an input video sequence is divided into macroblocks (the segments for this system), each consisting of four luminance (Y) blocks followed by a Cb block and a Cr block. Each block consists of 8 pixels×8 lines, as illustrated in FIG. 2.
The H.263+ standard supports inter-frame prediction based on motion estimation and compensation. Two coding modes are applied in the coding process:
        Intra mode: wherein a frame is encoded without regard to any preceding frame. Frames encoded in intra mode are called I-frames. The first frame in any sequence is encoded in intra mode and is called an Intra frame.
        Inter mode: wherein predicted motion is employed to derive a succeeding frame from a preceding frame. Only prediction error frames are encoded, i.e. the difference between an actual frame and the predicted frame thereof. Frames that are encoded in inter mode are called P-frames. Inter blocks and Inter macroblocks are respectively blocks and macroblocks having a position thereof so predicted. A P-frame may also include Intra macroblocks, which are encoded the same as a macroblock in an I-frame.
A block-diagram representation of a typical H.263+ encoder is shown in FIG. 3.
The first operation compares an incoming frame with an immediately preceding frame by subtracting (30 in FIG. 3) the latter from the former so that unchanged areas of the picture need not be encoded again, thereby saving bandwidth.
Motion Estimation and Compensation
Motion prediction is used to minimize temporal redundancy. A new current frame is predicted from an immediately preceding frame by estimating where moving areas have moved to (motion estimation) and allowing for this movement (motion compensation). Each macroblock in a current frame is compared with a shifted macroblock from the previous frame to find the best match. The shift size is restricted to a predefined search area, called a search window. After finding the best match (the most similar macroblock), a motion vector of two components is all that is needed to represent the macroblock's displacement from the previous frame.
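The search for the best match within a search window can be sketched as an exhaustive block-matching loop. The similarity measure (sum of absolute differences), the window of ±7 pixels, and all names below are assumptions chosen for illustration; the text above does not say which measure or search strategy a given encoder uses.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between a block of `cur` and one of `ref`."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(cur[cy + dy][cx + dx] - ref[ry + dy][rx + dx])
    return total


def best_motion_vector(cur, ref, cx, cy, size=16, search=7):
    """Exhaustively search `ref` around (cx, cy) for the best-matching block.

    Returns ((mx, my), cost), where (mx, my) is the displacement from the
    block position in the current frame to the matching area in `ref`.
    """
    height, width = len(ref), len(ref[0])
    best = (0, 0)
    best_cost = sad(cur, ref, cx, cy, cx, cy, size)
    for my in range(-search, search + 1):
        for mx in range(-search, search + 1):
            rx, ry = cx + mx, cy + my
            # Skip candidates that fall partly outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if cost < best_cost:
                best, best_cost = (mx, my), cost
    return best, best_cost
```

An exhaustive search is the simplest strategy to read but also the most expensive; practical encoders often use faster, approximate searches.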
Frequency Domain Transform
The H.263+ encoder transforms pictures to a ‘spatial frequency’ domain by means of a Discrete Cosine Transform (DCT), in DCT module 32. The purpose is to minimize spatial redundancy by representing each 8×8 block by as few coefficients as possible. The DCT is particularly good at compacting the energy in a block of values into a small number of coefficients so that relatively few DCT coefficients are required to recreate a recognizable copy of the original block of pixels. For example, a blank homogeneous background can be represented by a single coefficient, the DC coefficient, whereas in the spatial domain, where each pixel is represented separately, the representation is clearly far less compact. The DCT is simple, efficient, and amenable to software and hardware implementation.
The DCT for an 8×8 block is defined by:
$$C_{m,n} = \alpha(m)\,\beta(n) \sum_{i=0}^{7} \sum_{j=0}^{7} B_{i,j} \cos\left(\frac{\pi(2i+1)m}{16}\right) \cos\left(\frac{\pi(2j+1)n}{16}\right), \qquad 0 \le m, n \le 7$$

where $\alpha(0) = \beta(0) = \sqrt{1/8}$, and $\alpha(m) = \beta(n) = \sqrt{1/4}$ for $1 \le m, n \le 7$. $B_{i,j}$ denotes the $(i,j)$th pixel in the 8×8 block, with $0 \le i, j \le 7$, and $C_{m,n}$ denotes the coefficient of the transformed block.
The inverse DCT (IDCT) for an 8×8 block is given by:
$$B_{i,j} = \sum_{m=0}^{7} \sum_{n=0}^{7} \alpha(m)\,\beta(n)\, C_{m,n} \cos\left(\frac{\pi(2i+1)m}{16}\right) \cos\left(\frac{\pi(2j+1)n}{16}\right), \qquad 0 \le i, j \le 7.$$
The DCT and IDCT are lossless, i.e. no information is lost when they are computed with perfect accuracy. In H.263+, however, the coefficients are quantized in quantizer module 33, i.e. stored as integers by truncating the non-integer part of each. Some information is thereby lost, which causes differences between the original and reconstructed data.
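The lossless nature of the unquantized transform pair can be checked numerically with a direct, unoptimized evaluation of the 8×8 DCT and IDCT. This is a minimal sketch for illustration, with indices running from 0 to 7 and the scale factors √(1/8) and √(1/4) given above; the function names are hypothetical, and real codecs use fast factorized implementations.

```python
import math

N = 8


def scale(k):
    """Normalization factor: sqrt(1/8) for index 0, sqrt(1/4) otherwise."""
    return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)


def basis(x, k):
    """Cosine basis value for spatial index x and frequency index k."""
    return math.cos(math.pi * (2 * x + 1) * k / (2 * N))


def dct2(block):
    """Forward 8x8 DCT of a list-of-rows block of pixel values."""
    return [
        [
            scale(m) * scale(n) * sum(
                block[i][j] * basis(i, m) * basis(j, n)
                for i in range(N) for j in range(N)
            )
            for n in range(N)
        ]
        for m in range(N)
    ]


def idct2(coeffs):
    """Inverse 8x8 DCT, recovering pixel values from coefficients."""
    return [
        [
            sum(
                scale(m) * scale(n) * coeffs[m][n] * basis(i, m) * basis(j, n)
                for m in range(N) for n in range(N)
            )
            for j in range(N)
        ]
        for i in range(N)
    ]
```

Applying `idct2(dct2(block))` returns the original block up to floating-point rounding, and a uniform block produces a single non-zero coefficient at position (0, 0).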
The first coefficient in a block of DCT coefficients is the DC coefficient, which is proportional to the average value of the pixels within the block. The other coefficients in the block (AC coefficients) represent the various 2D spatial frequencies. Since adjacent pixels usually carry values close to one another, it is to be expected that, in intra frames, the high-frequency coefficients will contain lower energy than low-frequency coefficients.
The advantage of the DCT over other frequency transforms is that the resultant matrix contains only real numbers, whereas other transforms (such as the Fast Fourier Transform) normally produce complex numbers. In addition to the simplicity of the DCT, it is efficient in implementation, both in software and in hardware.
Quantization and Inverse Quantization
The number of bits needed to represent visual information can be reduced by quantization. In H.263+, an irreversible function is applied in quantizer module 33, that provides the same output value for a range of input values. For a typical block of pixels, most of the coefficients produced by the DCT are close to zero. Quantizer module 33 reduces the precision of each DCT coefficient so that near-zero coefficients are set to zero and only a few significant non-zero coefficients are left. This is done in practice by dividing each coefficient by an integer scale factor and truncating the result. It is important to realize that the quantizer “throws away” information because coefficients that become zero through quantization will remain zero upon inverse quantization; therefore the compression is lossy. In H.263+, a single quantization value is used within a macroblock.
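The divide-and-truncate operation, and the reason it cannot be undone, can be sketched as follows. This is a simplified illustration of the description above, assuming a single integer scale factor per block; it does not reproduce the exact quantizer rules of the H.263+ standard, and the function names are hypothetical.

```python
def quantize(coeffs, scale_factor):
    """Divide each coefficient by the scale factor and truncate toward zero."""
    return [[int(c / scale_factor) for c in row] for row in coeffs]


def dequantize(levels, scale_factor):
    """Multiply each quantized level back by the same scale factor."""
    return [[level * scale_factor for level in row] for row in levels]


if __name__ == "__main__":
    coeffs = [[100.0, 7.9, -7.9, 3.0]]
    levels = quantize(coeffs, 8)
    print(levels)                  # near-zero coefficients collapse to 0
    print(dequantize(levels, 8))   # the zeros stay zero after rescaling
```

Every input between 96 and just under 104 maps to the same level, 12, so the inverse step can only return 96; the remainder discarded by truncation is the information lost.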
After inverse quantization in inverse quantizer module 34, and a subsequent IDCT process in inverse DCT module 36, the encoder holds a reconstructed frame in a memory 38 and the prediction process ensues.
Entropy Coding
Entropy coding encodes a given set of symbols with the minimum number of bits required to represent them. A priori statistics are used to allocate shorter code words to coefficients and motion vectors that have a higher probability of occurrence, and longer code words to infrequently occurring values. For example, the zero-motion vector (0,0) is coded as a one-bit word, since it is very likely to appear. This increases coding efficiency and provides lossless compression, as the decompression process regenerates the data completely.
Before applying entropy coding, the quantized DCT coefficients of each block are rearranged from an 8×8 matrix into a one-dimensional array. In H.263+ among others, this is done by scanning the matrix diagonally in zig-zag fashion, as shown in FIG. 4. This rearranges the coefficients according to spatial frequency, from lowest frequency (DC) to highest. The array is encoded using run-length coding (RLC) triplets: (LAST, RUN, LEVEL), each triplet being known as an RLC event. The symbol RUN is defined as the distance between two non-zero coefficients in the array, i.e. the number of zeroes that precede a non-zero coefficient. The symbol LEVEL is the value of a non-zero coefficient that follows a sequence of zeroes. If LAST=1, the current RLC event corresponds to the last coefficient of the current block.
Rearranging the coefficients in zig-zag order achieves greater compactness when representing the coefficients as RLC events. In Intra frames it is obvious, since most of the energy is found at low spatial frequencies, that arranging the coefficients in zig-zag order produces longer sequences of zeroes, which decreases the number of RLC events, thereby achieving better compression.
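The zig-zag scan and the formation of RLC events can be sketched as below. The scan order is the conventional diagonal zig-zag and is assumed to correspond to FIG. 4; RUN is taken as the count of zeroes preceding each non-zero coefficient, and any special handling of the DC coefficient is ignored. All names are hypothetical.

```python
def zigzag_order(n=8):
    """Return (row, col) positions of an n x n block in zig-zag scan order."""
    positions = [(r, c) for r in range(n) for c in range(n)]
    # Walk anti-diagonals in turn; alternate the direction on each diagonal.
    return sorted(
        positions,
        key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else -p[0]),
    )


def zigzag_scan(block):
    """Flatten a square block into a 1-D list, lowest frequency first."""
    return [block[r][c] for r, c in zigzag_order(len(block))]


def rlc_events(scanned):
    """Convert a scanned coefficient list into (LAST, RUN, LEVEL) triplets."""
    nonzero = [k for k, value in enumerate(scanned) if value != 0]
    events = []
    previous = -1
    for k in nonzero:
        last = 1 if k == nonzero[-1] else 0
        run = k - previous - 1  # zeroes between this and the previous event
        events.append((last, run, scanned[k]))
        previous = k
    return events
```

For a block whose only non-zero coefficients are 12 at (0, 0), 3 at (0, 1), and -1 at (2, 0), the 64 coefficients reduce to three events, and the trailing zeroes need no representation at all.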
H.263+ Decoding
A standard H.263+ decoder is essentially the inverse of an H.263+ encoder, and is illustrated in FIG. 3. In brief, the main functions are:
Entropy Decoding
The variable-length codes that make up the H.263 bitstream are decoded 301 in order to extract the coefficient values and motion-vector information.
Inverse Quantization
This reverses 302 the quantization performed in the encoder. The coefficients are multiplied by the same scaling factor that was used in quantizer 33 but, because quantizer 33 discarded the fractional remainder, the restored coefficients are not identical to the original coefficients, and this accounts for the lossiness of the process.
Inverse Discrete Cosine Transform
Inverse Discrete Cosine Transform (IDCT) 303 reverses DCT operation 32 to create a block of samples that typically correspond to the difference values that were produced by motion compensator 38 in the encoder.
Motion Compensation
The difference values are added to a reconstructed area from the previous frame to compensate for those macroblocks that have moved since the previous frame 305 and other changes, such as light intensity and color, 304. The motion vector information is used to pick the correct area (the same reference area that was used in the encoder). The result is a reconstruction of the original frame that, as already noted, will not be identical to the original because of the “lossy” quantization stage, i.e. image quality will be poorer than the original. The reconstructed frame is placed in a frame store 306 and it is used to motion-compensate the next received frame.
Data Encryption Standard
Among the various possible encryption algorithms, the Data Encryption Standard (DES) specifies one of the most widely used encryption systems. The standard provides a mathematical algorithm for encryption and decryption of blocks of data consisting of 64 bits under control of a 56-bit key. (Actually, the key consists of 64 binary digits of which 56 bits are randomly generated and used directly by the algorithm. The remaining 8 bits, which are not used by the algorithm, are used for error detection.)
Only the properties and interface of the algorithm are discussed here. A complete description may be found in Data Encryption Standard (DES), Federal Information Processing Standards, Publication 46-2, December 1993, which is incorporated by reference for all purposes as if fully set forth herein.
The encryption and decryption processes are almost identical except for using an altered schedule for addressing the bits in the key. Decryption may be accomplished only by using the same key as used for encryption. Both the encryption and decryption processes feature input and output block sizes of 64-bit words. The key size, in each case, is 56 bits, extracted from a 64-bit word.
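The relationship between the 64-bit key word and the 56 bits actually used can be illustrated with a short sketch. It assumes the usual DES convention, which is not spelled out above, that the least significant bit of each key byte is the error-detection bit and is set so that each byte has odd parity. The function names are hypothetical.

```python
def has_odd_parity(key: bytes) -> bool:
    """Check that each of the 8 key bytes contains an odd number of 1 bits."""
    if len(key) != 8:
        raise ValueError("a DES key word is 8 bytes (64 bits)")
    return all(bin(byte).count("1") % 2 == 1 for byte in key)


def effective_key_bits(key: bytes) -> int:
    """Extract the 56 key bits by dropping the parity bit of each byte."""
    if len(key) != 8:
        raise ValueError("a DES key word is 8 bytes (64 bits)")
    value = 0
    for byte in key:
        value = (value << 7) | (byte >> 1)  # keep the upper 7 bits
    return value


if __name__ == "__main__":
    key = bytes.fromhex("0123456789ABCDEF")
    print(has_odd_parity(key), effective_key_bits(key).bit_length() <= 56)
```

A single flipped bit in any key byte changes that byte's parity, which is how the 8 unused bits support error detection.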
DES properties include:
        Uniqueness of ciphers for a given key: encryption of a set of input words with a different key produces a different set of ciphers;
        Key secrecy (a basic condition for strong and reliable protection): given a set of plain text and the corresponding cipher thereof, up to 2^56 (i.e. ≈72×10^15) keys may, in theory, have to be searched to discover the correct key; and
        Efficiency and simplicity: the DES algorithm is simple and easy to implement because it requires only basic calculations, like XOR operations, shifting numbers, and accessing small, pre-known tables.
CBC Operation Mode of DES
There are several operation modes for the DES algorithm. The present invention preferably uses only one of them, the cipher block chaining (CBC) mode. In this mode, each encryption operation depends on the immediately preceding block. Before a block is encrypted, it is XOR-ed with the encrypted version of the previous block. This mode is applicable when encrypting a long data sequence into a single cipher word. The CBC operation mode is illustrated in FIG. 5.
A first block B1, which consists of 64 bits, is encrypted using DES with a key, denoted by K1. The resultant output, C1, is XOR-ed (⊕) with the next data block, B2. The XOR-ed word is DES encrypted with key K2, and so on. At the end of the process, a cipher block of 64 bits, Cn, is obtained.
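The chaining structure described above can be sketched as follows. Because DES is not available in the Python standard library, the sketch substitutes a keyed 64-bit hash (`hashlib.blake2b`) for the DES encryption step; this stand-in is an assumption made purely so that the example runs, it is not DES and is not invertible. The function names and the sample blocks and keys are hypothetical.

```python
import hashlib

BLOCK_BYTES = 8  # 64-bit blocks, as in DES


def toy_block_encrypt(block: bytes, key: bytes) -> bytes:
    """Stand-in for DES encryption: a keyed 64-bit function, NOT a cipher."""
    return hashlib.blake2b(block, key=key, digest_size=BLOCK_BYTES).digest()


def cbc_chain(blocks, keys):
    """Chain the blocks as in CBC mode and return the final 64-bit block Cn.

    The first block is encrypted directly; every later block is XOR-ed with
    the previous output before being encrypted with its own key.
    """
    if len(blocks) != len(keys):
        raise ValueError("one key per block is expected")
    previous = bytes(BLOCK_BYTES)  # all zeros, so B1 is encrypted unchanged
    for block, key in zip(blocks, keys):
        if len(block) != BLOCK_BYTES:
            raise ValueError("each block must be 64 bits")
        mixed = bytes(a ^ b for a, b in zip(block, previous))
        previous = toy_block_encrypt(mixed, key)
    return previous


if __name__ == "__main__":
    blocks = [b"frame001", b"frame002", b"frame003"]
    keys = [b"key-0001", b"key-0002", b"key-0003"]
    print(cbc_chain(blocks, keys).hex())
```

Because each output feeds into the next step, a change to any block alters the final 64-bit word, which is the property that makes the chained result useful as a compact summary of a long data sequence.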
LSB Coding
Consideration must be given to where and how, in a frame, a digital signature should be embedded. The Least Significant Bit (LSB) method takes a given binary number and overwrites its least significant bit with a single bit of signature data: 0 or 1. For example, the number eight is 1000 in binary notation; writing 1 into the LSB yields 1001 (=9) while writing 0 preserves the original value 1000 (=8). Extracting the embedded information is straightforward since the LSB carries an embedded bit without any distortions.
Depending upon the embedded value, embedding information in the LSB might involve loss of original information in the LSB. If the embedded bit has the same value as the LSB of the original number, no error is caused since the original value of the number is preserved; if the respective bits differ, then some original information is lost, irretrievably. Therefore, in general, there is no way of exactly reconstructing the original information.
The advantage of embedding in the LSB is that minimal error is caused thereby, as compared with embedding into more significant bits. Moreover, as the absolute value of an original number increases, the proportional error decreases. Therefore, it is preferable to embed into numbers of high absolute value rather than numbers with low absolute value. In practical terms, the visibility of a digital signature to the naked eye is reduced as the proportional error is reduced.
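The LSB operations and the proportional error discussed above can be sketched in a few lines. The function names are hypothetical, and the proportional-error helper assumes a non-zero original value.

```python
def embed_lsb(value: int, bit: int) -> int:
    """Overwrite the least significant bit of `value` with `bit` (0 or 1)."""
    if bit not in (0, 1):
        raise ValueError("bit must be 0 or 1")
    return (value & ~1) | bit


def extract_lsb(value: int) -> int:
    """Read back the embedded bit."""
    return value & 1


def proportional_error(original: int, embedded: int) -> float:
    """Relative change caused by embedding; `original` must be non-zero."""
    return abs(embedded - original) / abs(original)


if __name__ == "__main__":
    for original in (8, 100):
        marked = embed_lsb(original, 1)
        print(original, marked, extract_lsb(marked),
              proportional_error(original, marked))
```

Embedding a 1 into 8 changes the value by 12.5%, whereas embedding a 1 into 100 changes it by only 1%, which illustrates why numbers of high absolute value are the preferred carriers.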
Summary
As seen above, various attempts have been made to embed signatures into digital video. There is thus a widely recognized need for, and it would be highly advantageous to have, a means of verifying the authenticity and integrity of digital media.