Hash functions are widely used in cryptography, where the main purpose is to check the integrity of the data. Since the resulting hash value is highly sensitive to every single bit of the input, these functions are extremely fragile and cannot be adopted for hashing multimedia data. In multimedia hashing, it is more important to be sensitive to the content than to the exact binary representation. For instance, a raw video, its compressed version, its low-pass filtered version, a version with increased brightness, and a version with decreased contrast should all yield the same hash value, since their content is essentially the same even though their binary forms are very different. An alternative way to compute the hash is therefore needed for multimedia applications, one in which the hash function produces the same output value unless the underlying content is significantly changed.
Such a hash function is known as a robust or perceptual hash function. Some of the applications of perceptual video hashing include the following: (1) automatic video clip identification in a video database or in broadcasting; (2) online search in a streaming video; (3) authentication of the video content; and (4) content-based watermarking. The two desired properties of hash functions for multimedia data are robustness and uniqueness. Robustness implies that the hash function should be insensitive to perturbations, that is, non-malicious modifications caused by “mild” signal processing operations that in total do not change the content of the video sequence. These modifications can be caused by the user, such as MPEG compression or contrast enhancement, or they can occur during storage and transmission, such as transcoding or packet drops. The uniqueness property implies that the hash values are statistically independent for different content, so that any two distinct video clips result in different and apparently random hash values. See also Baris Coskun and Bulent Sankur, Spatio-Temporal Transform Based Video Hashing, IEEE Transactions on Multimedia, Vol. 8, No. 6 (December 2006).
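The contrast between a cryptographic hash and a perceptual hash can be illustrated with a minimal sketch. The block-mean hash below is a deliberately simple stand-in chosen for illustration; it is not the spatio-temporal transform method of the cited paper, and the function names (block_mean_hash, hamming, sha256_of), the 8x8 grid, and the synthetic 32x32 frames are all assumptions. The sketch compares SHA-256 with the block-mean hash on a frame, a brightened copy, a reduced-contrast copy, and an unrelated frame.

```python
import hashlib
import random

def block_mean_hash(frame, grid=8):
    """Hypothetical perceptual hash: one bit per block, set when the
    block's mean luminance exceeds the median of all block means."""
    height, width = len(frame), len(frame[0])
    bh, bw = height // grid, width // grid
    means = []
    for by in range(grid):
        for bx in range(grid):
            total = sum(frame[y][x]
                        for y in range(by * bh, (by + 1) * bh)
                        for x in range(bx * bw, (bx + 1) * bw))
            means.append(total / (bh * bw))
    ordered = sorted(means)
    mid = len(ordered) // 2
    median = (ordered[mid - 1] + ordered[mid]) / 2
    return [1 if m > median else 0 for m in means]

def hamming(a, b):
    """Number of bit positions in which two hashes differ."""
    return sum(x != y for x, y in zip(a, b))

def sha256_of(frame):
    """Cryptographic hash of the exact 8-bit pixel values."""
    return hashlib.sha256(bytes(v for row in frame for v in row)).hexdigest()

# Synthetic luminance frames (assumed data, not real video).
rng = random.Random(0)
original = [[rng.randint(0, 200) for _ in range(32)] for _ in range(32)]
brighter = [[v + 20 for v in row] for row in original]
lower_contrast = [[64 + 0.5 * v for v in row] for row in original]
other = [[rng.randint(0, 200) for _ in range(32)] for _ in range(32)]

h_original = block_mean_hash(original)

# The cryptographic digests of the original and brightened frames differ.
print(sha256_of(original) == sha256_of(brighter))
# The perceptual hash is unchanged by the brightness and contrast changes.
print(hamming(h_original, block_mean_hash(brighter)))
print(hamming(h_original, block_mean_hash(lower_contrast)))
# Unrelated content yields a hash that differs in many bit positions.
print(hamming(h_original, block_mean_hash(other)))
```

Thresholding against the median makes this toy hash invariant to any brightness offset or positive contrast scaling, which covers only a small part of the robustness described above; compression, filtering, and geometric changes would call for a more elaborate design.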
Image registration is the process of estimating a mapping between two or more images of the same scene taken at different times, from different viewpoints, and/or by different sensors. It geometrically aligns two images—the reference image and the so-called “matching” image. Generally, there are two categories of image differences that need to be registered. Differences in the first category are due to changes in camera position and pose. These sorts of changes cause the images to be spatially misaligned, i.e., the images have relative translation, rotation, scale, and other geometric transformations in relation to each other. This category of difference is sometimes referred to as global transformation or global camera motion (GCM).
The second category of differences cannot be modeled by a parametric spatial transform alone. This category of differences can be attributed to factors such as object movements, scene changes, lighting changes, the use of different types of sensors, or the use of similar sensors with different sensor parameters. This second category of differences is sometimes referred to as independent object motion or local object motion (LOM). Such differences might not be fully removed by registration because LOM rarely conforms to an exact parametric geometric transform. In addition, the innovation that occurs in video frames in the form of occlusion and newly exposed areas cannot be described using any predictive model. In general, the more LOM- or innovation-type differences there are, the more difficult it is to achieve accurate registration. See Zhong Zhang and Rick S. Blum, A Hybrid Image Registration Technique for a Digital Camera Image Fusion Application, Information Fusion 2 (2001), pp. 135-149.
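The two categories of differences can be illustrated with a minimal sketch, assuming the simplest possible global transformation (a pure integer translation) and an exhaustive search over candidate shifts. The function names (shift_error, estimate_translation), the synthetic 24x24 images, and the search range are hypothetical choices for illustration, not the hybrid technique of the cited paper. The first part recovers the global shift exactly; the second part overwrites a small patch to mimic local object motion and shows that a residual remains after registration.

```python
import random

def shift_error(reference, target, dx, dy):
    """Mean squared difference between target(x, y) and
    reference(x - dx, y - dy), taken over the overlapping region."""
    height, width = len(reference), len(reference[0])
    total, count = 0, 0
    for y in range(height):
        for x in range(width):
            ry, rx = y - dy, x - dx
            if 0 <= ry < height and 0 <= rx < width:
                diff = target[y][x] - reference[ry][rx]
                total += diff * diff
                count += 1
    return total / count

def estimate_translation(reference, target, max_shift=3):
    """Exhaustively search integer shifts and return the best (dx, dy)."""
    candidates = [(dx, dy)
                  for dy in range(-max_shift, max_shift + 1)
                  for dx in range(-max_shift, max_shift + 1)]
    return min(candidates,
               key=lambda s: shift_error(reference, target, s[0], s[1]))

# Synthetic reference image (assumed data).
rng = random.Random(1)
size = 24
reference = [[rng.randint(0, 200) for _ in range(size)] for _ in range(size)]

# Global camera motion: shift the scene by (dx, dy) = (2, -1).
true_dx, true_dy = 2, -1
shifted = [[reference[y - true_dy][x - true_dx]
            if 0 <= y - true_dy < size and 0 <= x - true_dx < size else 0
            for x in range(size)] for y in range(size)]

estimate = estimate_translation(reference, shifted)
residual_global = shift_error(reference, shifted, *estimate)

# Local object motion: a small patch changes independently of the camera.
with_object = [row[:] for row in shifted]
for y in range(10, 14):
    for x in range(10, 14):
        with_object[y][x] = 255

estimate_lom = estimate_translation(reference, with_object)
residual_lom = shift_error(reference, with_object, *estimate_lom)

print(estimate, residual_global)
print(estimate_lom, residual_lom)
```

In this toy case the global shift is still estimated correctly when the patch is present, but the error at the best shift is no longer zero, which is the sense in which LOM-type differences are not removed by a parametric registration.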
Parametric coordinate transformation algorithms for registration assume that objects remain stationary while the camera or the camera lens moves; this includes transformations such as pan, rotation, tilt, and zoom. If a video sequence contains a global transformation between frames, the estimated motion field can be highly accurate due to the large ratio of observed image pixels to unknown motion model parameters. A parametric model that is sometimes used to estimate the global transformation that occurs in the real world is the eight-parameter projective model, which can precisely describe camera motion in terms of translation, rotation, zoom, and tilt. To estimate independent object motion, Horn-Schunck optical flow estimation is often used, although it often requires a large number of iterations for convergence. See Richard Schultz, Li Meng, and Robert L. Stevenson, Subpixel Motion Estimation for Multiframe Resolution Enhancement, Proceedings of the SPIE (International Society for Optical Engineering), Vol. 3024 (1997), pp. 1317-1328, as to the foregoing and the details of the eight-parameter projective model.
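As a rough illustration of the eight-parameter projective model, the sketch below maps a pixel coordinate (x, y) to (x', y') using the commonly written form x' = (a1*x + a2*y + a3) / (a7*x + a8*y + 1) and y' = (a4*x + a5*y + a6) / (a7*x + a8*y + 1). The parameter ordering a1 through a8, the function name projective_map, and the example parameter values are assumptions made for illustration; the cited paper should be consulted for the exact parameterization it uses.

```python
import math

def projective_map(params, x, y):
    """Apply an eight-parameter projective transform to the point (x, y).

    params is (a1, ..., a8) in the ordering assumed above."""
    a1, a2, a3, a4, a5, a6, a7, a8 = params
    denom = a7 * x + a8 * y + 1.0
    return ((a1 * x + a2 * y + a3) / denom,
            (a4 * x + a5 * y + a6) / denom)

# Pure translation (pan): shift by (+5, -3).
translation = (1.0, 0.0, 5.0, 0.0, 1.0, -3.0, 0.0, 0.0)

# Pure zoom: uniform scaling by a factor of 2 about the origin.
zoom = (2.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)

# Pure rotation by 90 degrees about the origin.
theta = math.pi / 2
rotation = (math.cos(theta), -math.sin(theta), 0.0,
            math.sin(theta), math.cos(theta), 0.0, 0.0, 0.0)

# Tilt: a nonzero a7 makes the scaling depend on position.
tilt = (1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.01, 0.0)

print(projective_map(translation, 10.0, 10.0))
print(projective_map(zoom, 3.0, 4.0))
print(projective_map(rotation, 1.0, 0.0))
print(projective_map(tilt, 100.0, 50.0))
```

Because the whole frame shares the same eight unknowns, even a small frame supplies far more pixel observations than parameters, which is the ratio the passage above credits for the accuracy of global motion estimates.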