There are many applications where it would be desirable to identify a full-length video that is resident in a large database or distributed over a network such as the Internet using only a short video clip. One such application involves the identification and removal of innumerable illegal copies of copyrighted video content that reside in popular video-sharing websites and peer-to-peer (P2P) networks on the Internet. It would be desirable to have a robust content-identification system that detects and removes copyright-infringing, perceptually identical video content from the databases of such websites and prevents any future uploads of such content by users of these websites.
A computer vision technique that meets the goals of such an application is called video fingerprinting. Video fingerprinting offers a solution to query and identify short video segments from a large multimedia repository using a set of discriminative features. The term fingerprint refers to a signature having the following properties:
1) Fingerprints must be small in dimension and capture all perceptually important video-related information crucial for identifying the video.
2) Fingerprints of two different videos must also be significantly different.
3) Matching two fingerprints should be enough to declare the corresponding videos as being the same.
Practical video fingerprinting techniques need to meet accuracy and speed requirements. With regard to accuracy, it is desirable for a querying video clip to be able to identify content in the presence of common distortions. Such distortions include blurring, resizing, changes in source frame rates and bit rates, changes in video formats, resolution, illumination settings, color schemes, letterboxing, and frame cropping. With regard to speed, a video fingerprinting technique should determine a content match with a small turn-around time, which is crucial for real-time applications. A common denominator of many fingerprinting techniques is their ability to capture and represent perceptually relevant multimedia content in the form of short robust hashes for fast retrieval.
In some existing content-based techniques known in the prior art, video signatures are computed employing features such as mean-luminance, centroid of gradient, rank-ordered image intensity distribution, and centroid of gradient orientations, over fixed-sized partitions of video frames. The limitation of employing such features is that they encode complete frame information and therefore fail to identify videos when presented with queries having partially cropped or scaled data. This motivates the use of a local fingerprinting approach.
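The limitation described above can be illustrated with a minimal sketch. It assumes the simplest of the listed features (mean luminance over a fixed grid of partitions) and a tiny synthetic frame. The function names, the 2x2 grid, and the L1 distance are illustrative assumptions and are not taken from any cited technique.

```python
def block_mean_signature(frame, rows=2, cols=2):
    """Mean luminance of each cell in a fixed rows x cols grid over the frame.

    `frame` is a list of equal-length rows of luminance values (hypothetical
    input format chosen for this sketch).
    """
    height, width = len(frame), len(frame[0])
    signature = []
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r * height // rows, (r + 1) * height // rows
            x0, x1 = c * width // cols, (c + 1) * width // cols
            cell = [frame[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            signature.append(sum(cell) / len(cell))
    return signature


def l1_distance(a, b):
    """Sum of absolute differences between two signatures."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Synthetic 8x8 frame: bright left half, dark right half.
frame = [[200] * 4 + [20] * 4 for _ in range(8)]

# Same content with the two left-most columns cropped away.
cropped = [row[2:] for row in frame]

original_sig = block_mean_signature(frame)
cropped_sig = block_mean_signature(cropped)

# The grid is re-laid over the cropped frame, so the cells no longer
# cover the same content and the signature shifts.
print(original_sig, cropped_sig, l1_distance(original_sig, cropped_sig))
```

Because the partitions are defined relative to the frame boundary, cropping moves every cell onto different content, and the signature of the cropped frame no longer matches that of the original even though the visible content is unchanged.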
In Sivic, J., and Zisserman, A., “Video google: A text retrieval approach to object matching in videos,” ICCV 2, 1-8 (2003) (hereinafter “Sivic and Zisserman”), a text-retrieval approach to object recognition is described that uses two-dimensional maximally stable extremal regions (MSERs), first proposed in Matas, J., Chum, O., Urban, M., Pajdla, T., “Robust wide baseline stereo from maximally stable extremal regions,” BMVC 1, 384-393 (2002), as representations of each video frame. In summary, MSERs are image regions which are covariant to affine transformations of image intensities.
Since the method of Sivic and Zisserman clusters semantically similar content together in its visual vocabulary, it is expected to offer poor discrimination, for example, between different seasons of the same TV program having similar scene settings, camera capture positions and actors. A video fingerprinting system is expected to provide good discrimination between such videos.
Similar to Sivic and Zisserman, Nister, D., and Stewenius, H., “Scalable recognition with a vocabulary tree,” CVPR 2, 2161-2168 (2006) (hereinafter “Nister and Stewenius”) proposes an object recognition algorithm that extracts and stores MSERs based on a group of images of an object, captured under different viewpoint, orientation, scale and lighting conditions. During retrieval, a database image is scored depending on the number of MSER correspondences it shares with the given query image. Only the top-scoring hits are then scanned further. Hence, fewer MSER pairs reduce the likelihood that the correct database image appears among the top-ranked images.
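The retrieval step described above, scoring each database image by the number of correspondences it shares with the query and keeping only the top hits, can be sketched as follows. This is a simplified illustration and not the method of Nister and Stewenius: integer IDs stand in for matched region descriptors, the score is a plain shared-feature count, and the vocabulary tree and any weighting scheme are omitted. All names and data are hypothetical.

```python
def rank_by_shared_features(query_features, database, top_k=2):
    """Score each database image by how many feature IDs it shares with the
    query, and return the top_k (image_id, score) pairs."""
    query = set(query_features)
    scores = [(image_id, len(query & set(features)))
              for image_id, features in database.items()]
    # Highest score first; ties are broken by image ID for determinism.
    scores.sort(key=lambda item: (-item[1], item[0]))
    return scores[:top_k]


# Hypothetical database: each image is a set of quantized region IDs.
database = {
    "img_a": {1, 2, 3, 4, 5},
    "img_b": {4, 5, 6, 7},
    "img_c": {8, 9},
}

# An undistorted query of img_a yields all five of its regions.
full_query = {1, 2, 3, 4, 5}

# A heavily distorted query of img_a in which only one region survives.
degraded_query = {5}

full_ranking = rank_by_shared_features(full_query, database)
degraded_ranking = rank_by_shared_features(degraded_query, database)

print(full_ranking)
print(degraded_ranking)
```

With the full query, the correct image leads by a clear margin. With the degraded query, the correct image ties with an unrelated one, which illustrates how a drop in the number of surviving correspondences erodes the ranking.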
Since a fingerprinting system needs to identify videos even when queried with short distorted clips, the methods of both Sivic and Zisserman and Nister and Stewenius become unsuitable, because strong degradations such as blurring, cropping, and frame letterboxing result in fewer suitable MSERs being found in a distorted image than in its original. Such degradations have a direct impact on the algorithm's performance because they change the representation of a frame.
In Massoudi, A., Lefebvre, F., Demarty, C.-H., Oisel, L., and Chupeau, B., “A video fingerprint based on visual digest and local fingerprints,” ICIP, 2297-2300 (2006) (hereinafter “Massoudi et al.”), Massoudi et al. propose an algorithm that first segments a query video into shots, extracts key-frames, and then performs local fingerprinting. A major drawback of this approach is that even the most common forms of video processing, such as blurring and scaling, disturb the key-frame selection and introduce misalignment between the query and database frames.
Accordingly, what would be desirable, but has not yet been provided, is a method and system for effectively and automatically matching a video clip to one of a plurality of stored videos using a fingerprint technique derived from the video clip that is fast and immune to common distortions.