1. Field of the Invention
The present invention is related generally to a data processing system and in particular to a method and apparatus for processing video. More particularly, the present invention is directed to a computer implemented method, apparatus, and computer usable program code for extraction and robust matching of segment based temporal video fingerprints for near-duplicate video identification and video piracy detection.
2. Description of the Related Art
As on-line digital content proliferates, and more and more people continue to access on-line media, there is a growing need to identify copyrighted content. For example, owners of copyrighted audio and video content are interested in identifying and removing unauthorized copies of their copyrighted content on social network and content sharing sites. Social network and content sharing sites permit users to post content, such as music, photos, and videos, for viewing by other users of the website. Social network and content sharing sites include, without limitation, YouTube®, Facebook®, and MySpace®. The users of these social network and content sharing sites frequently utilize pirated movies, images, and/or television (TV) shows.
The owners of copyrighted audio and video content are also interested in identifying authorized appearances of their content in order to ensure the owners of the copyrights are compensated appropriately for each occurrence of the copyrighted content. For example, an owner may wish to ensure appropriate compensation is paid for each time that a particular song is played on the radio.
Advertisers, on the other hand, are interested in monitoring the appearances of their advertisements on television, radio, and/or the Internet, for example, in order to make sure advertising content is aired the appropriate number of times. These applications share the need to identify copies or near-duplicates of known copyrighted digital media, such as audio and/or video, from among a repository of unknown media, online videos, radio, and/or television.
Currently available solutions for identifying and protecting copyrighted content include watermarking and fingerprinting. Watermarking inserts a visible or invisible watermark into video content, which identifies the rightful owner of the content. The watermarking technology is designed so that the watermark is automatically transferred to any exact copies of the video as well as to any derivative content that is created based upon the watermarked piece of original content. Any such copies or derivative works, whether authorized or unauthorized, can be identified by scanning for the presence of the watermark embedded within the copied or derivative video content.
However, even though watermarks are designed to be difficult to remove without destroying the video content itself, watermarks can be defeated and removed. If a watermark is successfully removed, the video content becomes permanently unlocked and unauthorized duplication or derivation can no longer be monitored and/or detected via the watermark.
Due to the problems with watermarks, another approach, referred to as content-based fingerprinting and matching of content, has recently been gaining momentum because content-based fingerprinting does not rely on the presence of any watermark in the video content. With this approach, the entire piece of content is considered a “pseudo-watermark”, and is summarized into one or more unique fingerprints that characterize the unique audio-visual aspects of the content. To identify whether two pieces of content are copies or derivatives of each other, the content-based fingerprints for the two pieces of content are compared. If the content-based fingerprints are sufficiently similar, the two pieces of content are declared copies, near-duplicates, or derivatives.
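The comparison step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each fingerprint is a fixed-length bit string held in a Python integer and that similarity is measured as the fraction of agreeing bits. The names `hamming_similarity` and `is_near_duplicate`, the 64-bit length, and the 0.9 threshold are hypothetical choices, not details taken from any particular fingerprinting system.

```python
def hamming_similarity(fp_a: int, fp_b: int, num_bits: int = 64) -> float:
    """Return the fraction of bit positions on which two fingerprints agree.

    Assumption: fingerprints are bit strings of length num_bits stored as ints.
    """
    mask = (1 << num_bits) - 1
    differing = bin((fp_a ^ fp_b) & mask).count("1")
    return 1.0 - differing / num_bits


def is_near_duplicate(fp_a: int, fp_b: int,
                      num_bits: int = 64, threshold: float = 0.9) -> bool:
    """Declare two pieces of content copies when their fingerprints are
    sufficiently similar. The threshold value is an illustrative assumption."""
    return hamming_similarity(fp_a, fp_b, num_bits) >= threshold


if __name__ == "__main__":
    original = 0xF0F0F0F0F0F0F0F0
    slightly_altered = original ^ 0b111      # three bits flipped
    unrelated = original ^ 0xFFFFFFFF        # thirty-two bits flipped
    print(is_near_duplicate(original, slightly_altered))
    print(is_near_duplicate(original, unrelated))
```

In practice the threshold trades false positives against missed copies, so it would be tuned for the fingerprint representation actually in use.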
Content-based video fingerprinting includes audio-based fingerprinting methods, which uniquely characterize the audio track or the speech in a video. Content-based fingerprinting can also be based on extracting key frames from the video and using the visual characteristics of those key frames to create visual key frame-based fingerprints. The collection of these frame-based fingerprints is then used to describe each video. The frame-based visual features can be global or local in nature. In other words, the frame-based visual features can be extracted from the entire frame or from one or more regions of a frame.
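The distinction between global and local frame-based features can be sketched as follows. The sketch assumes, for illustration only, that a key frame is a two-dimensional list of grayscale intensities and that the feature is a mean intensity. The names `global_feature` and `local_features` and the 2-by-2 grid are hypothetical, and real systems typically use richer descriptors such as color histograms or interest-point features.

```python
from typing import List

Frame = List[List[int]]  # assumption: rows of grayscale pixel intensities


def global_feature(frame: Frame) -> float:
    """Global feature: mean intensity computed over the entire frame."""
    pixels = [p for row in frame for p in row]
    return sum(pixels) / len(pixels)


def local_features(frame: Frame, grid: int = 2) -> List[float]:
    """Local features: mean intensity of each cell in a grid x grid
    partition of the frame, listed in row-major order."""
    height, width = len(frame), len(frame[0])
    features = []
    for gy in range(grid):
        for gx in range(grid):
            rows = range(gy * height // grid, (gy + 1) * height // grid)
            cols = range(gx * width // grid, (gx + 1) * width // grid)
            block = [frame[y][x] for y in rows for x in cols]
            features.append(sum(block) / len(block))
    return features


if __name__ == "__main__":
    key_frame = [
        [0, 0, 255, 255],
        [0, 0, 255, 255],
        [100, 100, 50, 50],
        [100, 100, 50, 50],
    ]
    print(global_feature(key_frame))
    print(local_features(key_frame))
```

A global feature summarizes the whole frame in one value, whereas the local features preserve some spatial layout, which is why the two kinds can respond differently to transformations such as cropping.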
Content-based fingerprinting typically requires that the fingerprints of a video and of its copies remain similar, that is, that the fingerprints be invariant with respect to many common editing operations and image/video processing transformations. Common editing operations include, without limitation, cuts, splices, and/or re-ordering. Image/video processing transformations include, without limitation, cropping, scaling, aspect ratio changes, video re-capturing or re-compressing, global illumination changes, color space conversions, color reductions, data corruption, and addition of noise.
Currently available content-based fingerprinting approaches work with varying degrees of success across the range of plausible video transformations that are observed in video copies, primarily because the successful matching of fingerprints requires complex frame alignment in addition to a robust frame-based fingerprinting technique. The frame-based fingerprinting technique should be invariant to most transformations.
Content-based fingerprinting becomes inaccurate and unreliable in the presence of frame alignment problems and missing or incorrectly sampled frames. Any image processing transformation that changes the visual appearance of the frames sufficiently can also defeat the frame-based matching approaches. In other words, current content-based fingerprinting is typically unable to detect copies and derivative video content where the video sample has been subjected to editing operations.