The increase in the storage capacity of terminals and in the transmission rates in telecommunications networks has led to the emergence of new services that facilitate consumption of multimedia contents.
Thus, content providers offer online services for downloading multimedia contents, and these services are generally paid for. For contents that are protected by copyright, it is the content providers who ensure compliance.
Furthermore, the number of sites for exchanging contents where contents are made available on line by the users of those sites continues to increase. Some of those multimedia contents are created by the users themselves. Other contents comprise protected contents that are made available illegally for downloading.
It is therefore necessary to be able to detect illegal copies of a protected content.
In general, detecting copies of multimedia contents consists of searching for the presence or absence of a query content in a reference database of multimedia contents.
Such a database includes descriptors of reference multimedia contents. Conventionally, a descriptor is a numerical value or a set of numerical values characterizing a portion of the multimedia content. For example, when the multimedia content is a video, a descriptor may be defined for each of the images of the video or for a subset of them.
In order to search for the presence or absence of a query content in the reference database, the first step is to calculate the descriptors for the query content. The calculation is performed in identical manner to the calculation that was performed to determine the descriptors in the reference database.
Thereafter, a search is made in the reference database to see whether it contains descriptors that are identical or similar to those calculated for the query content. If the result is positive, it can be deduced that the query content is a copy of the multimedia content for which the descriptors have been found in the reference database.
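By way of illustration only, the search step described above can be sketched as follows. The Euclidean distance, the thresholds `max_distance` and `min_ratio`, and the dictionary layout of `reference_db` are assumptions made for this sketch; the text does not prescribe any particular similarity measure or decision rule.

```python
import math


def distance(a, b):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_matching_content(query_descriptors, reference_db,
                          max_distance=0.5, min_ratio=0.6):
    """Return the id of the best-matching reference content, or None.

    reference_db maps a content id to the list of its descriptors.
    max_distance and min_ratio are illustrative thresholds (assumptions).
    """
    if not query_descriptors:
        return None
    best_id, best_ratio = None, 0.0
    for content_id, ref_descriptors in reference_db.items():
        # Count the query descriptors that have an identical or similar
        # descriptor among those of this reference content.
        matched = sum(
            1 for q in query_descriptors
            if any(distance(q, r) <= max_distance for r in ref_descriptors)
        )
        ratio = matched / len(query_descriptors)
        if ratio >= min_ratio and ratio > best_ratio:
            best_id, best_ratio = content_id, ratio
    return best_id
```

The exhaustive comparison used here is only practical for a small database; it is shown for clarity rather than efficiency.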
The quality and the effectiveness with which multimedia content copies are detected depend on the properties of the descriptors. They must be suitable for being calculated rapidly. They must facilitate searching in the reference database. Such descriptors must also make it possible to detect a copy, even if the query multimedia content has been heavily transformed compared with the reference multimedia content (for example by a high degree of compression, a change of resolution, text or a logo being overlaid therein, etc.). Such transformations may be unintentional, e.g. transformations due to recording the content, to transcoding it, etc. Other transformations may be intentional, specifically to make illegally copied content difficult to detect.
When the multimedia content is an image, a set of images, or indeed a video, various types of descriptor can be defined.
Certain descriptors are calculated overall for an image.
Other descriptors are calculated for a portion of an image referred to as a region of interest. For a given image, several regions of interest may be identified and a respective descriptor may be calculated for each of them.
Descriptors for regions of interest in an image provide better performance than an overall descriptor of the image in terms of detecting copies of a video (or of an image or of a set of images) when the image has been subjected locally to high levels of transformation. The term “high levels of transformation” is used, for example, to cover partial masking, inserting a logo of large size, inserting a video in an original video, image cropping, etc. Even if certain regions of a video (or of an image or of a set of images) are completely missing or masked, the video remains identifiable because of the descriptors of the regions of interest that have been subjected to little or no transformation. By contrast, an overall descriptor of a video (or of an image or of a set of images) is degraded as a whole when the content has been subjected to a high level of transformation.
In the article entitled “Feature extraction and a database strategy for video fingerprinting”, Proceedings of the 5th International Conference Recent Advances in Visual Information Systems, 2002, J. Oostveen et al. propose a binary overall descriptor of an image for use in detecting copies of videos.
A first image is subdivided into rectangular blocks (e.g. 36 blocks arranged in four rows and nine columns). A value is calculated in each of the blocks, for example the mean of the pixel luminances in the block.
Thereafter, the difference is calculated between the value obtained in a block and the value obtained in the following block on the same row. This produces 32 values, i.e. 4×8 values.
The same procedure is applied to the following image.
Thereafter, the difference is calculated between a value of the first image and the corresponding value of the following image. This produces 32 new values.
A 1 or a 0 is given to the descriptor depending on the sign of the difference as calculated in this way.
The above operations are repeated on the following pairs of images for a set of contiguous images in the video.
Thereafter, all of the descriptors (the 32 binary values in the above-mentioned example) are concatenated so as to form a final descriptor.
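The steps above can be sketched as follows, as a minimal illustration rather than the implementation of the cited article. The sketch assumes that each image is a list of rows of luminance values at least as large as the block grid, and that a strictly positive difference maps to 1 and any other difference to 0; the text only states that the bit depends on the sign. All function names are hypothetical.

```python
def block_means(frame, rows=4, cols=9):
    """Mean luminance of each block of a rows x cols grid over the frame.

    Assumes the frame has at least `rows` lines and `cols` columns.
    """
    height, width = len(frame), len(frame[0])
    means = []
    for r in range(rows):
        y0, y1 = r * height // rows, (r + 1) * height // rows
        row_means = []
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            pixels = [frame[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row_means.append(sum(pixels) / len(pixels))
        means.append(row_means)
    return means


def spatial_differences(means):
    """Difference between each block and the following block on the same row."""
    return [row[c] - row[c + 1] for row in means for c in range(len(row) - 1)]


def frame_pair_bits(first, following, rows=4, cols=9):
    """Binary values for one pair of images (32 bits for a 4 x 9 grid)."""
    d_first = spatial_differences(block_means(first, rows, cols))
    d_following = spatial_differences(block_means(following, rows, cols))
    # Assumed sign convention: 1 if the temporal difference is positive, else 0.
    return [1 if a - b > 0 else 0 for a, b in zip(d_first, d_following)]


def video_descriptor(frames, rows=4, cols=9):
    """Concatenate the bits of every pair of contiguous images."""
    bits = []
    for first, following in zip(frames, frames[1:]):
        bits.extend(frame_pair_bits(first, following, rows, cols))
    return bits
```

With the default grid of four rows and nine columns, each pair of images contributes 4×8 = 32 bits, so a sequence of N contiguous images yields 32×(N−1) bits.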
The drawback of such an overall descriptor is that it performs poorly in detecting copies of a video (or of an image or of a set of images) once the content has been subjected to high levels of transformation as described above.
In the article entitled “Distinctive image features from scale-invariant keypoints”, International Journal of Computer Vision, Vol. 60, No. 2, 2004, D. G. Lowe describes a descriptor defined by region of interest in an image and used for detecting copies of videos.
The descriptor is defined for a circular region of interest. The region is said to be “scale invariant” insofar as a change in image resolution does not change the overall content of the region of interest.
In order to calculate the descriptor of a region of interest, a square is defined that encompasses said region. The square is then subdivided into blocks.
In each block, a gradient vector is calculated for each pixel. The amplitude and the orientation of each of these gradient vectors are then extracted. Thereafter, for each block, a histogram is drawn up of the orientations of the gradients, with the value of each orientation being weighted by the corresponding amplitude.
The descriptor for a region of interest is defined by concatenating the histograms obtained for the blocks making up a square that encompasses said region.
Such a descriptor is referred to as a scale-invariant feature transform (SIFT) descriptor.
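The calculation described above can be sketched as follows, as a simplified illustration and not a complete SIFT implementation. The sketch assumes that the encompassing square has already been extracted as a square list of rows of luminance values, uses central differences for the gradient, and skips the border pixels. The parameters `grid` and `bins` and the function name are choices made for this sketch. Refinements of the published method, such as Gaussian weighting, orientation normalization and normalization of the final vector, are deliberately omitted.

```python
import math


def orientation_histograms(patch, grid=4, bins=8):
    """Concatenated, amplitude-weighted orientation histograms of a square patch.

    The patch is split into grid x grid blocks; each block contributes one
    histogram of `bins` orientations.
    """
    size = len(patch)
    block = size // grid
    descriptor = [0.0] * (grid * grid * bins)
    # Border pixels are skipped because central differences need neighbours.
    for y in range(1, size - 1):
        for x in range(1, size - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            amplitude = math.hypot(gx, gy)
            if amplitude == 0.0:
                continue
            # Orientation in [0, 2*pi), quantized into `bins` intervals.
            angle = math.atan2(gy, gx) % (2 * math.pi)
            b = int(angle / (2 * math.pi) * bins) % bins
            by = min(y // block, grid - 1)
            bx = min(x // block, grid - 1)
            # Each orientation is weighted by the corresponding amplitude.
            descriptor[(by * grid + bx) * bins + b] += amplitude
    return descriptor
```

With the default values, the result has 4×4×8 = 128 real-valued components for a single region of interest.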
The components of a SIFT descriptor are real numbers. Consequently, such a descriptor is more voluminous, more complex, and more difficult to use than a binary descriptor.