The invention relates to a method and apparatus for representing an image, and a method and apparatus for assessing the similarity between images.
In Lienhart, R., “Comparison of Automatic Shot Boundary Detection Algorithms”, In Proceedings of Image and Video Processing VII 1999, Proc. SPIE 3656-29, pp. 290-301, January 1999, a method is presented for the detection of shot transitions in a video sequence. For each frame in the sequence, a 3-dimensional histogram in the RGB colour space is created. The difference between consecutive frames in the sequence is then calculated as the distance between their respective histograms, namely the sum of absolute bin-wise differences. Shot transitions are then identified by searching for the distances which are above a predetermined fixed threshold. Thus, this method detects shot transitions based solely on spatially insensitive colour content information, and does not make use of the wealth of information that is present in the spatial arrangement and interrelations of colours.
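By way of illustration only, the histogram comparison described above might be sketched as follows. This is a minimal sketch and not the cited authors' implementation: the function names, the representation of a frame as a list of (R, G, B) tuples, and the choice of four bins per channel are all assumptions made for the example.

```python
def rgb_histogram(frame, bins_per_channel=4):
    """Build a 3-D RGB histogram, stored as a flat list of bin counts.

    `frame` is assumed to be a list of (r, g, b) tuples with values 0..255.
    """
    step = 256 // bins_per_channel

    def bin_of(value):
        # Clamp so that bin counts which do not divide 256 stay in range.
        return min(value // step, bins_per_channel - 1)

    hist = [0] * bins_per_channel ** 3
    for r, g, b in frame:
        index = (bin_of(r) * bins_per_channel + bin_of(g)) * bins_per_channel + bin_of(b)
        hist[index] += 1
    return hist


def histogram_distance(hist_a, hist_b):
    """Sum of absolute bin-wise differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(hist_a, hist_b))


def detect_transitions(frames, threshold):
    """Return the indices of frames that start a new shot.

    A transition is declared wherever the histogram distance between a
    frame and its predecessor exceeds the fixed threshold.
    """
    hists = [rgb_histogram(frame) for frame in frames]
    return [
        i + 1
        for i in range(len(hists) - 1)
        if histogram_distance(hists[i], hists[i + 1]) > threshold
    ]
```

Because only bin counts are compared, any rearrangement of the same pixels within a frame yields a distance of zero, which is the spatial insensitivity noted above.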
In Zabih, R., Miller, J., Mai, K., “A Feature-Based Algorithm for Detecting and Classifying Scene Breaks”, In Proceedings of 1995 3rd ACM International Conference on Multimedia, San Francisco, Calif. USA, pp. 189-200, 1995, a different method is presented for the detection of shot transitions in a video sequence. For each frame in the sequence, an edge map is calculated. The difference between consecutive frames in the sequence is then calculated based on the number of edges which are present in the first frame but not in the second and the number of edges which are present in the second frame but not in the first. Then, sharp peaks in the time series of this difference measure indicate the presence of a shot transition. Thus, this method detects shot transitions based solely on edge information, which is one type of spatial interrelation information. Although the rationale is correct, this method does not make use of the wealth of information that is present in the colour content of the frame. Furthermore, the edge map creation process is computationally expensive and is meant to reveal only the strongest colour discontinuities within the frame. The method is also quite sensitive to motion. The authors suggest the use of an image registration technique to counteract this shortcoming, but such processes are themselves computationally expensive.
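The edge-based difference measure described above may be illustrated by the following minimal sketch. It is not the cited authors' algorithm: the simple neighbour-difference edge detector, the function names, the default threshold and the representation of a frame as a list of rows of grey values are assumptions chosen to keep the example short.

```python
def edge_map(frame, threshold):
    """Return the set of (row, col) positions regarded as edge pixels.

    Assumption: a pixel is an edge pixel when it differs from its right or
    lower neighbour by more than `threshold`. `frame` is a list of rows of
    grey values.
    """
    rows, cols = len(frame), len(frame[0])
    edges = set()
    for y in range(rows):
        for x in range(cols):
            right = abs(frame[y][x] - frame[y][x + 1]) if x + 1 < cols else 0
            down = abs(frame[y][x] - frame[y + 1][x]) if y + 1 < rows else 0
            if max(right, down) > threshold:
                edges.add((y, x))
    return edges


def edge_change(frame_a, frame_b, threshold=50):
    """Count edges that disappear and edges that appear between two frames.

    Returns (exiting, entering): edges in the first frame but not the
    second, and edges in the second frame but not the first.
    """
    edges_a = edge_map(frame_a, threshold)
    edges_b = edge_map(frame_b, threshold)
    return len(edges_a - edges_b), len(edges_b - edges_a)
```

In this sketch, a vertical edge that moves by a single pixel between frames is counted as entirely exiting and entirely entering, which illustrates the sensitivity to motion mentioned above.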
In Dailianas, A., Allen, R. B., England, P., “Comparison of Automatic Video Segmentation Algorithms”, SPIE Integration Issues in Large Commercial Media Delivery Systems, vol. 2615, pp. 2-16, October 1995, another method is presented for the detection of shot transitions in a video sequence. The difference between consecutive frames in the sequence is calculated as the sum of absolute pixel-wise differences. Shot transitions are then identified by searching for the distances which are above a predetermined fixed threshold. Thus, this method detects shot transitions based solely on spatially sensitive colour content information. Although the rationale is correct, this method does not make use of the wealth of information that is present in the spatial interrelations of colours. Furthermore, such a simple processing of the video results in high sensitivity to noise and motion. A motion compensation algorithm could address the motion sensitivity problem, but such processes are computationally expensive.
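The pixel-wise difference described above may be sketched as follows. This is an illustrative sketch only: the function name and the representation of a frame as a flat list of grey values are assumptions made for the example.

```python
def pixel_distance(frame_a, frame_b):
    """Sum of absolute pixel-wise differences between two frames.

    Both frames are assumed to be flat lists of grey values of equal length.
    """
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same number of pixels")
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b))


# A striped pattern and the same pattern displaced by one pixel.
striped = [0, 255, 0, 255]
shifted = [255, 0, 255, 0]
```

For the two example frames, which contain the same stripes displaced by a single pixel, `pixel_distance(striped, shifted)` evaluates to 1020, the largest value possible for four 8-bit pixels. A small motion within a shot can therefore exceed a fixed threshold just as readily as a genuine transition.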
In Xiong, W., “Shot Boundary Detection”, US 2003/0091235 A1, published 15 May, 2003, a method is presented for the detection of shot transitions based on the combination of different types of information. That method comprises calculating a block-based difference between two frames and, if it exceeds a fixed threshold, declaring a candidate shot transition. In this case, the shot transition is verified by requiring that colour and/or edge differences between the two frames also exceed fixed thresholds. For the calculation of the block-based difference, the frames are divided into blocks and block averages are calculated. Then, the difference between corresponding blocks is thresholded to determine if two blocks are similar or different, and the number of different blocks between two frames is thresholded to determine if two frames are similar or different. The colour difference is the sum of absolute bin-wise differences, while the edge difference uses edge histograms, capturing edge magnitude and direction information.
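The block-based stage of the method described above, with its two successive thresholds, may be sketched as follows. This is a minimal sketch under stated assumptions and not the implementation of the cited publication: the function names, the parameter names and the representation of a frame as a list of rows of grey values are hypothetical, and the subsequent colour and edge verification is omitted.

```python
def block_means(frame, block_size):
    """Average grey value of each block, scanning blocks row by row."""
    rows, cols = len(frame), len(frame[0])
    means = []
    for top in range(0, rows, block_size):
        for left in range(0, cols, block_size):
            pixels = [
                frame[y][x]
                for y in range(top, min(top + block_size, rows))
                for x in range(left, min(left + block_size, cols))
            ]
            means.append(sum(pixels) / len(pixels))
    return means


def count_different_blocks(frame_a, frame_b, block_size, block_threshold):
    """First threshold: count corresponding blocks whose averages differ."""
    means_a = block_means(frame_a, block_size)
    means_b = block_means(frame_b, block_size)
    return sum(1 for a, b in zip(means_a, means_b) if abs(a - b) > block_threshold)


def is_candidate_transition(frame_a, frame_b, block_size, block_threshold, count_threshold):
    """Second threshold: declare a candidate when enough blocks differ."""
    different = count_different_blocks(frame_a, frame_b, block_size, block_threshold)
    return different > count_threshold
```

A frame pair flagged by `is_candidate_transition` would, in the method described above, still need to pass the colour and/or edge difference checks before a shot transition is confirmed.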
In Nakajima, Y., Sugano, M., Yanagihara, H., for KDDI CORPORATION (JP), “Picture Searching Apparatus”, US 2004/0091044 A1, published 13 May, 2004, a method is presented for the detection of shot transitions based on (a) correlation between images, (b) correlation between subsampled images, (c) motion between images and (d) motion between subsampled images. There, the correlation between images and between subsampled images is measured as a pixel-wise difference or a histogram difference and the motion between images and between subsampled images is measured based on various motion vector differences.
In Jafarkhani, H., Shahraray, B., for AT&T CORP. (US), “Method for Analyzing Video”, U.S. Pat. No. 6,542,619 B1, granted 1 Apr. 2003, a shot transition detection method is presented which comprises creating two one-dimensional projections of a video frame, i.e. row and column projections, performing a wavelet transform on each projection and retaining only the high frequency components (i.e. the wavelet coefficients), and auto-correlating the high frequency components of each transform. For a series of video frames, a shot transition is indicated when the resultant auto-correlation coefficient time curves exhibit a predetermined maximum value. Thus, that method employs spatially sensitive colour content and interrelation information, provided by the wavelet transform, but that information relates not to frames but to frame projections, resulting in great information loss.
In Jacobs, C. E., Finkelstein, A., Salesin, D. H., “Fast Multiresolution Image Querying”, In Proceedings of 1995 ACM SIGGRAPH Conference, Los Angeles Calif., USA, Aug. 9-11, pp. 277-286, 1995, a method for retrieving images similar to a given image is presented. With that method images are initially represented by their Haar wavelet decomposition. Then, this decomposition is truncated, i.e. only the scaling function coefficient (average intensity) and a very small number of the largest magnitude wavelet coefficients are retained. Then, the truncated decomposition is quantised, i.e. only the signs of the wavelet coefficients are retained. Thus, a single image descriptor is formed that characterises the image for the purposes of image retrieval.
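The decompose, truncate and quantise sequence described above may be illustrated by the following sketch. It is offered under stated assumptions and is not the cited authors' implementation: it uses an unnormalised averaging form of the Haar transform applied first to rows and then to columns, it assumes a square single-channel image whose side is a power of two, and the function names are hypothetical.

```python
def haar_1d(values):
    """Full 1-D Haar decomposition: overall average first, then details."""
    output = list(values)
    length = len(output)
    while length > 1:
        half = length // 2
        averages = [(output[2 * i] + output[2 * i + 1]) / 2 for i in range(half)]
        details = [(output[2 * i] - output[2 * i + 1]) / 2 for i in range(half)]
        output[:length] = averages + details
        length = half
    return output


def haar_2d(image):
    """Decompose every row, then every column of the row-transformed image."""
    rows = [haar_1d(row) for row in image]
    columns = [haar_1d(column) for column in zip(*rows)]
    return [list(row) for row in zip(*columns)]


def truncated_signature(image, keep):
    """Return (average intensity, {position: sign}) for the `keep` largest
    magnitude wavelet coefficients; zero-valued coefficients are dropped."""
    coefficients = haar_2d(image)
    average = coefficients[0][0]  # scaling function coefficient
    wavelet = [
        ((y, x), value)
        for y, row in enumerate(coefficients)
        for x, value in enumerate(row)
        if (y, x) != (0, 0)
    ]
    # Truncation: keep only the largest-magnitude coefficients.
    wavelet.sort(key=lambda item: abs(item[1]), reverse=True)
    # Quantisation: keep only the sign of each retained coefficient.
    signs = {pos: (1 if value > 0 else -1) for pos, value in wavelet[:keep] if value != 0}
    return average, signs
```

For a 4×4 image whose left half is 0 and whose right half is 8, `truncated_signature(image, 3)` yields an average of 4 and a single retained sign, negative, at position (0, 1). The magnitude of the step is discarded, so a brightness step of 8 and one of 200 produce the same sign.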
In Zhuang, Z.-Y., Hsu, C.-T., Chen, H.-Y., Ouhyoung, M., Wu, J.-L., “Efficient Multiresolution Scene Change detection by Wavelet Transformation”, In Proceedings of 1997 IEEE International Conference on Consumer Electronics ICCE '97, Taipei, Taiwan, June 11-13, pp. 250-251, 1997, a method for the detection of shot transitions is proposed that proceeds to characterise video frames in the same manner described in “Fast Multiresolution Image Querying”. The difference between the methods of “Fast Multiresolution Image Querying” and “Efficient Multiresolution Scene Change detection by Wavelet Transformation” is that with the method of the latter, the perimeter of frames is discarded and frames are reduced to only their central parts. Such an approach leads to great information loss and can result in false video segmentation and/or great over-segmentation when significant motion is present in the video.
A deficiency common to both of the methods described above is the assumption that a wavelet decomposition can be efficiently truncated by retaining only a very small number of the largest magnitude coefficients. To put this in context, a multi-scale wavelet decomposition of an image plane starting at 128×128 pixels down to 2×2 pixels produces 16383 wavelet coefficients. As those skilled in the art know, truncating this series to a very small number of coefficients on the basis of magnitude, e.g. the 40 or 60 coefficients with the largest magnitude as the authors suggest, results in descriptors which are extremely susceptible to noise, susceptible to partial occlusions for both image retrieval and video segmentation, and susceptible to high video motion and intra-shot lighting effects for video segmentation, to name but a few problems. Quantising the truncated series by retaining only the signs amplifies the problem.
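The coefficient count quoted above can be checked with a short calculation. The sketch below assumes that each decomposition level halves the side of the image plane and produces three detail coefficients per remaining position, and that the decomposition continues until the final 2×2 block has been reduced to a single average; the function name is hypothetical.

```python
def wavelet_coefficient_count(size):
    """Count the detail coefficients of a full multi-scale decomposition
    of a size x size image plane (size assumed to be a power of two)."""
    count = 0
    while size > 1:
        size //= 2
        # Horizontal, vertical and diagonal details at this scale.
        count += 3 * size * size
    return count


total = wavelet_coefficient_count(128)
retained_fraction = 60 / total
```

Under these assumptions `total` is 16383, so retaining even 60 coefficients keeps fewer than 0.4% of them.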
Another significant problem with these methods is that the semantic information attached to the coefficients of the Haar wavelet decomposition is not exploited. Such semantic information includes the particular colour information that a coefficient represents, e.g. R of RGB or Y of YCbCr, the particular image scale in which a coefficient exists, e.g. is it a coefficient at a high image scale capturing fine detail or a coefficient at a low image scale capturing coarse image information, and so on.
Here, methods for assessing the similarity between images are set out, for example for the retrieval of images from a set of images that are similar to a given image or for the detection of frame discontinuities, such as shot transitions or lighting and other effects, in digital video. The methods rely on the extraction of image descriptors capturing spatially sensitive colour content and interrelation information at one or more image scales and across one or more image channels, followed by the combination of those descriptors not into a single descriptor but into multiple descriptors distinguished by semantic content, and the use of those descriptors in multiple decision frameworks that effectively exploit said semantic content. Thus, unlike the previous methods, it is possible to establish complex relations between images, for example to establish that two images depict the same scene but one has a very significant occlusion, such as a person walking in front of the camera, or that two images depict the same scene but captured under different lighting conditions, or that two frames belong to the same shot but appear very different because of global lighting effects.