The public availability of large-scale repositories of visual and geometric data, together with the introduction of network-based applications such as community photo albums, video blogs, social networks, web television, and peer-to-peer networks, poses challenges in representing, indexing, searching, retrieving, and comparing data from such repositories.
A key problem with such applications is determining similarity among visual data. This may be expressed mathematically as computing a distance (metric) between two visual objects (e.g. images or parts thereof or their descriptors, where an image can be a video frame) in an appropriate space.
One of the fundamental problems in computer vision is finding correspondence between two or more images depicting an object from different views. Knowledge of such correspondence allows reconstructing the three-dimensional structure of the depicted scene (known in the literature as the shape-from-stereo problem). Solving this problem on a large-scale set of images (e.g. millions of public-domain photographs of a city) is extremely computationally challenging.
Since correspondence finding is a computationally intensive and ill-posed problem, a feature-based approach is often employed. In this approach, the images first undergo feature detection, by which a set of repeatable, stable points is detected in each image (ideally, the same points in each view of the object). Next, a local image description around each feature point is constructed by means of a feature description algorithm, which tries to provide a local image description invariant to viewpoint and illumination transformations (which can be approximated as an affine transformation of the image). A standard descriptor employed in computer vision literature is the scale-invariant feature transform, or SIFT (D. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, 2004). Ideally, given two views of the same object, the extracted features and the corresponding descriptors in the two images should be identical, in which case the process is called invariant.
Having the features and the corresponding descriptors, the correspondence between the two images can be found by finding, for each feature point in one image, the closest points (in the descriptor space) in the other. This requires computing a distance between the feature descriptors. Typically, some standard distance (e.g. the Euclidean metric) is employed. However, such a distance fails to capture the true similarity between the descriptors, since the space of descriptors may have a highly non-Euclidean structure. This partially stems from the fact that descriptors are rarely fully invariant to the real-world transformations an observed object undergoes. In particular, the affine model is just an approximation, and in practice the image undergoes more complicated transformations that should account for different illumination, lens distortions, and perspective, which are very difficult to model.
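The matching step described above can be sketched as follows. This is a minimal illustration, not a method prescribed by the text: it assumes the descriptors of each image are already available as rows of a NumPy array, and the function name `match_descriptors` is hypothetical. It uses the standard Euclidean metric, which is exactly the choice the paragraph identifies as limited.

```python
import numpy as np

def match_descriptors(desc_a, desc_b):
    """For each descriptor (row) in desc_a, find the closest descriptor in desc_b.

    Hypothetical helper: returns the index of the nearest row of desc_b for
    every row of desc_a, together with the corresponding Euclidean distance.
    """
    desc_a = np.asarray(desc_a, dtype=float)
    desc_b = np.asarray(desc_b, dtype=float)
    # Pairwise differences via broadcasting: shape (n_a, n_b, dim).
    diff = desc_a[:, None, :] - desc_b[None, :, :]
    # Euclidean distance between every pair of descriptors: shape (n_a, n_b).
    dist = np.sqrt((diff ** 2).sum(axis=2))
    nearest = dist.argmin(axis=1)
    return nearest, dist[np.arange(len(desc_a)), nearest]
```

The brute-force distance matrix is quadratic in the number of features, so at large scale an approximate nearest-neighbor index would likely be substituted for it.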
Another important application in computer vision is image retrieval: given a query image, find one or more images most similar to the query from a database of images. This application requires computing a global descriptor for each image; comparing these descriptors allows judging the similarity of the underlying images.
Straightforwardly, image retrieval can be extended to video retrieval, by considering a video as a collection of frames.
A particular setting of the image retrieval problem is the problem of copy detection, in which the query image is a modification of an image in the database as a result of editing or some distortion.
A common approach for constructing a global image descriptor is the bag of features paradigm (J. Sivic, A. Zisserman, “Video Google: A Text Retrieval Approach to Object Matching in Videos”, Proc. International Conference on Computer Vision, 2003). In this approach, the local feature descriptors obtained by the aforementioned process undergo vector quantization against a set of representative descriptors (called a “visual vocabulary”). This way, each feature descriptor is replaced by the index of the closest descriptor in the visual vocabulary (a “visual word”). By counting the occurrences of the different visual words in the image, a histogram called a “bag of features” is constructed.
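The quantization and counting steps can be illustrated with a short sketch. It assumes a visual vocabulary has already been obtained (for example by clustering training descriptors, which is not shown), and the function name `bag_of_features` is hypothetical.

```python
import numpy as np

def bag_of_features(descriptors, vocabulary):
    """Build a visual-word histogram for one image.

    descriptors: (n, dim) array of local feature descriptors.
    vocabulary:  (k, dim) array of representative descriptors (assumed given).
    Returns a length-k array of visual-word counts.
    """
    descriptors = np.asarray(descriptors, dtype=float)
    vocabulary = np.asarray(vocabulary, dtype=float)
    # Squared Euclidean distance from every descriptor to every visual word.
    diff = descriptors[:, None, :] - vocabulary[None, :, :]
    sq_dist = (diff ** 2).sum(axis=2)
    # Vector quantization: replace each descriptor by the index of its closest word.
    words = sq_dist.argmin(axis=1)
    # Count how often each visual word occurs in the image.
    return np.bincount(words, minlength=len(vocabulary))
```

For instance, with a three-word vocabulary `[[0, 0], [10, 10], [20, 20]]` and descriptors `[[1, 0], [9, 9], [0, 1], [11, 10]]`, the first and third descriptors map to word 0 and the other two to word 1, giving the histogram `[2, 2, 0]`.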
Given bag-of-features representations of the query and of a database image, their similarity is measured by computing a standard distance (e.g. the Euclidean metric) between the two bags of features. If invariant local feature descriptors (e.g. SIFT) are used in the construction of the bag of features, such a representation is also invariant. In practice, however, it is very difficult or impossible to construct a descriptor that is invariant to the wide class of realistic transformations an image can undergo.
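A minimal sketch of this comparison, and of retrieval built on it, is given below. The function names `bag_distance` and `retrieve` are hypothetical, and normalizing each histogram to unit sum before comparison is an assumption made here so that images with different numbers of features remain comparable; the text itself only specifies a standard distance.

```python
import numpy as np

def bag_distance(hist_a, hist_b, normalize=True):
    """Euclidean distance between two bag-of-features histograms."""
    a = np.asarray(hist_a, dtype=float)
    b = np.asarray(hist_b, dtype=float)
    if normalize:
        # Assumption: divide by the total count so that the number of
        # detected features does not dominate the distance.
        if a.sum() > 0:
            a = a / a.sum()
        if b.sum() > 0:
            b = b / b.sum()
    return float(np.linalg.norm(a - b))

def retrieve(query_hist, database_hists, k=1):
    """Return the indices of the k database images closest to the query."""
    distances = [bag_distance(query_hist, h) for h in database_hists]
    return [int(i) for i in np.argsort(distances)[:k]]
```

This linear scan computes one distance per database image, which is the cost that becomes prohibitive at Internet scale.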
Many computer vision methods that work well in small-scale applications fail in Internet visual data analysis applications.
The complexity of the visual data available may make it difficult to employ model-based approaches. For example, trying to model local feature variations as affine transformations or trying to model the variation of an image copy as a result of its modification or editing is hard or even impossible.
Furthermore, the large amount of data available on the Internet may limit the applicability of many typical approaches to such tasks as pattern recognition and data analysis. For example, there may be tens of millions of video clips available at some video clip sites. Dealing with such large amounts of data may lead to complexity in computation and data storage (e.g., solving the image retrieval problem at this scale requires computing distances between millions of images).
Further difficulties arise from data heterogeneity. Internet images of a scene or object may be generated by different people, at different times, at different places, and using different equipment. For example, photographs of a single architectural landmark that were uploaded to a public image repository may have been acquired by different people, using different types of cameras, under different lighting or weather conditions, and in different formats or representations. The result may be a great variability found in Internet image data collections. The diversity of such data may create problems in data fusion, or in searching and comparing images.
An even more challenging case of data heterogeneity is multi-modality. Images of similar objects or scenes may be acquired from multiple modalities. For example, images of an object or scene may be captured using imaging devices that operate in different spectral bands (e.g. infrared, visible, ultraviolet, and narrower spectral bands within the broader spectral regions). In medical applications, a single organ of a body may be imaged using various medical imaging modalities such as computed tomography, ultrasound, positron emission tomography, and magnetic resonance imaging. In photograph repositories, an object or scene (e.g. architectural landmark) may be imaged from different viewing directions or angles, or using devices that distort the image in different ways. Such different images of a single object, scene, or organ may be generated by unrelated physical processes. Thus, a feature that is visible or prominent in one image may not be in another. Each separate image may be characterized by distinct statistical properties, dimensionality, and structure.
A need to determine similarity between such multi-modal visual data may arise in various contexts. For example, such problems as fusing data from different sensors, aligning medical images, and comparing different versions and representations of a single object may require determining such similarity.
In the field of medical imaging, a particularly important application is multi-modal data fusion or alignment. In multi-modal data fusion, two or more sets of medical images (which can be two- or three-dimensional) acquired by different imaging devices (e.g. a positron emission tomography scanner and a computed tomography scanner) or by different operation modes of the same device (e.g. T1, T2, or proton density imaging in magnetic resonance imaging) are to be aligned together and mapped to the same system of coordinates.
In multi-modal alignment, a transformation (rigid or non-rigid) may be applied to an image in order to make it as similar as possible locally to another image. The local similarity is determined by means of some metric, which in the case of multiple modalities is a multi-modal metric.
In the computer vision and pattern recognition literature, metric or similarity learning methods have been described as a possible alternative to applying a standard distance to visual data. Metric learning methods may be used to implicitly discover traits in the data that are informative and discriminative between object categories, as well as invariant to natural data variability. A supervised metric learning method may be applied to attempt to infer similarity from a set of labeled examples of an object (e.g. a large number of photographs of a person or structure). A large number of labeled examples may enable construction of a training set of examples with known similarity. In particular, if the data can undergo transformations, generating examples of transformed data may enable learning the invariance of the data under such transformations.
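The construction of a training set with known similarity can be sketched as follows. This is an illustration under stated assumptions rather than a procedure taken from the text: images are assumed to be NumPy arrays, the transformations are arbitrary user-supplied callables, at least two images are assumed, and the function name `make_training_pairs` and the 1/0 labeling convention are hypothetical. The learning step itself is not shown.

```python
import numpy as np

def make_training_pairs(images, transforms, rng):
    """Generate labeled pairs (x, y, label) for supervised metric learning.

    label 1: y is a transformed copy of x (known to be similar).
    label 0: y is a different image (assumed to be dissimilar).
    Requires at least two images.
    """
    pairs = []
    for i, img in enumerate(images):
        # Positive pairs: the image and each transformed version of it.
        for transform in transforms:
            pairs.append((img, transform(img), 1))
        # Negative pair: the image and a randomly chosen different image.
        j = int(rng.integers(len(images) - 1))
        if j >= i:
            j += 1  # skip index i so that the pair is never (img, img)
        pairs.append((img, images[j], 0))
    return pairs

# Example transformations (illustrative only): mirror and brightness change.
def mirror(image):
    return image[:, ::-1]

def brighten(image):
    return np.clip(image * 1.2, 0.0, 1.0)
```

A call such as `make_training_pairs(images, [mirror, brighten], np.random.default_rng(0))` yields, for each image, one positive pair per transformation and one negative pair.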