The exemplary embodiment relates to image processing and finds particular application in connection with object detection in images.
There are many cases where it is desirable to match objects in images acquired by different cameras in different locations. For example, still cameras or video cameras may be positioned to acquire images for use in automated or semi-automated toll assessment for toll roads and bridges, automated monitoring of a parking facility, camera-based enforcement of speed limits or other traffic regulations, monitoring of carpool lanes, roadway usage studies, and the like. Depending upon the application, each acquired vehicle image may be an image of the entire vehicle, or an image of a portion of the vehicle, such as the rear license plate.
One problem with matching an object in different images (referred to as re-identification) is that the imaging conditions may be different. The difference in imaging conditions may be due to various reasons, such as cameras placed at different angles, differences in backgrounds, differences in lighting conditions (due, for example, to the time of day or to different weather conditions), camera settings, camera resolution or other camera characteristics, amount of motion blur, and post-processing. In general, if the difference in imaging conditions is significant, then it may impact computer vision tasks, such as object recognition or image matching. One reason is that even when the same features are extracted in both instances, the imaging conditions can strongly affect the feature distribution. This means that the assumptions of a classifier trained for one set of conditions do not always hold for the other.
For image matching, a feature-based representation of a captured image is often generated. For example, one method of representing an image or a part of an image is with a Fisher Vector (FV). In this method, it is assumed that a generative model exists (such as a Gaussian Mixture Model (GMM)) from which descriptors of image patches are emitted, and the Fisher Vector components are the gradient of the log-likelihood of the descriptor with respect to one or more parameters of the model. Each patch used for training can thus be characterized by a vector of weights, one (or more) weight(s) for each of a set of Gaussian functions forming the mixture model. Given a new image, a representation can be generated (often called an image signature) based on the characterization of its patches with respect to the trained GMM.
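By way of illustration only, the gradient computation described above can be sketched as follows for the case where the gradient is taken with respect to the Gaussian means alone. This is a minimal sketch, not the method of the exemplary embodiment: the function name `fisher_vector_means`, the restriction to diagonal covariances, the scaling by the square root of the mixture weight, and the final L2 normalization are all assumptions made for the example, and the GMM parameters are assumed to have been trained beforehand.

```python
import numpy as np

def fisher_vector_means(descriptors, weights, means, sigmas):
    """Illustrative Fisher Vector restricted to gradients w.r.t. the GMM means.

    descriptors: (N, D) patch descriptors from one image
    weights:     (K,)   mixture weights of a pre-trained GMM (assumed given)
    means:       (K, D) Gaussian means
    sigmas:      (K, D) per-dimension standard deviations (diagonal covariance assumed)
    """
    X = np.asarray(descriptors, dtype=float)
    w = np.asarray(weights, dtype=float)
    mu = np.asarray(means, dtype=float)
    sd = np.asarray(sigmas, dtype=float)
    n, d = X.shape

    # Standardized difference between every descriptor and every mean: (N, K, D)
    diff = (X[:, None, :] - mu[None, :, :]) / sd[None, :, :]

    # Log of w_k * N(x_n | mu_k, diag(sd_k^2)) for every pair (n, k): (N, K)
    log_prob = (np.log(w)[None, :]
                - 0.5 * np.sum(diff ** 2, axis=2)
                - np.sum(np.log(sd), axis=1)[None, :]
                - 0.5 * d * np.log(2.0 * np.pi))

    # Soft assignment of each descriptor to each Gaussian (numerically stable)
    log_prob -= log_prob.max(axis=1, keepdims=True)
    gamma = np.exp(log_prob)
    gamma /= gamma.sum(axis=1, keepdims=True)

    # Averaged gradient w.r.t. each mean, scaled by 1/sqrt(w_k) (assumed scaling)
    grad = np.einsum('nk,nkd->kd', gamma, diff) / (n * np.sqrt(w)[:, None])

    # Concatenate the K gradient blocks and L2-normalize (assumed post-processing)
    fv = grad.ravel()
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv
```

The resulting signature has K times D components, one block of D values per Gaussian function, which is consistent with the description of one or more weights per Gaussian in the mixture model.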
In a typical transportation application, cameras are placed at various strategic locations, for example, at various toll booths. Each camera is independently trained and thereafter used to generate representations of vehicles at (or passing through) the location. If two representations match, it can be assumed that the vehicles are the same. However, even small variations between the images captured with different cameras can impact performance significantly.
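As a simple illustration of what it may mean for two representations to match, the sketch below compares two image signatures by cosine similarity and declares a match when the similarity reaches a threshold. The function name `is_match`, the choice of cosine similarity, and the threshold value are hypothetical choices made for this example; the text above does not prescribe a particular comparison measure.

```python
import numpy as np

def is_match(signature_a, signature_b, threshold=0.5):
    """Return True if two image signatures are similar enough to be treated as a match.

    The cosine similarity measure and the default threshold are illustrative
    assumptions, not values taken from the description.
    """
    a = np.asarray(signature_a, dtype=float)
    b = np.asarray(signature_b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # An all-zero signature carries no information, so report no match
        return False
    return bool(np.dot(a, b) / denom >= threshold)
```

Because such a comparison depends only on the signatures, any shift in the feature distribution between two cameras changes the similarity scores directly, which is one way to see why small differences in imaging conditions can degrade matching.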
Domain adaptation techniques have been developed for adapting data from one domain to use in another. Jiang, J., “A literature survey on domain adaptation of statistical classifiers,” Technical report pp. 1-12 (2008), and Beijbom, O. “Domain adaptations for computer vision applications,” Technical report, arXiv:1211.4860v1 [cs.CV] 20 pp. 1-9 (November 2012) provide surveys focusing on learning theory and natural language processing applications, and on computer vision applications, respectively. Some approaches focus on transforming the feature space in order to bring the domains closer. In some cases, an unsupervised transformation, generally based on PCA projections, is used. See, Gopalan, R., et al., “Domain adaptation for object recognition: An unsupervised approach,” ICCV, pp. 999-1006 (2011); Gong, B., et al., “Geodesic flow kernel for unsupervised domain adaptation,” CVPR, pp. 2066-2073 (2012); and Fernando, B., et al., “Unsupervised visual domain adaptation using subspace alignment,” ICCV, pp. 2960-2967 (2013). In others, metric learning that exploits class labels (in general both in the source and in the target domain) is used to learn a transformation of the feature space such that in this new space the instances of the same class become closer to each other than to instances from other classes, independently of the domain to which they belong. See, Zha, Z.-J., et al., “Robust distance metric learning with auxiliary knowledge,” IJCAI, pp. 1327-1332 (2009); Saenko, K., et al., “Adapting visual category models to new domains,” ECCV, Vol. 6314 of Lecture Notes in Computer Science, pp. 213-226 (2010); Kulis, B., et al., “What you saw is not what you get: Domain adaptation using asymmetric kernel transforms,” CVPR, pp. 1785-1792 (2011); and Hoffman, J., et al., “Discovering latent domains for multisource domain adaptation,” ECCV, Vol. Part II, pp. 702-715 (2012).
Many of these techniques are geared toward classification problems and would therefore be difficult to apply to a matching problem, such as re-identification, where there is no notion of class. Others require significant amounts of training data, which is not practical for many applications.
The exemplary embodiment provides a system and method for generating image representations, such as Fisher Vectors, which reduce the effect of differences in imaging conditions on image matching.