Port security is an important component of homeland security for guarding against terror threats. For instance, a vessel may be carrying explosives or may harbor purported terrorists. There is therefore a need for visual monitoring and identification of vessels nearing ports and navigable rivers.
In the past, substantially large vessels, such as tankers and enemy ships, have been detected using ground-based radar and/or land-based optical or infrared cameras. Unfortunately, small vessels pose a greater security threat than large vessels, since small vessels frequently do not have on-board radar ID systems. Such small vessels need to be tracked in an uninterrupted manner, and live and forensic events need to be detected. As a result, there is a further need in the art for effective detection and tracking of small and large vessels, vessel fingerprinting, and cross-camera association and handoff.
One type of technique employed in the prior art of computer vision for detecting and tracking moving or still objects is viewpoint-invariant object matching. As used herein, the term “viewpoint-invariant” refers to matching the same or different objects, viewed using the same or different cameras in still images over time, in which the object being matched or tracked may have different poses from one image to another. The object being tracked may have a small or large amount of tilt, orientation, or scaling differences relative to the same object from one image to another, i.e., different points of view. Prior art viewpoint-invariant object matching methods and systems have been configured to adopt 3D models in matching procedures to provide pose-invariant distance measures by applying pose-invariant features such as scale-invariant feature transform (SIFT), by dividing pose space, and by handling SIFT features with pose-specific recognizers.
Compared with other object categories, however, (small) vessel identification presents a number of challenges to applying the aforementioned prior art pose-invariant matching approach. There are a relatively large number of different types of vessels with unique designs. There is a high degree of variation in vessel size, motion, and shape. Under viewpoint changes due to wakes, waves, etc., it is difficult to obtain stable images. In addition, vessels are typically observed from a large distance so that truthful 3D reconstruction is not available in practice, thereby limiting the applicability of prior art 3D model-based pose inference or matching methods. Additional difficulties arise when vessels are observed over a large dynamic range of viewpoints, typically far away from cameras. As a result, there may be insufficient resolution for matching under wide variations in target object appearance due to large scale changes. As opposed to vehicle monitoring applications where target objects stay in confined viewpoints, individual vessels may take arbitrary paths, and are thus captured in a wide variety of poses.
The aforementioned problems with view-invariant object matching have been addressed in the vision community with a focus on various aspects. At the feature level, there are popular descriptors that possess scale and rotation invariance, such as SIFT as described in D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” IJCV, 60(2):91-110, 2004, histogram of oriented gradients (HoG) as described in N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” Proc. IEEE Conf. on Comp. Vision and Patt. Recog., pages 886-893, Washington, D.C., USA, 2005, IEEE Computer Society, and affine-invariant interest point detectors as described in K. Mikolajczyk and C. Schmid, “Scale & affine invariant interest point detectors,” IJCV, 60(1):63-86, 2004. View invariance in object representation may be obtained by a parts-based representation, in which an object is represented by a constellation of parts to remove view-dependent geometry, as described in R. Fergus, P. Perona, and A. Zisserman, “Object class recognition by unsupervised scale-invariant learning,” CVPR, volume 2, pages II-264-II-271 vol. 2, June 2003 and in M. Fritz, B. Leibe, B. Caputo, and B. Schiele, “Integrating representative and discriminant models for object category detection,” ICCV, volume 2, pages 1363-1370 Vol. 2, October 2005. Generic descriptors, however, do not provide sufficient discriminability for refined object matching and often produce very sparse feature sets, which is also the case with parts composition-based matching approaches.
Object variation from pose changes may be removed by employing 3D model-based pose inference and matching as described in J. Liebelt, C. Schmid, and K. Schertler, “Viewpoint independent object class detection using 3d feature maps,” CVPR, pages 1-8, June 2008 (hereinafter “Liebelt et al.”) and in Y. Guo, Y. Shan, H. Sawhney, and R. Kumar, “Peet: Prototype embedding and embedding transition for matching vehicles over disparate viewpoints,” CVPR, pages 1-8, June 2007 (hereinafter “Guo et al.”). Synthetic 3D object models can provide a very strong cue for resolving pose dependency by discovering partial geometry as described in S. Savarese and L. Fei-Fei, “3d generic object categorization, localization and pose estimation,” CVPR, pages 1-8, October 2007 or object pose as described in Guo et al. To obtain discriminability, Liebelt et al. adopted image-based descriptors for object class detection. Guo et al. exploit 3D models to obtain view-normalized exemplar distances for pose-invariant vehicle matching. View invariance can also be handled by learning pose-dependent object variation. For example, in the face recognition literature, such techniques include actively learning pose-induced variation by trying to learn patch-based view alignments as described in A. Ashraf, S. Lucey, and T. Chen, “Learning patch correspondences for improved viewpoint invariant face recognition,” CVPR, pages 1-8, June 2008, by statistically learning pose-invariant features as described in D. Pramadihanto, H. Wu, and M. Yachida, “Face recognition from a single view based on flexible neural network matching,” Robot and Human Communication, 5th IEEE International Workshop on, pages 329-334, November 1996, and by learning the distribution of patch deformation space as described in S. Lucey and T. Chen, “Learning patch dependencies for improved pose mismatched face verification,” CVPR, June 2006.
As opposed to learning warping functions directly in the image space, it is desirable to learn view warping in feature space to maintain better discriminability at the feature level. In this spirit, PEET as described in Guo et al. comes the closest to fulfilling this goal. However, unlike Guo et al., it is additionally desirable to explicitly enforce embedded distances to reside on a smooth surface to simplify the determination of the degree of warping between images having different poses.
Accordingly, what would be desirable, but has not yet been provided, is a method for object matching and identification across multiple categories of different versions of the same object type, such as a vessel, under viewpoint changes that overcomes the deficiencies in the aforementioned prior art methods.