1. Technical Field
The present invention is directed to methods of detecting and describing features from an intensity image.
2. Background Information
Many tasks in the processing of images taken by a camera, such as in augmented reality applications and computer vision, require finding points or features in multiple images of the same object or scene that correspond to the same physical 3D structure. A common approach, e.g. as in SIFT (as disclosed in D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vision 60, 2 (November 2004), pp. 91-110 (“Lowe”)), is to first detect features in an image with a method that has a high repeatability. This means that the probability is high that the point corresponding to the same physical 3D surface is chosen as a feature for different viewpoints, different rotations and illumination settings.
A feature is a salient element in an image which can be a point (often referred to as keypoint or interest point), a line, a curve, a connected region or any set of pixels. Features are usually extracted in scale space, i.e. at different scales. Therefore, each feature has a repeatable scale in addition to its two-dimensional position in the image. Also, a repeatable orientation (rotation) is usually computed from the intensities of the pixels in a region around the feature, e.g. as the dominant direction of intensity gradients.
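The orientation step described above can be illustrated with a minimal sketch that accumulates a magnitude-weighted histogram of gradient angles over a patch and returns the center of the strongest bin. The function name, the bin count of 36, and the use of central differences are assumptions made for illustration only and are not taken from Lowe.

```python
import math

def dominant_orientation(patch, num_bins=36):
    """Estimate the dominant gradient direction of a patch, in radians in [0, 2*pi).

    patch is a list of rows of intensities. Gradients are central differences
    on interior pixels; each pixel votes into an angle histogram, weighted by
    its gradient magnitude. A flat patch has no meaningful orientation, in
    which case the center of bin 0 is returned by default.
    """
    hist = [0.0] * num_bins
    rows, cols = len(patch), len(patch[0])
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = (patch[y][x + 1] - patch[y][x - 1]) / 2.0
            gy = (patch[y + 1][x] - patch[y - 1][x]) / 2.0
            mag = math.hypot(gx, gy)
            if mag == 0.0:
                continue
            angle = math.atan2(gy, gx) % (2.0 * math.pi)
            b = int(angle / (2.0 * math.pi) * num_bins) % num_bins
            hist[b] += mag
    best = max(range(num_bins), key=lambda i: hist[i])
    # Return the center angle of the strongest bin.
    return (best + 0.5) * 2.0 * math.pi / num_bins
```

The result is quantized to the bin width, so a finer histogram or interpolation between bins would be needed where a more precise orientation is required.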
Finally, to enable comparison and matching of features, a feature descriptor is needed. Common approaches use the computed scale and orientation of a feature to transform the coordinates of the descriptor, which provides invariance to rotation and scale. Eventually, the descriptor is an n-dimensional vector, which is usually constructed by concatenating histograms of functions of local image intensities, such as gradients disclosed in Lowe.
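As a hedged illustration of how scale and orientation can be used to transform descriptor coordinates, the sketch below maps a sampling offset given in a normalized descriptor frame to image coordinates by rotating it by the feature orientation and stretching it by the feature scale. The function and parameter names are hypothetical, and the sketch assumes a simple similarity transform rather than the exact sampling scheme of any particular descriptor.

```python
import math

def descriptor_offset_to_image(dx, dy, x, y, scale, orientation):
    """Map an offset (dx, dy) in the descriptor frame to image coordinates.

    (x, y) is the feature position, scale its computed scale, and
    orientation its computed rotation in radians.
    """
    c, s = math.cos(orientation), math.sin(orientation)
    # Rotate the offset by the orientation, stretch by the scale, then translate.
    return (x + scale * (c * dx - s * dy),
            y + scale * (s * dx + c * dy))
```

Because every sample position is expressed relative to the feature's own scale and orientation, the same physical neighborhood is sampled when the image is rotated or resized, which is the source of the invariance mentioned above.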
FIG. 1 outlines a standard approach of feature detection and description in a flow diagram. First, in step S11, an intensity image I is captured by a camera or loaded, and is then optionally subjected to pre-processing in step S12. Then, after a scale space or a set of discrete scales has been defined in step S13, features are detected in that scale space in step S14, and their canonical orientation (rotation) is computed and stored with the scale and position of every feature in the image. The detected features are designated F(s,x,y,o), with s designating the scale, x, y the 2-dimensional position, and o the orientation of the feature F. The extracted features are then described in step S15, with v designating the descriptor, before they are eventually used in an application (step S16).
Limitations of the Standard Approaches:
A strong limitation of any two-dimensional computer vision method is that it operates in a projected space. This makes it impossible to distinguish scale resulting from the distance of an object to the camera from scale resulting from the actual physical scale of an object.
Invariance to scale resulting from the distance of the camera to an object is clearly desirable in many applications, and was the original motivation for scale-invariance. However, in the presence of similar features at different physical scales, invariance to scale makes them indistinguishable. For instance, a descriptor as described in Lowe would not be able to distinguish between a real building and a miniature model of it.
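The ambiguity between the building and its miniature can be made concrete with a small numerical sketch. It assumes an ideal pinhole camera and a fronto-parallel object, for which the imaged size is proportional to the physical size divided by the distance. The focal length and the object dimensions below are hypothetical values chosen only for illustration.

```python
def projected_size_px(physical_size_m, distance_m, focal_length_px):
    # Pinhole approximation for a fronto-parallel object:
    # the imaged size scales with focal length times size over distance.
    return focal_length_px * physical_size_m / distance_m

FOCAL_LENGTH_PX = 800.0  # hypothetical focal length in pixels

# A 20 m building seen from 100 m and a 0.2 m model seen from 1 m.
building = projected_size_px(20.0, 100.0, FOCAL_LENGTH_PX)
miniature = projected_size_px(0.2, 1.0, FOCAL_LENGTH_PX)
```

Both objects occupy the same extent in the image, so neither the intensities nor a scale computed from them can reveal which of the two was imaged.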
In addition, approaches that provide scale-invariance by computing a repeatable scale of a feature from image intensities depend heavily on the accuracy and repeatability of this computed scale.
Already Proposed Solutions:
Most naïve approaches to comparing features that use a similarity or distance function on a patch around each feature, e.g. Normalized-Cross-Correlation (NCC) or Sum-of-Absolute-Differences (SAD), are able to distinguish between similar features at different scales. However, these techniques are not invariant to scale resulting from the distance between the camera and an object, which is clearly desirable in real world applications. This means that they would be able to distinguish a real building from a miniature model, but they are not able to match either the building or the miniature model from different distances.
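The two patch measures named above can be sketched as follows. This is a minimal illustration operating on flattened patches of equal length, and the stripe pattern used in the example is a hypothetical stand-in for the same texture imaged at two different scales.

```python
import math

def sad(a, b):
    """Sum of absolute differences between two equally sized patches."""
    return sum(abs(p - q) for p, q in zip(a, b))

def ncc(a, b):
    """Normalized cross-correlation in [-1, 1]; 0.0 if either patch is flat."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [p - mean_a for p in a]
    db = [q - mean_b for q in b]
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if den == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(da, db)) / den

# The same stripe texture sampled in a fixed window at two image scales.
stripes = [0, 10, 0, 10, 0, 10, 0, 10]
stripes_2x = [0, 0, 10, 10, 0, 0, 10, 10]
```

Comparing `stripes` with itself gives a SAD of zero and an NCC of one, whereas comparing it with `stripes_2x` gives a large SAD and an NCC near zero. A fixed-size patch therefore separates the two scales, and for the same reason it fails to match one object seen from two distances.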
Approaches exist that work on combined range-intensity data. In addition to an intensity image, they make use of a range map that contains dense depth information associated with the intensity image. The depth of a pixel refers to the distance between the optical center of the capturing device and the physical 3D surface that is imaged in that pixel.
FIG. 2 shows a scene consisting of two sets of dolls S1 and S2 (each set comprising a tall and a small doll), and a capturing device CD. A physical point PP1 of the set S1 is imaged in the pixel IP1 with the capturing device. The depth of this pixel is D1, the distance between the optical center OC of the capturing device, which defines the origin of the camera coordinate system, and the physical point PP1. Analogously, a second physical point PP2 of the set S2 is imaged in IP2 and has the depth D2. Note that an estimate of the camera intrinsic parameters (in particular focal length) allows for computing the 3D position in Cartesian coordinates of a point PP1 given its depth D1 and its pixel position on the image plane IP1.
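The computation mentioned in the note above can be sketched as follows, assuming an ideal pinhole camera without lens distortion. The parameter names fx, fy (focal lengths in pixels) and cx, cy (principal point in pixels) are assumptions introduced for this illustration. Because the depth is defined here as the distance from the optical center to the physical point, the viewing ray is normalized before it is scaled by the depth.

```python
import math

def backproject(u, v, depth, fx, fy, cx, cy):
    """Return the 3D point (X, Y, Z) in camera coordinates for pixel (u, v).

    depth is the Euclidean distance from the optical center to the imaged
    surface point, not the Z coordinate.
    """
    # Direction of the viewing ray through the pixel, with Z fixed to 1.
    dx = (u - cx) / fx
    dy = (v - cy) / fy
    norm = math.sqrt(dx * dx + dy * dy + 1.0)
    # Scale the unit ray so that its length equals the measured depth.
    return (depth * dx / norm, depth * dy / norm, depth / norm)
```

If a sensor instead reports depth as the Z coordinate along the optical axis, the normalization would be omitted and the ray (dx, dy, 1) multiplied by the depth directly.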
E. R. Smith, C. V. Stewart, and R. J. Radke, Physical Scale Intensity-Based Keypoints, 5th Intl Symposium on 3D Data Processing, Visualization and Transmission, May 2010 (“Smith”) present a method to detect and describe features at physical scale from combined range-intensity data that is illustrated in FIG. 4. Given an intensity image and a registered dense depth map (step S41), they first compute a normal for every point in the depth map in step S42. They then project the intensity image pixels onto the tangent planes of the depth points (step S43) and triangulate the back-projected intensity pixels resulting in an image mesh in step S44. All following steps are performed on this image mesh, which is a 3D mesh at physical scale with intensity information for every vertex.
The authors then define a set of physical scales at which to detect features from the mesh (step S45). A smoothing kernel which computes contribution weights based on the distance of a point and the normal of the point is used to smooth both intensities and normals in the image mesh to different physical scale spaces (step S46). Local extrema of the Laplace-Beltrami Operator on the smoothed image meshes are used as features (step S47). For feature description, they use a 3D coordinate frame for each feature defined by its normal and the dominant gradient direction of neighboring pixel intensities projected onto the tangent plane (steps S48, S49). This 3D coordinate frame is used to transform the coordinates of their feature descriptor to provide invariance to rotation and viewpoint (step S410). Eventually, an application uses the described features in step S411.
While this approach clearly improves the feature description process, the creation and processing of the image mesh is very costly and requires dense depth data. After the meshing process, step S48 is another costly step.
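To give an impression of the per-point work involved in step S42, the sketch below estimates a surface normal from a 3D point and two of its neighbors in the depth map using a cross product. This is one common generic technique and is not claimed to be the method used in Smith; the function name and the convention of orienting the normal toward the camera are assumptions.

```python
import math

def normal_from_neighbors(p, p_right, p_down):
    """Unit normal of the plane through three neighboring 3D points.

    Points are (X, Y, Z) tuples in camera coordinates. The normal is flipped
    if necessary so that it faces the camera at the origin. Returns None if
    the three points are collinear or coincide.
    """
    ax, ay, az = (p_right[i] - p[i] for i in range(3))
    bx, by, bz = (p_down[i] - p[i] for i in range(3))
    # Cross product of the two in-surface difference vectors.
    nx = ay * bz - az * by
    ny = az * bx - ax * bz
    nz = ax * by - ay * bx
    length = math.sqrt(nx * nx + ny * ny + nz * nz)
    if length == 0.0:
        return None
    nx, ny, nz = nx / length, ny / length, nz / length
    # Orient the normal toward the camera (origin of the camera frame).
    if nx * p[0] + ny * p[1] + nz * p[2] > 0.0:
        nx, ny, nz = -nx, -ny, -nz
    return (nx, ny, nz)
```

Repeating such a computation for every depth sample, followed by projection and triangulation, indicates why a dense depth map is required and why the overall pipeline is costly.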
A similar approach has been proposed by Wu et al. (e.g., C. Wu, B. Clipp, X. Li, J.-M. Frahm, and M. Pollefeys. 3D model matching with Viewpoint-Invariant Patches (VIP), Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, pp. 1-8, 2008), wherein they use the known local geometry around a feature candidate to compute the tangent plane and carry out feature detection and description on a projection of the textured 3D model onto that tangent plane. However, their descriptor is scale-invariant and therefore does not provide the benefits of the proposed technique.
Note that throughout this disclosure, the terms “physical scale” and “real scale” are interchangeable.
Related work on feature descriptors on range data:
There exists a variety of literature on the extraction and description of features in range images, e.g. as disclosed in J. Stückler and S. Behnke, Interest Point Detection in Depth Images through Scale-Space Surface Analysis. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China, May 2011 (“Stückler”), and T.-W. R. Lo and J. P. Siebert, Local feature extraction and matching on range images: 2.5D SIFT. Comput. Vis. Image Underst. 113, 12, pp. 1235-1250, 2009 (“Lo”). These approaches work entirely on range images and do not use any intensity image. Range images encode, per pixel, the distance of the environment to the camera. It is possible to display range images by representing distances as intensities, but the origin of the data remains different. Range images can be created in many different ways when either range data or 3D data exists. When we speak of intensity images throughout this disclosure, we refer to images representing different amounts of light reflected from the environment, mostly depending on the environment's material and the lighting conditions. Intensity images can encode intensity in one channel (e.g. grayscale) or in more than one channel (e.g. RGB, i.e. red-green-blue) at different bit resolutions (e.g. 8 bit or high dynamic range).
Stückler exploits scale space on range images to detect scale-invariant features. Consequently, their approach does not provide distinctiveness of similar features at different scales. Similarly, 2.5D SIFT (e.g., as disclosed in Lo) is an adaptation of SIFT to range images without any intensity data. This scale-invariant feature detector and descriptor computes for every feature the surface normal and the dominant gradient direction in the range data around the feature to define a 3D canonical orientation for every feature that is used to transform its descriptor. The latter then computes histograms of shape indices in the support region of the descriptor.
Any naïve approach to describing features in range data that is not scale-invariant enables matching of a feature at different distances and discrimination of similar features at different scales.
Such feature descriptors that solely use range images work well for a variety of scenes and objects. However, man-made objects and environments mainly consist of piecewise planar surfaces. While planar surfaces do not contain any useful information for a distinct description, edges and corners in man-made environments are very often perpendicular and highly repetitive. In such cases, the texture, visible in the intensity image, can often provide more distinct information about a feature.
It is an object of the present invention to provide a method of detecting and describing features from an intensity image which is invariant to scale resulting from the distance between the camera and the object, but is sensitive to the real (physical) scale of an object for a variety of applications.