Field
One feature relates to computer vision, and more particularly, to methods and techniques for improving recognition and retrieval performance, processing, and/or compression of images.
Background
Various applications may benefit from having a machine or processor that is capable of identifying objects in a visual representation (e.g., an image or picture). The field of computer vision attempts to provide techniques and/or algorithms that permit identifying objects or features in an image, where an object or feature may be characterized by descriptors identifying one or more points (e.g., all pixel points, keypoints of interest, etc.). These techniques and/or algorithms are often also applied to face recognition, object detection, image matching, 3-dimensional structure construction, stereo correspondence, and/or motion tracking, among other applications. Generally, object or feature recognition may involve identifying points of interest in an image for the purpose of feature identification, image retrieval, and/or object recognition. Preferably, the points may be selected and/or processed such that they are invariant to image scale changes and/or rotation and provide robust matching across a substantial range of distortions, changes in point of view, and/or noise and changes in illumination. Further, in order to be well suited for tasks such as image retrieval and object recognition, the feature descriptors may preferably be distinctive in the sense that a single feature can be correctly matched with high probability against a large database of features from a plurality of target images.
For instance, local image computations may be performed using a Gaussian Pyramid to locate the points of interest. A number of computer vision algorithms, such as SIFT (scale invariant feature transform), are used to compute such points and then proceed to extract localized features around them as an initial step towards detecting particular objects in a scene or classifying a queried object based on its features.
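By way of illustration only, a Gaussian Pyramid may be sketched as repeated smoothing followed by subsampling. The sketch below is a simplified assumption rather than the procedure of SIFT or of any particular detector: the function names (`blur_1d`, `blur`, `gaussian_pyramid`), the 5-tap binomial kernel used to approximate a Gaussian, the border clamping, and the factor-of-two subsampling are all hypothetical choices made for the example.

```python
def blur_1d(values):
    """Smooth a 1-D sequence with the binomial kernel [1, 4, 6, 4, 1] / 16.

    The binomial kernel is an assumed stand-in for a Gaussian filter.
    """
    n = len(values)
    smoothed = []
    for i in range(n):
        acc = 0.0
        for offset, weight in zip((-2, -1, 0, 1, 2), (1, 4, 6, 4, 1)):
            j = min(max(i + offset, 0), n - 1)  # clamp indices at the borders
            acc += weight * values[j]
        smoothed.append(acc / 16.0)
    return smoothed


def blur(image):
    """Apply the 1-D kernel along rows, then along columns (separable filter)."""
    rows = [blur_1d(row) for row in image]
    cols = [blur_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def gaussian_pyramid(image, levels):
    """Return a list of images; each level is smoothed and halved in size."""
    pyramid = [image]
    for _ in range(levels - 1):
        smoothed = blur(pyramid[-1])
        # Keep every second row and every second column.
        pyramid.append([row[::2] for row in smoothed[::2]])
    return pyramid


# Usage: a hypothetical 8x8 image of constant intensity.
flat_image = [[10] * 8 for _ in range(8)]
levels = gaussian_pyramid(flat_image, 3)
level_sizes = [(len(level), len(level[0])) for level in levels]
```

A point detector would then search across the levels of such a pyramid for locations that stand out from their neighbors; that search is not shown here.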
After one or more points in an image are detected and located, they may be identified or described by using various descriptors. For example, descriptors may represent the visual features of the content in images, such as shape, color, texture, rotation, and/or motion, among other image characteristics. A descriptor may represent a point and the local neighborhood around the point. The goal of descriptor extraction is to obtain a robust, noise-free representation of the local information around points.
The individual features corresponding to the points and represented by the descriptors are matched to a database of features from known objects. A correspondence searching system can therefore be separated into three modules: point detector, feature descriptor, and correspondence locator. Among these three logical modules, the descriptor's construction complexity and dimensionality have a direct and significant impact on the performance of the feature matching system.
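The role of the correspondence locator, and the reason descriptor dimensionality affects matching performance, may be illustrated with a brute-force nearest-neighbor search. This is a minimal sketch under stated assumptions: the function name `match_descriptors`, the Euclidean distance, and the nearest-to-second-nearest ratio test with a threshold of 0.8 are choices made for the example and are not taken from the text above.

```python
import math


def match_descriptors(query, database, max_ratio=0.8):
    """Match each query descriptor to its nearest database descriptor.

    A match is kept only if the nearest distance is clearly smaller than the
    second-nearest distance (an assumed ratio test; max_ratio is hypothetical).
    Returns a list of (query_index, database_index) pairs.
    """
    matches = []
    for qi, q in enumerate(query):
        # One distance per database entry; each costs O(dimensionality).
        dists = sorted((math.dist(q, d), di) for di, d in enumerate(database))
        best_dist, best_index = dists[0]
        if len(dists) == 1 or best_dist < max_ratio * dists[1][0]:
            matches.append((qi, best_index))
    return matches


# Usage with hypothetical 2-dimensional descriptors.
database = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
query = [(0.5, 0.0), (5.0, 5.0)]
matches = match_descriptors(query, database)
```

In this sketch the first query descriptor is matched to database entry 0, while the second is equidistant from all three entries and is rejected as ambiguous. The work grows with the number of query descriptors, the number of database descriptors, and the number of dimensions per descriptor, which is why a more compact descriptor reduces matching cost.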
Such feature descriptors are increasingly finding applications in real-time object recognition, augmented reality, 3D reconstruction, panorama stitching, robotic mapping, video tracking, and similar tasks. Depending on the application, transmission and/or storage of feature descriptors (or equivalent) can limit the speed of computation of object detection and/or the size of image databases. In the context of mobile devices (e.g., camera phones, mobile phones, etc.) or distributed camera networks, significant communication and processing resources may be spent on extracting descriptors and transferring them between nodes. The computationally intensive process of descriptor extraction tends to hinder or complicate its application on resource-limited devices, such as mobile phones.
A variety of descriptors have been proposed, each having different advantages. Scale invariant feature transform (SIFT) opens a square patch aligned with the dominant orientation (of pixel gradients) in the neighborhood of a point and sized proportionally to the scale level of the detected point. The gradient values in this region are summarized in a grid of cells, with an orientation histogram of a plurality of bins in each cell. Daisy descriptors have shown better and faster matching performance than SIFT in dense matching and patch correspondence problems. An important advantage of Daisy descriptors over SIFT descriptors is that, in constructing a Daisy descriptor, the spatial binning of oriented derivatives is representative of different resolutions. More specifically, the spatial bin size is larger (i.e., coarser) for the bins located further away from the point. Using different resolutions makes Daisy descriptors more robust to rotation and scale changes. However, to compute the spatial binning quickly, Daisy descriptors require additional memory for building a scale-space of three scales for each image derivative. This additional memory needed for storage (relative to SIFT) is an important limitation of the Daisy descriptor algorithm. For instance, three (3) scale levels are needed for each of eight (8) oriented derivatives. When using Daisy descriptors, the total additional memory is therefore 3×8×M×N = 24×M×N bytes for an M×N image (i.e., assuming a one-byte dynamic range for each smoothed pixel). The memory complexity further increases to 24×M×N×S for a scale-space with S scale levels. This limits the extraction of scale-invariant Daisy descriptors, i.e., Daisy descriptors in scale-space.
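The memory figures above may be made concrete with a short calculation. The sketch below simply evaluates 24×M×N×S; the function name `daisy_extra_memory_bytes`, its parameter names, and the 640×480 image size with five scale levels are hypothetical values chosen for illustration.

```python
def daisy_extra_memory_bytes(width, height, scale_space_levels=1,
                             oriented_derivatives=8, smoothing_scales=3,
                             bytes_per_pixel=1):
    """Additional memory for the smoothed oriented-derivative planes.

    Defaults follow the figures stated in the text: 8 oriented derivatives,
    3 smoothing scales each, and one byte per smoothed pixel.
    """
    planes = oriented_derivatives * smoothing_scales  # 8 x 3 = 24 planes
    return planes * width * height * scale_space_levels * bytes_per_pixel


# Hypothetical example: a 640x480 image.
single_scale = daisy_extra_memory_bytes(640, 480)                        # 7,372,800 bytes
five_levels = daisy_extra_memory_bytes(640, 480, scale_space_levels=5)   # 36,864,000 bytes
```

For this assumed image size, the additional memory is roughly 7.4 megabytes at a single scale and roughly 36.9 megabytes for a five-level scale-space, which indicates why the requirement is burdensome on resource-limited devices.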
Therefore, a method is needed to reduce the amount of memory needed to generate and/or store Daisy descriptors in scale space.