1. Field
One feature relates to computer vision, and more particularly, to methods and techniques for improving recognition and retrieval performance, processing, and/or compression of images.
2. Background
Various applications may benefit from having a machine or processor that is capable of identifying objects in a visual representation (e.g., an image or picture). The field of computer vision attempts to provide techniques and/or algorithms that permit identifying objects or features in an image, where an object or feature may be characterized by descriptors identifying one or more keypoints. These techniques and/or algorithms are often also applied to face recognition, object detection, image matching, 3-dimensional structure construction, stereo correspondence, and/or motion tracking, among other applications. Generally, object or feature recognition may involve identifying points of interest (also called keypoints) in an image for the purpose of feature identification, image retrieval, and/or object recognition. Preferably, the keypoints may be selected and/or processed such that they are invariant to image scale changes and/or rotation and provide robust matching across a substantial range of distortions, changes in point of view, and/or noise and changes in illumination. Further, in order to be well suited for tasks such as image retrieval and object recognition, the feature descriptors may preferably be distinctive in the sense that a single feature can be correctly matched with high probability against a large database of features from a plurality of target images.
After the keypoints in an image are detected and located, they may be identified or described by using various descriptors. For example, descriptors may represent the visual features of the content in images, such as shape, color, texture, rotation, and/or motion, among other image characteristics. A descriptor may represent a keypoint and the local neighborhood around the keypoint. The goal of descriptor extraction is to obtain a robust, noise-free representation of the local information around keypoints. This may be done by projecting the descriptor to a noise-free Principal Component Analysis (PCA) subspace. PCA involves an orthogonal linear transformation that transforms data (e.g., keypoints in an image) to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate (second principal component), and so on. However, such projection to a PCA subspace requires computationally complex inner products with high-dimensional projection vectors.
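By way of a non-limiting illustration, the PCA projection described above may be sketched as follows. The sketch assumes a toy set of 2-dimensional samples and uses power iteration to estimate only the first principal component. The function names (mean_center, covariance, first_principal_component, project) and the sample values are hypothetical and are not taken from any particular implementation.

```python
import math

def mean_center(rows):
    """Return the mean-centered rows and the per-dimension mean."""
    n, dim = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(dim)]
    centered = [[r[j] - mean[j] for j in range(dim)] for r in rows]
    return centered, mean

def covariance(centered):
    """Sample covariance matrix of mean-centered rows."""
    n, dim = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1)
             for j in range(dim)] for i in range(dim)]

def first_principal_component(cov, iterations=200):
    """Power iteration: approximates the dominant eigenvector of cov."""
    dim = len(cov)
    v = [1.0] * dim  # assumed start vector; must not be orthogonal to the answer
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def project(vector, mean, component):
    """Inner product of the mean-centered vector with one projection vector."""
    return sum((x - m) * c for x, m, c in zip(vector, mean, component))

# Illustrative data only: points scattered around the line y = x.
samples = [[2.0, 1.9], [0.0, 0.1], [-2.0, -2.1], [1.0, 1.1], [-1.0, -1.0]]
centered, mean = mean_center(samples)
cov = covariance(centered)
component = first_principal_component(cov)
scores = [project(s, mean, component) for s in samples]
print("first principal component:", component)
print("projected coordinates:", scores)
```

Each call to project is a single inner product whose length equals the input dimensionality. Projecting to a d-dimensional subspace repeats that inner product d times per descriptor, which is the source of the cost noted above.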
The individual features corresponding to the keypoints and represented by the descriptors are matched to a database of features from known objects. Therefore, a correspondence searching system can be separated into three modules: keypoint detector, feature descriptor, and correspondence locator. Among these three logical modules, the descriptor's construction complexity and dimensionality have a direct and significant impact on the performance of the feature matching system. A variety of descriptors have been proposed, each having different advantages. Scale invariant feature transform (SIFT) opens a 12σ×12σ patch that is aligned with the dominant orientation in the neighborhood and sized proportionally to the scale level σ of the detected keypoint. The gradient values in this region are summarized in a 4×4 grid of cells with an 8-bin orientation histogram in each cell. PCA-SIFT showed that gradient values in the neighborhood can be represented in a very small subspace.
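As a non-limiting illustration, the 4×4 grid of 8-bin orientation histograms described above may be sketched as follows; 4×4 cells with 8 bins each give a 128-element vector. This is a simplified sketch and not the full SIFT procedure: it assumes the patch has already been aligned and resampled to 16×16 samples, and it omits Gaussian weighting, interpolation between bins and cells, and normalization. The function name grid_orientation_histogram and the constant names are hypothetical.

```python
import math

CELLS = 4  # the patch is divided into a 4x4 grid of cells
BINS = 8   # each cell holds an 8-bin orientation histogram

def grid_orientation_histogram(dx, dy):
    """Summarize gradients of a square patch as CELLS*CELLS*BINS values.

    dx, dy: square 2-D lists of x and y gradient components whose side
    length is divisible by CELLS (an assumption of this sketch).
    """
    side = len(dx)
    cell_size = side // CELLS
    hist = [0.0] * (CELLS * CELLS * BINS)
    for row in range(side):
        for col in range(side):
            gx, gy = dx[row][col], dy[row][col]
            magnitude = math.hypot(gx, gy)
            # Map the gradient direction into [0, 2*pi) and pick a bin.
            angle = math.atan2(gy, gx) % (2 * math.pi)
            bin_index = int(angle / (2 * math.pi) * BINS) % BINS
            cell = (row // cell_size) * CELLS + (col // cell_size)
            # Each sample votes for its orientation bin, weighted by magnitude.
            hist[cell * BINS + bin_index] += magnitude
    return hist

# Illustrative input: a 16x16 patch whose gradient points along +x everywhere.
side = 16
dx = [[1.0] * side for _ in range(side)]
dy = [[0.0] * side for _ in range(side)]
descriptor = grid_orientation_histogram(dx, dy)
print(len(descriptor))
```

With this uniform input, every cell places all of its votes in orientation bin 0, so the example is mainly useful for checking the indexing of cells and bins.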
Most of the descriptor extraction procedures agree on the advantages of dimensionality reduction to eliminate noise and improve recognition accuracy. However, the large computational complexity associated with projecting the descriptors to a low-dimensional subspace hinders its practical use. For instance, the PCA-SIFT patch size is 39×39, which results in 2×39²-dimensional (i.e., 3042-dimensional) projection vectors when the gradient values in both the x and y directions are considered. Hence, each descriptor in the query image requires 2×39²×d multiplications and additions for a projection to a d-dimensional subspace. While this may not generate significant inefficiency for powerful server-side machines, it may be a bottleneck in implementations with limited processing resources, such as mobile phones.
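The operation count above may be illustrated with a short calculation. The sketch reads the example as a 39×39 patch with two gradient values (x and y) per sample, which gives 2×39² inputs per descriptor. The subspace dimension d = 20 is an arbitrary value chosen for illustration only, and the function name projection_cost is hypothetical.

```python
def projection_cost(patch_side, subspace_dims, gradient_channels=2):
    """Return (input dimensionality, multiply-add count per descriptor)."""
    input_dims = gradient_channels * patch_side ** 2
    # One inner product of length input_dims per output dimension.
    return input_dims, input_dims * subspace_dims

# d = 20 is an assumed, illustrative subspace dimension.
input_dims, per_descriptor = projection_cost(patch_side=39, subspace_dims=20)
print(input_dims)      # 3042
print(per_descriptor)  # 60840
```

The count applies to every descriptor in the query image, so the total scales linearly with the number of detected keypoints.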
Such feature descriptors are increasingly finding applications in real-time object recognition, 3D reconstruction, panorama stitching, robotic mapping, video tracking, and similar tasks. Depending on the application, transmission and/or storage of feature descriptors (or equivalent) can limit the speed of computation of object detection and/or the size of image databases. In the context of mobile devices (e.g., camera phones, mobile phones, etc.) or distributed camera networks, significant communication and processing resources may be spent on descriptor extraction between nodes. The computationally intensive process of descriptor extraction tends to hinder or complicate its application on resource-limited devices, such as mobile phones.
Therefore, there is a need for a way to quickly and efficiently generate local feature descriptors.