Image-based retrieval may be used for finding a text document in a database, given a query image of the document. An example of a query image is a webcam-captured image of a hard copy of the document. In an office environment, it is common to have a hard copy of a document that needs to be edited, or a single page from a document for which the remainder is needed. Image-based retrieval may be used to efficiently retrieve the original electronic document without requiring the corresponding hard copy document to contain barcodes or filenames.
The process of retrieving an image consists of processing a query image to extract information, referred to as “features” of the image, which can be used to identify the matching entry in a database index. The information is extracted in the form of a “feature descriptor”, which is usually a vector of numerical values describing the feature. Various feature types have been used for image retrieval, including SIFT (Scale Invariant Feature Transform) and LLAH (Locally Likely Arrangement Hashing). Different feature types are configured to work with particular content types and particular kinds of expected noise.
The performance of different feature types can be characterised in terms of accuracy of retrieval, speed of registration of images into a database, speed of retrieval of images from the database, and size of an associated database index. Databases of millions or billions of images exist, and more will be created over time. Methods for efficient and accurate retrieval and storage are needed to cope with the proliferation.
Retrieval of a text document from a database, given a query image, is usually performed by processing the query image to produce feature descriptors, and using the feature descriptors to look up a corresponding document in a database index.
A query document may include significant distortion, such as perspective distortion, crumpling, occlusion, noise and cropping. For this reason, features used for document retrieval are typically “local”, meaning that the features only describe a part of the image. Even with distortion such as cropping and occlusion, some of the local features might still match. The features themselves are also usually configured to be invariant to certain types of distortion, such as affine geometric distortion or changes in illumination.
Features used for text document retrieval typically depend on shapes and relative positions of whole words, since optical character recognition may be impossible due to distortions of the query image. A good feature for text-based image retrieval is highly discriminative, such that the feature describes the image in a way that makes correct matches likely and incorrect matches unlikely. A good feature is also quick to calculate, and requires minimal storage. For example, a “locally likely arrangement hashing (LLAH)” feature extractor uses relative positions of words. The LLAH method is able to differentiate millions of pages, but suffers from poor memory efficiency.
A first step in extracting LLAH image features is to identify feature points, such that each LLAH feature point corresponds to the centre of a word. To create a feature descriptor for a feature point, positions of neighbouring feature points are used. Seven points are chosen to create a feature descriptor, excluding the point being described. A simple approach to selecting the feature points would be to choose seven nearest neighbours.
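The simple nearest-neighbour selection described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name `nearest_neighbours`, the brute-force sort, and the grid of word centres are assumptions made for the example.

```python
import math

def nearest_neighbours(points, index, k=7):
    """Return the k points closest to points[index], excluding that point itself."""
    px, py = points[index]
    others = [p for i, p in enumerate(points) if i != index]
    # Brute-force sort by Euclidean distance (an assumption; a spatial index
    # would normally be used for a full page of words).
    others.sort(key=lambda q: math.hypot(q[0] - px, q[1] - py))
    return others[:k]

# Hypothetical word centres laid out on a 4 x 3 grid.
word_centres = [(float(x), float(y)) for y in range(3) for x in range(4)]
neighbours = nearest_neighbours(word_centres, 5)
```

The point being described is excluded before sorting, so the seven returned points are all neighbours, as the text requires.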
A feature descriptor may be generated from the positions of the seven points by taking all combinations of four points from the seven, and calculating a single value for each group of four points. The calculated value is a ratio of the areas of two triangles formed from the four points, and is useful as part of a feature descriptor due to its robustness to affine distortion of the input image. This is because areas of shapes are invariant to rotation, translation and shear, while the ratio of areas is invariant to scaling. However, if any three of the nearby points are collinear, the feature descriptor for the feature point will be invalid: one of the triangle areas is then zero, so the ratio may require a division by zero and be undefined. Since there are thirty-five (35) combinations of four points from seven, a feature descriptor contains thirty-five (35) of the calculated values, which are ordered according to an arbitrary choice of starting point and the rotational ordering of the seven neighbouring points, relative to the feature point. The feature descriptor generated by the above process is one of seven possible feature descriptors, depending on which neighbouring point was chosen as the starting point.
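The calculation above can be sketched in a few lines. The text does not say which two triangles are used for each group of four points, so the choice area(a, c, d) / area(a, b, c) below is an assumption, as are the function names, the sample points and the affine map used for the check.

```python
from itertools import combinations
import math

def triangle_area(a, b, c):
    """Unsigned area of triangle abc, from the 2-D cross product."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0

def area_ratio(a, b, c, d):
    """Assumed invariant: area(a, c, d) / area(a, b, c); None when undefined."""
    denominator = triangle_area(a, b, c)
    if denominator == 0.0:
        return None  # a, b and c are collinear, so the ratio is undefined
    return triangle_area(a, c, d) / denominator

def descriptor(ordered_points):
    """One value per combination of four points: C(7, 4) = 35 values for seven points."""
    return [area_ratio(*group) for group in combinations(ordered_points, 4)]

def affine(p):
    """A hypothetical affine distortion (scale, shear, rotation and translation mixed)."""
    x, y = p
    return (2.0 * x + 0.5 * y + 3.0, -0.3 * x + 1.5 * y - 1.0)

# Hypothetical neighbours in rotational order; no three are collinear.
neighbours = [(0, 0), (4, 1), (5, 4), (3, 6), (0, 7), (-2, 5), (-3, 2)]
original = descriptor(neighbours)
distorted = descriptor([affine(p) for p in neighbours])
```

Comparing `original` with `distorted` element by element (for example with `math.isclose`) shows the values agreeing up to floating-point error, because an affine map scales every triangle area by the same factor and the factor cancels in the ratio.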
The rotational ordering and use of all combinations of points allow the feature descriptor to capture the same values for a query image feature point as for the original image feature point. However, when features are extracted from a query image, the choice of starting point may not be the same as for the database image. To improve robustness, a feature descriptor is produced for each of the seven possible starting points in the rotational ordering.
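The seven candidate orderings can be illustrated with a short sketch. The function names are hypothetical, and sorting by ascending `atan2` angle is an assumed convention for the rotational ordering; the text does not state the direction of rotation.

```python
import math

def rotational_order(centre, neighbours):
    """Sort neighbours by their angle around the feature point (assumed convention)."""
    cx, cy = centre
    return sorted(neighbours, key=lambda p: math.atan2(p[1] - cy, p[0] - cx))

def cyclic_rotations(ordered):
    """One ordering per possible starting point, preserving the rotational order."""
    return [ordered[s:] + ordered[:s] for s in range(len(ordered))]

# Hypothetical feature point and its seven neighbours.
centre = (0, 0)
neighbours = [(1, 0), (0, 1), (-1, 0), (0, -1), (1, 1), (-1, 1), (-1, -1)]
ordered = rotational_order(centre, neighbours)
rotations = cyclic_rotations(ordered)
```

Each of the seven lists in `rotations` would be passed to the descriptor calculation, giving one descriptor per starting point.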
As a result, the LLAH descriptor is highly informative. The individual feature values are robust to affine distortions of the image. However, the feature descriptors are not robust to errors in feature point detection, which change the way points are used in the calculations, nor are the feature descriptors robust to rotation, due to the arbitrary starting point. The feature descriptors also contain a high proportion of redundancy, since the feature descriptors consist of thirty-five (35) values calculated from the coordinates of seven (7) points. The redundancy results in high storage requirements.
An extension to the LLAH method adds rotational invariance. The extended method is structurally the same as the original LLAH method, with feature descriptors calculated in the same way. The difference is that rather than using an arbitrary neighbouring point as a rotational starting point, the extension to LLAH uses a standard starting point. The standard starting point allows the database to be probed a single time for each collection of neighbouring points used to create a descriptor, rather than once for each point taken as a starting point. One method for choosing the standard starting point is to take each point in turn, and to calculate the LLAH invariant value for that point in combination with the three following points. The point producing the highest value is used as the rotational starting point, with the values for subsequent points used for tie-breaking. Using the highest value allows the database to be probed only once, but retrieval then depends on the same starting point being selected for the query image as for the database image. In practice, the starting point selection is highly inaccurate when the query image is affinely distorted, which undermines robustness to affine distortion, the most significant advantage of the original LLAH method.
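A minimal sketch of this starting point selection follows. It assumes the invariant is the triangle area ratio area(a, c, d) / area(a, b, c); the function names and sample points are hypothetical. Tie-breaking is done by comparing the whole sequence of values that follows each candidate start.

```python
def triangle_area(a, b, c):
    """Unsigned area of triangle abc, from the 2-D cross product."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0

def area_ratio(a, b, c, d):
    """Assumed invariant; raises ZeroDivisionError if a, b and c are collinear."""
    return triangle_area(a, c, d) / triangle_area(a, b, c)

def standard_start(ordered):
    """Index of the standard starting point in a rotationally ordered point list."""
    n = len(ordered)
    # Invariant for each point combined with the three points that follow it.
    values = [area_ratio(*[ordered[(i + j) % n] for j in range(4)]) for i in range(n)]
    # Highest value wins; ties fall through to the values of subsequent points,
    # because the rotated value lists are compared lexicographically.
    return max(range(n), key=lambda s: values[s:] + values[:s])

# Hypothetical neighbours in rotational order; no three are collinear.
ring = [(0, 0), (4, 1), (5, 4), (3, 6), (0, 7), (-2, 5), (-3, 2)]
start_point = ring[standard_start(ring)]
```

With exact point positions, the same point is selected whichever neighbour the list happens to begin with, which is what makes a single probe possible. The sketch does not model noisy point positions, where the maximum can move to a different point.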
Another example of a document retrieval feature is the “discrete point based signature”. Discrete point based signatures differ from LLAH in that a preceding step is added to correct perspective distortion caused by the position of a camera relative to the document. After the perspective distortion has been removed, a step known as deskewing, the discrete point based signature method finds a set of nearest neighbour points, orders the set of points according to radial distance, and then measures the angle between the centre and each of the nearest neighbour points. The use of absolute angles requires high accuracy in the deskew step, and becomes inaccurate when there are complex local distortions.
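The dependence on an accurate deskew can be seen in a small sketch. This is only an illustration of the general idea described above, not the published method: the function name, the neighbour count `k`, and the use of `atan2` for absolute angles are all assumptions.

```python
import math

def angle_signature(centre, points, k=4):
    """Absolute angles to the k nearest points, ordered by radial distance."""
    cx, cy = centre
    nearest = sorted(points, key=lambda p: math.hypot(p[0] - cx, p[1] - cy))[:k]
    return [math.atan2(p[1] - cy, p[0] - cx) for p in nearest]

def rotate(p, theta):
    """Rotate point p about the origin by theta radians."""
    x, y = p
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

# Hypothetical neighbouring points around a centre at the origin.
points = [(1.0, 0.0), (0.0, 2.0), (-2.0, -2.0), (3.0, -1.0)]
residual_skew = 0.1  # radians of page rotation left after an imperfect deskew
reference = angle_signature((0.0, 0.0), points)
skewed = angle_signature((0.0, 0.0), [rotate(p, residual_skew) for p in points])
```

Every angle in `skewed` differs from the corresponding angle in `reference` by the residual skew, so any error in the deskew step carries directly into the descriptor.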
A further example of a document retrieval feature is the brick wall coding (BWC) feature type. BWC also uses a deskew step, but its features use normalized lengths of words rather than angles. Word length is more robust than angle to local distortion. However, the BWC method still relies on an accurate estimation of page angle in order to deskew the page.