Many approaches have been proposed to address the problem of content-based image searching, particularly when a database of images is large, and when a query image is a distorted version of the requested database image.
Many of the proposed approaches use feature vectors. A feature vector is an array of numbers that represents a portion of an image. When a new feature vector is received, it is often useful to be able to retrieve similar feature vectors from the database, as those feature vectors represent images that are similar to the image associated with the received feature vector.
When the database is small and the similarity function is fast to compute, an exhaustive search method can be used. An exhaustive search computes the similarity between a query vector associated with a query image and each record in the database. Such a search is too slow for many applications, particularly once the database becomes large. One of the problems with content-based image searching is therefore how to quickly find in the database those feature vectors that match a feature vector of a query image. While many approaches have been proposed, each suffers from limitations or inaccuracies.
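The exhaustive search described above can be sketched as follows. This is an illustrative example only; the function and variable names are our own, not taken from any particular system:

```python
import numpy as np

def exhaustive_search(database, query, k=1):
    """Brute-force search: compare the query against every feature
    vector in the database, at O(N) similarity computations per query."""
    # Euclidean distance from the query to every database vector.
    dists = np.linalg.norm(database - query, axis=1)
    # Indices of the k most similar (closest) records.
    return np.argsort(dists)[:k]

# A toy database of 2-D feature vectors.
db = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
nearest = exhaustive_search(db, np.array([0.9, 1.1]))
```

Because every query touches every record, the cost grows linearly with the database size, which is exactly the scaling problem that the hash-based approaches below attempt to avoid.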
Hash-based strategies provide approaches that are closest to being both fast and accurate. Hash-based approaches involve computing a hash code for each vector in a database, and using the hash code to associate records with entries in a hash table. At query time, a hash code is computed for a query vector and the hash code is used to quickly find matching records in the hash table. For this strategy to be effective, the hash function should be ‘locality sensitive’, which means the function returns the same hash code for vectors that are close to each other. A locality sensitive hash function partitions a feature space into regions, where each region is associated with a particular hash code.
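One common way to build such a locality sensitive function is the random-hyperplane construction sketched below, where each bit of the code records which side of a random hyperplane the vector lies on. This is an illustrative sketch of the general idea, not a specific function from any of the methods discussed here:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_hash(dim, n_bits):
    """Random-hyperplane hash: each bit records which side of a random
    hyperplane a vector lies on, so vectors that are close together are
    likely to receive the same code."""
    planes = rng.normal(size=(n_bits, dim))
    def hash_fn(v):
        # One sign bit per hyperplane, concatenated into a tuple.
        return tuple((planes @ v > 0).astype(int))
    return hash_fn

h = make_hash(dim=8, n_bits=4)
a = rng.normal(size=8)
# Positive scaling preserves every dot-product sign, hence the code.
assert h(a) == h(1.001 * a)
```

Each distinct code corresponds to one cell of the partition of the feature space induced by the hyperplanes.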
One problem that exists with the hash-based approaches is that for any hash function there will always be two vectors that are close but return different hash codes. This will occur when the two vectors are located either side of a partition boundary and leads to the problem of false-negative matches. False-negative matches occur when the method fails to find similar vectors because the respective hash codes of the similar vectors are different.
One known approach to this problem is locality-sensitive hashing (LSH), which uses multiple hash functions, each with randomly chosen parameters. Each feature vector in the database is hashed using all of the hash functions and is recorded in a corresponding hash table. Given a query vector, all of the hash functions are used to access the stored records. As each hash function is different, the probability of a false-negative match decreases as the number of hash functions increases. However, increasing the number of hash functions also increases the amount of memory required for hash storage and the time taken to search the hash tables. Varying the number of hash functions therefore allows a trade-off between memory, speed, and accuracy to be selected, but LSH requires many hash functions to achieve high accuracy when used with high-dimensional feature vectors. An extension to LSH selects hash functions that balance the number of record allocations to each hash code, allowing a further trade-off between accuracy and speed; the hash functions for each dimension of the hash code are chosen to jointly optimise the preservation of similarity and the entropy of the hash function. A disadvantage of this extension is that the hash functions are selected during a training phase and balance record allocations according to the distribution of the training data, so any difference between the distribution of subsequent data and that of the training data reduces the effectiveness of the balancing. The memory requirement of the multiple hash tables has limited the usefulness of LSH when applied to large databases of high-dimensional vectors.
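The multi-table LSH scheme can be sketched as follows, reusing the random-hyperplane hash from above. The class and method names are our own, chosen for illustration:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(1)

class MultiTableLSH:
    """Classic LSH: several independent random-hyperplane hash
    functions, each with its own hash table, reduce false negatives."""
    def __init__(self, dim, n_bits, n_tables):
        self.planes = [rng.normal(size=(n_bits, dim))
                       for _ in range(n_tables)]
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    def _code(self, planes, v):
        return tuple((planes @ v > 0).astype(int))

    def insert(self, record_id, v):
        # Register the record in every table under that table's code.
        for planes, table in zip(self.planes, self.tables):
            table[self._code(planes, v)].append(record_id)

    def query(self, v):
        # Union of candidates over all tables: more tables lowers the
        # chance that a near neighbour is missed by every hash function.
        candidates = set()
        for planes, table in zip(self.planes, self.tables):
            candidates.update(table.get(self._code(planes, v), []))
        return candidates

index = MultiTableLSH(dim=8, n_bits=6, n_tables=4)
vecs = rng.normal(size=(20, 8))
for i, v in enumerate(vecs):
    index.insert(i, v)
assert 0 in index.query(vecs[0])  # an exact duplicate is always found
```

The memory cost visible here, one table entry per record per hash function, is the trade-off described above.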
Another approach is Point Perturbation, which uses a single hash function with the problem of false-negative matches being dealt with in the search step. When given a query vector, the hash table is accessed to get a first list of candidate records. A number of probes are generated by applying a small random perturbation to the original query point. Each probe is used to access the hash table and the retrieved records are added to the list of candidate records. The process of generating additional probes from the original query point is repeated several times. The probability of a false-negative match decreases with an increase in the number of probes used, so varying the number of probes manages a trade-off between speed and accuracy, while having a lower memory requirement than using LSH. The disadvantage of Point Perturbation is the number of probes required for a query vector. Point Perturbation becomes slower as the dimensionality of the vectors increases since, for a single query vector, more probes are required to achieve high accuracy.
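A minimal sketch of the Point Perturbation query step is given below, using a simple coordinate-quantising hash purely for illustration; the table layout and names are assumptions, not from any specific implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

def point_perturbation_query(table, hash_fn, query, n_probes, radius):
    """Multi-probe search over a single hash table: the query point is
    probed as-is, then repeatedly perturbed by small random offsets and
    each perturbed point is hashed and probed as well."""
    candidates = set(table.get(hash_fn(query), []))
    for _ in range(n_probes):
        probe = query + rng.normal(scale=radius, size=query.shape)
        candidates.update(table.get(hash_fn(probe), []))
    return candidates

# Toy hash: quantise each coordinate to an integer grid cell.
hash_fn = lambda v: tuple(np.floor(v).astype(int))
table = {hash_fn(np.array([0.5, 0.5])): ["rec-A"]}
found = point_perturbation_query(table, hash_fn, np.array([0.9, 0.4]),
                                 n_probes=32, radius=0.2)
assert "rec-A" in found  # direct hash of the query already matches
```

Note that each probe requires hashing a full vector, which is why the cost grows with both the probe count and the dimensionality.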
Another hashing approach is Hash Perturbation. Hash Perturbation is similar to Point Perturbation, in that Hash Perturbation performs multiple probes per query, but avoids the need to randomly perturb the query point. Instead, this method directly perturbs the hash code of the original query point. This is made possible because the hash function produces hash codes that are composed of many small hash codes, where each of the smaller hash codes is a function of exactly one coordinate of the feature vector. An early implementation of this approach is Grid Files. The Grid Files method forms a grid over the space of possible vectors by quantizing each dimension, and associates each grid cell with the records whose vectors fall within the cell. Given a query point, the method determines the grid cells that are within a query radius of the query point. The method then checks the records associated with the accessed grid cells for matching points. Unfortunately, this method is slow for high-dimensional spaces. The reason is that, for an n-dimensional space, the number of accessed grid cells for one query is of the order of 2^n. As each dimension is independently hashed, a hash code is associated with a rectangular region in the space, and a query covers a rectangular region that is composed of the union of the hash cells. In the extreme, each dimension is hashed to a single bit. In that case, the hash code for a vector is the concatenation of the bits from each coordinate, so an n-dimensional vector leads to an n-bit hash code. Additional query hash codes are generated by flipping one or more bits in the first hash code.
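The bit-flipping probe generation described above can be sketched directly on the hash codes, with no vector arithmetic required. This is an illustrative sketch; the function name is our own:

```python
from itertools import combinations

def perturbed_codes(code, max_flips):
    """Generate probe hash codes by flipping up to `max_flips` bits of
    the original query hash code, rather than perturbing the point."""
    probes = [tuple(code)]
    for k in range(1, max_flips + 1):
        for positions in combinations(range(len(code)), k):
            flipped = list(code)
            for p in positions:
                flipped[p] ^= 1  # flip this bit of the code
            probes.append(tuple(flipped))
    return probes

# The original 3-bit code followed by its three single-bit flips.
probes = perturbed_codes((0, 1, 1), max_flips=1)
```

Allowing all n bits to flip enumerates every neighbouring cell, which is where the 2^n probe count for corner-adjacent queries comes from.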
One problem with Hash Perturbation is that each dimension is independently hashed. The hash function partitions the space into rectangular regions. If a query point is near the corner of a region, then up to 2^n probes are needed to avoid false negatives. For high-dimensional vectors (large n), the number of required probes can significantly limit the speedup provided by hashing. Reducing the number of probes per query ameliorates this, but also reduces accuracy.
Lattice theory has been applied to Point Perturbation and Hash Perturbation, using the lattices known as A* and D*. This has led to methods that determine the probes for a query based on the location of the query point within a Voronoi region. When a record with an associated vector is added to the database, the method determines in which Voronoi region the vector is located, and associates the record with the corresponding lattice point; for example, one method uses a hash code associated with the lattice point. When a query vector is received, the lattice point nearest to the query vector is used to access the records associated with that lattice point. Additional probes for the query are determined by calculating the distance from the query point to each wall of the Voronoi region; if the distance is sufficiently small, the lattice point on the other side of the wall is used as a probe. Unfortunately, when the vectors have a large number of dimensions, the number of walls of a Voronoi region is extremely large, and calculating the distance from the query point to each wall is slow. This method is therefore inappropriate for systems that use high-dimensional vectors and require fast, accurate queries.
For a random hash function, 2-way chaining can be applied to achieve balanced allocations. The method uses a pair of hash functions, thus providing two hash codes for each object. At insertion time, a greedy algorithm selects the hash code with the lowest number of existing registrations. At retrieval time, both query hash codes are used to retrieve objects. Compared to unbalanced hash allocation with a random hash function, the expected maximum registration to any hash code is reduced exponentially by using the 2-way chaining algorithm.
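The 2-way chaining scheme can be sketched as follows. The class is illustrative and takes two caller-supplied hash functions; the names are assumptions of this sketch:

```python
from collections import defaultdict

class TwoWayChainedTable:
    """2-way chaining: each object has two candidate hash codes. The
    greedy insert registers under the less-loaded code, and queries
    probe both codes to retrieve the object either way."""
    def __init__(self, h1, h2):
        self.h1, self.h2 = h1, h2
        self.table = defaultdict(list)

    def insert(self, obj):
        c1, c2 = self.h1(obj), self.h2(obj)
        # Greedy choice: register under the code with fewer entries.
        code = c1 if len(self.table[c1]) <= len(self.table[c2]) else c2
        self.table[code].append(obj)

    def query(self, obj):
        # The object may sit under either code, so probe both.
        return self.table[self.h1(obj)] + self.table[self.h2(obj)]

# Toy hash functions over integers, chosen only for demonstration.
t = TwoWayChainedTable(h1=lambda x: x % 3, h2=lambda x: x % 5)
for x in range(10):
    t.insert(x)
assert 7 in t.query(7)
```

The cost of the exponential reduction in maximum load is that every lookup performs two probes instead of one.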
Thus, a need exists to provide an improved method and system for content-based image searching.