1. Field of the Invention
The present invention relates to a feature conversion device that converts a feature vector into a bit code using a transformation matrix in order to search for similar information, and to a similar information search apparatus provided with the feature conversion device.
2. Description of Related Art
In similar information search technologies such as image search, voice recognition, text search, and pattern recognition, a feature vector is used to evaluate the degree of similarity between one piece of information and another. A feature vector is a representation into which information such as an image, a voice, or a text is converted so that a computer can handle it easily, and it is expressed as a D-dimensional vector. For example, an image A and an image B are regarded as similar when the distance between the feature vector of the image A and the feature vector of the image B is small. Similarly, a voice waveform C and a voice waveform D are regarded as similar when the distance between their feature vectors is small. Thus, in these similar information search technologies, the degree of similarity between pieces of information is evaluated by comparing feature vectors.
For example, the L1 norm, the L2 norm, and the intervector angle are used as measures of the distance between feature vectors $x, y \in \mathbb{R}^D$.
These measures can be calculated for such feature vectors using expressions (1) to (3).
L1 norm
$$\|x - y\|_1 = \sum_i |x_i - y_i| \qquad (1)$$
L2 norm
$$\|x - y\|_2 = \sqrt{\sum_i (x_i - y_i)^2} \qquad (2)$$
Intervector angle
$$\theta = \cos^{-1}\left(\frac{x \cdot y}{\|x\|_2 \, \|y\|_2}\right) \qquad (3)$$
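As an illustration only, the three measures of expressions (1) to (3) can be sketched in Python as follows. The function names and the sample vectors are hypothetical, and the clamping of the cosine is a numerical safeguard added for this sketch rather than part of the expressions.

```python
import math

def l1_distance(x, y):
    """Expression (1): sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def l2_distance(x, y):
    """Expression (2): square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def intervector_angle(x, y):
    """Expression (3): angle in radians between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return math.acos(cosine)

x = [1.0, 0.0, 2.0]   # hypothetical sample vectors
y = [0.0, 1.0, 2.0]
print(l1_distance(x, y))        # 2.0
print(l2_distance(x, y))        # sqrt(2), about 1.414
print(intervector_angle(x, y))  # acos(4/5), about 0.6435 rad
```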
In the similar information search technology, information similar to particular information (input information) is searched for among a large amount of information (sometimes hundreds of millions of items, depending on the intended use). Therefore, a technology called nearest neighbor search has been developed for finding, at high speed, the k feature vectors most similar to the feature vector of the input information among the feature vectors of the large amount of information. The k-nearest neighbor search and the approximate k-nearest neighbor search are well known as nearest neighbor search technologies.
The k-nearest neighbor search is a technology for finding, at high speed, the k feature vectors having the closest distance among a large number of feature vectors. For example, the k-d tree can be cited as a typical technique of the k-nearest neighbor search (for example, see J. L. Bentley, Multidimensional binary search trees used for associative searching, Communications of the ACM, 18 (9): 509-517, 1975). The approximate k-nearest neighbor search is likewise a technology for finding, at high speed, the k feature vectors having the closest distance among a large number of feature vectors. However, by permitting an error, the approximate k-nearest neighbor search can be implemented at a speed much higher than that of the k-nearest neighbor search (hundreds to thousands of times higher). For example, LSH (locality-sensitive hashing) can be cited as a typical technique of the approximate k-nearest neighbor search (for example, see Piotr Indyk and Rajeev Motwani, “Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality”, Proceedings of 30th Symposium on Theory of Computing (1998)).
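For orientation, the result that both kinds of search aim at can be written as an exact linear scan. The sketch below is a brute-force baseline only; it is neither a k-d tree nor LSH, and the function names and the small database are hypothetical.

```python
import heapq

def squared_l2(x, y):
    """Squared L2 distance; it ranks neighbors in the same order as the L2 norm."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def k_nearest(query, vectors, k):
    """Return the indices of the k stored vectors closest to the query,
    nearest first, by scanning every stored vector (exact, not approximate)."""
    return heapq.nsmallest(
        k, range(len(vectors)), key=lambda i: squared_l2(query, vectors[i])
    )

database = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]  # hypothetical data
print(k_nearest([1.0, 1.0], database, k=2))  # [1, 3]
```

The scan computes one distance per stored vector, which is the cost that tree-based and hashing-based techniques try to avoid.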
Recently, as the amount of information handled by computers has increased, similar information search technology frequently has to deal with a large number of high-dimensional feature vectors. As a result, the following two points have become serious problems.
The first problem is that the calculation of the distance between two feature vectors $x, y \in \mathbb{R}^D$ is too slow.
For example, in the case that the square of the L2 norm is used as the measure of the distance, because of
$$\|x - y\|_2^2 = \sum_{i=1}^{D} (x_i - y_i)^2$$
it is necessary to perform D subtractions, D multiplications, and (D−1) additions. In many cases, because each element of the feature vector is expressed as a single-precision real number (float), the calculation load becomes extremely high. The calculation load increases further as the feature vector becomes higher-dimensional. When the number of feature vectors dealt with increases greatly, a large number of distance calculations must be performed, which increases the calculation load still further. Therefore, even if a k-nearest neighbor search algorithm is applied, sufficient speed is frequently not obtained.
The second problem is that a large amount of memory is consumed. In the case that each element of the feature vector is expressed as a 4-byte single-precision real number, a D-dimensional feature vector consumes 4D bytes of memory. The memory consumption increases with the dimension of the feature vector and with the number of feature vectors. In the case that the feature vectors do not fit in main memory, it is necessary to store them in secondary storage such as a hard disk. However, in the case that secondary storage is used, the processing speed decreases dramatically.
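To make the 4D-byte figure concrete, the following short calculation estimates the memory needed for a collection of single-precision feature vectors. The dimension and the number of vectors are hypothetical values chosen only for illustration.

```python
def float_vectors_bytes(dim, count, bytes_per_element=4):
    """Bytes needed for `count` feature vectors of `dim` single-precision elements."""
    return dim * bytes_per_element * count

# Hypothetical workload: 100 million 128-dimensional feature vectors.
total = float_vectors_bytes(dim=128, count=100_000_000)
print(total / 10**9, "GB")  # 51.2 GB
```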
Therefore, techniques that solve these two problems by converting the feature vector into a binary bit code have recently been proposed. Typical examples include random projection (for example, see Michel X. Goemans, David P. Williamson, “Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming”, Journal of the ACM Volume 42, Issue 6 (November 1995) Pages 1115-1145), very sparse random projection (for example, see Ping Li, Trevor J. Hastie, Kenneth W. Church, “Very sparse random projections”, KDD '06 Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (2006)), and spectral hashing (for example, see Y. Weiss, A. Torralba, R. Fergus, “Spectral Hashing”, Advances in Neural Information Processing Systems, 2008).
In these techniques, the D-dimensional feature vector is converted into a d-bit binary bit code. The conversion is performed such that the distance in the original space is strongly correlated with the Hamming distance in the post-conversion space (for example, see Michel X. Goemans, David P. Williamson, “Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming”, Journal of the ACM Volume 42, Issue 6 (November 1995) Pages 1115-1145; Lemma 3.2 on page 1121 in particular describes why the distance in the original space is strongly correlated with the Hamming distance in the post-conversion space). The Hamming distance is the number of bit positions at which two bit codes differ. It can be calculated at extremely high speed, because it only requires taking the XOR of the two bit codes and counting the bits that are 1. In many cases, a speedup of tens to hundreds of times can be achieved. Additionally, the memory requirement, originally 4D bytes, can be reduced to d/8 bytes, that is, to between a few hundredths and a few thousandths of the original.
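The XOR-and-count calculation and the 4D-byte versus d/8-byte comparison can be sketched as follows. Storing a bit code in a Python integer, the function names, and the sizes D = 128 and d = 64 are assumptions made for this illustration.

```python
def hamming_distance(code_a, code_b):
    """Number of differing bits between two bit codes stored as integers."""
    # XOR leaves a 1 exactly where the two codes differ; then count the 1 bits.
    return bin(code_a ^ code_b).count("1")

def memory_ratio(dim, bits, bytes_per_element=4):
    """Size of a d-bit code (d/8 bytes) relative to a D-dimensional float vector (4D bytes)."""
    return (bits / 8) / (dim * bytes_per_element)

a = 0b10110010
b = 0b10011010
print(hamming_distance(a, b))          # 2
print(memory_ratio(dim=128, bits=64))  # 0.015625, i.e. 8 bytes instead of 512
```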
Many bit coding algorithms can be described in a general form by expression (4).
$$y = \operatorname{sgn}\left(f\left(W^{T} x + b\right)\right) \qquad (4)$$
Here, x is a (D-dimensional) feature vector, W is a (D-by-d) transformation matrix, b is a (d-dimensional) bias, y is a (d-dimensional) bit code, f(z) is a nonlinear function, and sgn(z) is the sign function (the function returns −1 when the value is negative and +1 when the value is positive). From expression (4), each element of y is either +1 or −1. The bit code is formed by lining up “1”s and “0”s, where +1 is mapped to “1” and −1 is mapped to “0”.
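A minimal sketch of expression (4) is given below, with W stored as a list of D rows of d elements. The names `bit_code` and `to_bits` and the sample values are hypothetical. Because the text defines sgn only for negative and positive inputs, treating an input of exactly zero as positive is an assumption of this sketch.

```python
def sign(value):
    # Assumption: an input of exactly 0 is mapped to +1.
    return -1 if value < 0 else 1

def bit_code(x, W, b, f=lambda z: z):
    """Expression (4): y = sgn(f(W^T x + b)), returned as a list of +1 / -1."""
    D, d = len(W), len(W[0])
    y = []
    for j in range(d):
        # j-th element of W^T x + b: inner product of column j of W with x, plus bias.
        z = sum(W[i][j] * x[i] for i in range(D)) + b[j]
        y.append(sign(f(z)))
    return y

def to_bits(y):
    """Map +1 to "1" and -1 to "0" and line the bits up as a string."""
    return "".join("1" if v == 1 else "0" for v in y)

W = [[1.0, -1.0, 0.5],
     [0.0, 2.0, -1.0]]   # D = 2 rows, d = 3 columns (hypothetical values)
b = [0.0, 0.0, 0.0]
x = [1.0, 1.0]
y = bit_code(x, W, b)
print(y, to_bits(y))     # [1, 1, -1] 110
```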
In the case of random projection, each element of W is sampled from a normal distribution having a mean of zero and a variance of 1. The bias b is set to a zero vector, or to the average or the median of previously collected feature vectors. The nonlinear function is defined as f(z)=z.
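The random projection case can be sketched as follows, assuming a zero bias and f(z)=z. The function names, the sizes D = 8 and d = 16, and the fixed seed are hypothetical choices made only so that the example is repeatable.

```python
import random

def random_projection_matrix(D, d, rng):
    """D-by-d matrix whose elements are drawn from a normal distribution
    with mean 0 and variance 1."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(D)]

def encode(x, W):
    """Bit code with b = 0 and f(z) = z: one bit per column of W.
    A projection of exactly 0 is treated as positive (assumption)."""
    D, d = len(W), len(W[0])
    return [1 if sum(W[i][j] * x[i] for i in range(D)) >= 0 else 0
            for j in range(d)]

rng = random.Random(0)   # fixed seed, for repeatability only
W = random_projection_matrix(D=8, d=16, rng=rng)
x = [rng.uniform(-1.0, 1.0) for _ in range(8)]
code = encode(x, W)
print(code)              # a list of sixteen 0/1 values
```

Because only the sign of each projection is kept, multiplying x by a positive constant leaves the bit code unchanged; with a zero bias the code depends on the direction of x, not on its length.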
In the case of very sparse random projection, each element of W takes one of the values {−1, 0, 1} with probabilities {1/(2*sqrt(D)), 1−1/sqrt(D), 1/(2*sqrt(D))}, respectively, where D is the number of dimensions of the feature vector. The bias b is set to a zero vector, or to the average or the median of previously collected feature vectors. The nonlinear function is defined as f(z)=z. Because W becomes extremely sparse (for example, about 90% of the elements of W are 0 in the case of a 128-dimensional feature vector), the calculation can be performed at high speed.
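The sampling of W for very sparse random projection can be sketched as follows. The sketch reads the probabilities of −1 and +1 as 1/(2*sqrt(D)) each, which is the reading under which the three probabilities sum to 1. The function name, the size d = 64, and the seed are hypothetical.

```python
import math
import random

def very_sparse_matrix(D, d, rng):
    """D-by-d matrix with elements -1, 0, +1 drawn with probabilities
    1/(2*sqrt(D)), 1 - 1/sqrt(D), 1/(2*sqrt(D))."""
    p = 1.0 / (2.0 * math.sqrt(D))

    def draw():
        u = rng.random()          # uniform in [0, 1)
        if u < p:
            return -1
        if u < 2.0 * p:
            return 1
        return 0

    return [[draw() for _ in range(d)] for _ in range(D)]

rng = random.Random(0)            # fixed seed, for repeatability only
D, d = 128, 64
W = very_sparse_matrix(D, d, rng)
zero_fraction = sum(row.count(0) for row in W) / (D * d)
print(zero_fraction)              # close to 1 - 1/sqrt(128), roughly 0.91
```

Since most elements are zero and the rest are ±1, the product of W and a feature vector needs no multiplications and only a small fraction of the additions of a dense matrix.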
In the case of spectral hashing, principal component analysis is applied to previously collected feature vectors (a training set), and the principal component axes thus determined are used as the column vectors of W. The bias b is set to the average of the training set. A trigonometric function is used as the nonlinear function f(z). Because the binary bit coding is based on learning, spectral hashing can generate a shorter bit code.
However, the binary bit coding techniques of the related art have the following problems. A first problem is that the bit coding itself is slow. That is, in the case that a D-dimensional vector is converted into a d-bit code, it is necessary to perform (D×d) multiplications and ((D−1)×d) additions in order to calculate W^T x in expression (4), because each of the d elements is a sum of D products. Accordingly, in the techniques of the related art, although the speed of the distance calculation can be enhanced, the bit coding that precedes the distance calculation becomes a bottleneck. The first problem becomes more serious as the number of dimensions D of the feature vector increases. In particular, the techniques of the related art are very inconvenient in the case that the bit coding is required in real time, for example, when they are applied to a real-time image search or real-time voice recognition.
A second problem is that the bit code tends to be long. That is, in the case that W is constructed from random numbers, the distribution of the feature vectors is not taken into account, so a long bit code is required in order to obtain sufficient performance.
Among the binary bit coding techniques of the related art, random projection has both the first and second problems, very sparse random projection has the second problem, and spectral hashing has the first problem.