1. Field of the Invention
The present invention relates to the processing of vector format data, and more particularly to a data processing apparatus and method for retrieving a predetermined number of data pieces from a database in accordance with their similarity to an input vector.
2. Related Background Art
A distance is widely used as a measure of similarity between data expressed as vectors. For example, in a character recognition system or a speech recognition system, sampled data is mapped into a feature quantity space spanned by a suitable basis, and the resulting vector-expressed data is stored as a prototype. A distance between each prototype and newly input data is calculated, and the input data is identified as belonging to the class corresponding to the nearest prototype.
The least efficient calculation method is an exhaustive search. The calculation amount of this method is on the order of the product of the vector dimension and the number of prototypes.
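As an illustration of why the exhaustive search costs on the order of d x N operations, the following is a minimal sketch, assuming a Euclidean distance and prototypes held as tuples of floats. The function name exhaustive_nearest and the sample data are hypothetical and are not taken from any cited document.

```python
import math

def exhaustive_nearest(prototypes, x):
    """Return (index, distance) of the prototype nearest to x by scanning all of them."""
    best_idx, best_dist = -1, math.inf
    for idx, y in enumerate(prototypes):
        # One distance costs O(d); N prototypes give O(d * N) in total.
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx, best_dist

# Hypothetical sample data: three two-dimensional prototypes.
prototypes = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
result = exhaustive_nearest(prototypes, (0.9, 1.2))
```

Every prototype is visited regardless of how far it lies from the test vector, which is the cost the methods described below try to avoid.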
The calculation amount of a distance or an inner product is recognized as a critical obstacle to database search. Because of recent rapid progress in computer processing ability, a database can store not only text data but also non-text data such as images and sounds. In order to search such non-text data by using a keyword as in a conventional method, a keyword must be added to the non-text data in advance. If it is desired to avoid the work of adding keywords, it is necessary to perform a similarity search by using feature quantity vectors.
Even in searching text data, a similarity search algorithm which searches text data by using vectors is used in order to realize a flexible search. In this case, the calculation amount becomes a substantial issue in realizing a search system. The number of data pieces stored in a general database exceeds several hundred thousand. Therefore, each time the vector dimension is raised by one, the calculation amount of an exhaustive search increases by several hundred thousand operations.
In order to avoid such a case, it is essential either to lower the order of the vector dimension or to reduce the number of data pieces to be calculated. The former lowers the dimension of the space which expresses the data. Therefore, there is a possibility that information necessary for data search is not sufficiently expressed in the vector components. The latter becomes a meaningful methodology when the number of data pieces requested as search results is sufficiently small as compared with the total number of data pieces. Those cases to be processed by k-NN search belong to this category, and several effective methods have been proposed.
According to the k-NN search, the k prototypes nearest to a test vector are searched from a set of prototypes stored in a system, and in accordance with the classes of the searched prototypes, the class of the test vector is identified. In this case, one of the important issues is how the k prototypes nearest to the test vector are searched at high speed. This requirement also applies to database search.
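The k-NN identification described above can be sketched as follows. This is an illustrative sketch only: the text does not state how the classes of the k prototypes are combined, so a simple majority vote is assumed here, and the names knn_classify, labels, and the sample data are hypothetical.

```python
import heapq
from collections import Counter

def knn_classify(prototypes, labels, x, k):
    """Classify x by majority vote among its k nearest prototypes (assumed rule)."""
    def sq_dist(y):
        return sum((a - b) ** 2 for a, b in zip(x, y))
    # Indices of the k prototypes with the smallest squared distance to x,
    # returned in ascending order of distance.
    nearest = heapq.nsmallest(k, range(len(prototypes)),
                              key=lambda i: sq_dist(prototypes[i]))
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0], nearest

# Hypothetical sample data: two classes "a" and "b".
prototypes = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0), (1.0, 0.0)]
labels = ["a", "a", "b", "b", "a"]
label, nearest = knn_classify(prototypes, labels, (0.2, 0.1), k=3)
```

This sketch still computes a distance to every prototype; the high speed methods discussed in the remainder of this section aim to find the same k prototypes while computing far fewer distances.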
A search user desires only data pieces nearest to the search key designated by the user, among a large amount of data stored in a database, and does not desire other data pieces at all, much less values of distances and inner products. Techniques satisfying such requirements of a search user are coincident with objectives of a high speed algorithm of k-NN search.
In order to reduce the calculation amount required for searching the k prototypes nearest to a test vector from a set of prototypes, the prototypes are generally structurized in advance. The more the properties of the data are reflected in the structurization, the more the search calculation amount is expected to be reduced.
For example, if the prototypes are structurized hierarchically, an operation of dividing the N-dimensional space expressing the prototypes is recursively repeated. A method of dividing the space by using a hyperplane as a boundary is called a K-D-B Tree [Document 1], a method of dividing the space by hyper-rectangles is called an R-Tree [Document 2], a method of dividing the space by hyper-spheres is called an SS-Tree [Document 3], and a method of dividing the space by a combination of hyper-rectangles and hyper-spheres is called an SR-Tree [Document 4]. If the N-dimensional vector space is mapped to a space spanned by the eigenvectors of a covariance matrix representing the prototype distribution, a structurization more effective for reducing the search calculation amount can be expected [Documents 5, 6].
With these methods, however, the calculation amount and storage capacity necessary for data structurization increase exponentially as the order of the vector dimension is raised. Therefore, application to data expressed by high-dimensional vectors may be restricted in practice.
[Document 1] J. T. Robinson: "The K-D-B Tree: A Search Structure for Large Multidimensional Dynamic Indexes", Proc. on ACM SIGMOD, pp. 10-18, 1981.
[Document 2] A. Guttman: "R-trees: A dynamic index structure for spatial searching", Proc. ACM SIGMOD, Boston, USA, pp. 47-57, June 1984.
[Document 3] D. A. White and R. Jain: "Similarity indexing with the SS-tree", Proc. of the 12th Int. Conf. on Data Engineering, New Orleans, USA, pp. 323-331, February 1996.
[Document 4] Katayama and Satoh: "SR-Tree: A proposal of index structure for nearest neighbor searching of high dimensional point data", IEICE Papers (D-I), vol. 18-D-I, no. 8, pp. 703-717, August 1997.
[Document 5] R. F. Sproull: "Refinements to Nearest Neighbor Searching in K-dimensional Trees", Algorithmica, 6, pp. 579-589, 1991.
[Document 6] D. A. White and R. Jain: "Similarity Indexing: Algorithms and Performance", Proc. on SPIE, pp. 62-73, 1996.
There are algorithms which use "gentle" structurization not incorporating statistical means and a somewhat "smart" search algorithm, in order to reduce the calculation amount. Of these, one of the most fundamental algorithms is an algorithm by Friedman et al., called a mapping algorithm [Document 7].
[Document 7] J. H. Friedman, F. Baskett, and L. J. Shustek: "An Algorithm for Finding Nearest Neighbors", IEEE Trans. on Computers, pp. 1000-1006, October 1975.
The data structurization required as a pre-process of the mapping algorithm is a process of sorting the vectors by each component, which corresponds to structurization based upon a phase. Namely, if each prototype is a d-dimensional vector, d sorting lists are generated.
With this process, a pair of lists, namely a list Vj storing the j component values arranged in ascending order and a list Ij storing the corresponding prototype ID numbers, is formed for each component, so that there are as many pairs as the order of the vector dimension. Namely, the (n+1)-th value Vj(n+1) from the start of Vj is equal to or larger than the n-th value Vj(n). The j component value YIj(n)(j) of the prototype YIj(n), that is, the prototype Y whose ID number is Ij(n), coincides with Vj(n).
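The pre-process described above can be sketched as follows. This is a minimal illustration, assuming prototypes held as tuples and ID numbers that start from 0 (the text does not state the numbering); the function name build_sorted_lists is hypothetical.

```python
def build_sorted_lists(prototypes):
    """For each component j, build V[j] (sorted values) and I[j] (matching prototype IDs)."""
    d = len(prototypes[0])
    V, I = [], []
    for j in range(d):
        # Prototype IDs arranged in ascending order of their j-th component.
        order = sorted(range(len(prototypes)), key=lambda n: prototypes[n][j])
        I.append(order)
        V.append([prototypes[n][j] for n in order])
    return V, I

# Hypothetical sample data: three two-dimensional prototypes with IDs 0, 1, 2.
prototypes = [(0.5, 2.0), (0.1, 3.0), (0.9, 1.0)]
V, I = build_sorted_lists(prototypes)
# Here V[0] is [0.1, 0.5, 0.9] and I[0] is [1, 0, 2].
```

For every j and n, prototypes[I[j][n]][j] equals V[j][n], which is the coincidence between YIj(n)(j) and Vj(n) stated above.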
A principle of the mapping algorithm for selecting a pair of prototypes nearest to a test prototype from a prototype set will be described with reference to FIG. 10. A search is performed by using a pair of sorting lists Vm and Im selected by a proper criterion. In the example shown in FIG. 10, an m-axis is selected. Im stores the ID number of data sorted based upon the component values, so that the order on the list correctly reflects the phase along the m-axis. First, a value nearest to the m component X(m) of a test vector X is searched from Vm. This value is assumed to be Vm(j). The prototype corresponding to Vm(j) is YIm(j). In the example shown in FIG. 10, YIm(j) corresponds to Y1. Although Y1 is nearest to X with respect to the m component, it is not necessarily nearest to X in the whole space.
Next, a distance ρ(X, Y1) between X and Y1 is calculated. Only a prototype whose m component value belongs to the open interval (X(m) - ρ(X, Y1), X(m) + ρ(X, Y1)) (area A in FIG. 10) can be nearer to X than Y1, so that only such prototypes remain significant as search targets. In the example shown in FIG. 10, the next nearest prototype Y2 with respect to the m component is checked, so that the prototype set to be searched is further restricted to (X(m) - ρ(X, Y2), X(m) + ρ(X, Y2)) (area B in FIG. 10). As above, with the mapping algorithm, the prototype set to be searched is made smaller in accordance with the component value in the one-dimensional space, to thereby reduce the calculation amount.
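The principle described above can be sketched as follows. This is an illustrative reading of the description, not a reproduction of the algorithm in Document 7: it assumes a Euclidean distance, a fixed axis m instead of an axis chosen by a criterion, and Python 3.8 or later for math.dist. The name mapping_nearest and the sample data are hypothetical.

```python
import bisect
import math

def mapping_nearest(prototypes, x, m=0):
    """Nearest prototype to x, pruning candidates by their m-th component."""
    order = sorted(range(len(prototypes)), key=lambda n: prototypes[n][m])  # list Im
    values = [prototypes[n][m] for n in order]                              # list Vm
    hi = bisect.bisect_left(values, x[m])   # first entry with value >= x[m]
    lo = hi - 1                             # last entry with value < x[m]
    best_id, best_dist, n_calc = -1, math.inf, 0
    while lo >= 0 or hi < len(values):
        gap_lo = x[m] - values[lo] if lo >= 0 else math.inf
        gap_hi = values[hi] - x[m] if hi < len(values) else math.inf
        # Visit the unvisited prototype closest to x along the m-axis.
        if gap_lo <= gap_hi:
            gap, cand, lo = gap_lo, order[lo], lo - 1
        else:
            gap, cand, hi = gap_hi, order[hi], hi + 1
        # Outside the open interval (x[m] - best, x[m] + best): nothing left can win.
        if gap >= best_dist:
            break
        dist = math.dist(x, prototypes[cand])
        n_calc += 1                          # count full distance calculations
        if dist < best_dist:
            best_id, best_dist = cand, dist
    return best_id, best_dist, n_calc

# Hypothetical sample data: five two-dimensional prototypes.
prototypes = [(0.0, 0.0), (1.0, 5.0), (1.1, 1.0), (4.0, 1.0), (-3.0, 1.2)]
best_id, best_dist, n_calc = mapping_nearest(prototypes, (1.0, 1.0), m=0)
```

In this sample, distances are calculated for only two of the five prototypes before the interval test ends the search.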
It is reported, however, that the performance of the mapping algorithm by Friedman et al. degrades as the order of the vector dimension becomes higher [Document 7]. The ratio of the expected number of prototypes whose distances were actually calculated to the total number of prototypes is herein called a relative efficiency η, so that a smaller η means a more efficient search. For the case that one nearest neighbor is searched from a set of 1000 prototypes, η is 0.03 for two-dimensional vectors, whereas η deteriorates to 0.6 for nine-dimensional vectors.
By representing the number of prototypes picked up from a prototype set by NEXT and the number of prototypes whose distances were calculated by Ng, the calculation amount required for deciding whether a distance calculation is to be performed is O(NEXT), and the calculation amount for the actual distance calculation is O(dNg). As Ng approaches the value of NEXT, a process overhead is added, so that the actual calculation time for nine-dimensional vectors may become worse than that of the exhaustive search. In order to solve this problem that the mapping algorithm cannot be used for high-dimensional vectors, Nene et al. have devised a very simple and effective algorithm [Document 8]. This algorithm, called "Searching by Slicing", leaves as a search candidate only a prototype belonging to the closed interval [X(j) - ε, X(j) + ε] extending before and after the j-th component X(j) of the test vector by an amount ε. Since this algorithm evaluates each component independently, it is apparent that the performance depends upon ε. Although Nene et al. have proposed a method of deciding the value of ε, this method is probabilistic and not suitable for high-dimensional vectors.
[Document 8] S. A. Nene and S. K. Nayar: "A Simple Algorithm for Nearest Neighbor Search in High Dimensions", IEEE Trans. on PAMI, vol. 19, no. 9, pp. 989-1003, September 1997.
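The slicing idea of Document 8 can be illustrated by the following sketch, which is a simplification written for this description and not the published implementation. It assumes a Euclidean distance for the final comparison, uses a sorted list only for the first component, and requires Python 3.8 or later for math.dist. The name slicing_search and the sample data are hypothetical.

```python
import bisect
import math

def slicing_search(prototypes, x, eps):
    """Nearest prototype inside the hypercube of half-width eps around x, or None."""
    d = len(x)
    # Slice on component 0 with a sorted list and binary search.
    order = sorted(range(len(prototypes)), key=lambda n: prototypes[n][0])
    values = [prototypes[n][0] for n in order]
    lo = bisect.bisect_left(values, x[0] - eps)
    hi = bisect.bisect_right(values, x[0] + eps)
    candidates = order[lo:hi]
    # Trim the candidate list one component at a time (closed intervals).
    for j in range(1, d):
        candidates = [n for n in candidates
                      if x[j] - eps <= prototypes[n][j] <= x[j] + eps]
    if not candidates:
        return None, candidates
    best = min(candidates, key=lambda n: math.dist(x, prototypes[n]))
    return best, candidates

# Hypothetical sample data: five two-dimensional prototypes.
prototypes = [(0.0, 0.0), (1.0, 5.0), (1.1, 1.0), (4.0, 1.0), (0.8, 1.3)]
wide = slicing_search(prototypes, (1.0, 1.0), eps=0.5)
narrow = slicing_search(prototypes, (1.0, 1.0), eps=0.05)
```

With eps=0.5 two candidates survive and the nearest one is found, whereas with eps=0.05 no candidate survives at all, which illustrates the dependence of the result upon ε noted above.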
It is an object of the present invention to provide a data processing apparatus and method capable of retrieving data relevant to input data from a database having a large amount of data, at high speed.
According to one aspect, the present invention which achieves the object relates to a data processing apparatus comprising: a database storing a set of data of a vector format; list forming means for forming a list of data of the database arranged in an order of a value of each component of a vector, for each component; input means for inputting test data of a vector format; component selecting means for sequentially selecting each component of the vector format; data selecting means for sequentially selecting data in an ascending order of a difference value between the data and the test data from the list, for each component of the vector format; distance calculating means for calculating a distance in a whole space between the data selected by the data selecting means and the test data; retrieving means for retrieving a predetermined number of data pieces in an ascending order of a distance calculated by the distance calculating means; completion judging means for judging, from a difference of a component value between one data piece selected by the data selecting means and the test data, whether data selection by the data selecting means is to be continued or terminated; and distance calculating control means for controlling whether the distance calculating means is to calculate a distance in the whole space, in accordance with a distance in a partial space between the data selected by the data selecting means and the test data.
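The interplay of the completion judgment and the distance calculation control recited above can be illustrated by the following simplified sketch. It is not the claimed apparatus: for brevity it walks the sorted list of a single fixed component m instead of sequentially selecting each component, and it assumes a squared Euclidean distance. All names (partial_sq_dist, knn_with_pruning, n_full) and the sample data are hypothetical.

```python
import bisect
import heapq
import math

def partial_sq_dist(x, y, limit):
    """Accumulate squared differences; return None once the partial sum exceeds limit."""
    total = 0.0
    for a, b in zip(x, y):
        total += (a - b) ** 2
        if total > limit:      # distance in the partial space is already too large
            return None
    return total

def knn_with_pruning(prototypes, x, k, m=0):
    """Return the k nearest (distance, id) pairs and the number of full distances."""
    order = sorted(range(len(prototypes)), key=lambda n: prototypes[n][m])
    values = [prototypes[n][m] for n in order]
    hi = bisect.bisect_left(values, x[m])
    lo = hi - 1
    heap = []      # max-heap through negation: entries are (-squared distance, id)
    n_full = 0     # distances that were calculated over all components
    while lo >= 0 or hi < len(values):
        gap_lo = x[m] - values[lo] if lo >= 0 else math.inf
        gap_hi = values[hi] - x[m] if hi < len(values) else math.inf
        # Select data in ascending order of the component difference.
        if gap_lo <= gap_hi:
            gap, cand, lo = gap_lo, order[lo], lo - 1
        else:
            gap, cand, hi = gap_hi, order[hi], hi + 1
        limit = -heap[0][0] if len(heap) == k else math.inf
        # Completion judgment: the component difference alone excludes the rest.
        if gap * gap >= limit:
            break
        sq = partial_sq_dist(x, prototypes[cand], limit)
        if sq is None:         # abandoned in a partial space
            continue
        n_full += 1
        if len(heap) < k:
            heapq.heappush(heap, (-sq, cand))
        elif sq < limit:
            heapq.heapreplace(heap, (-sq, cand))
    return sorted((math.sqrt(-s), n) for s, n in heap), n_full

# Hypothetical sample data: four three-dimensional prototypes.
prototypes = [(1.0, 1.2, 1.0), (1.02, 9.0, 1.0), (0.95, 1.0, 1.05), (5.0, 1.0, 1.0)]
found, n_full = knn_with_pruning(prototypes, (1.0, 1.0, 1.0), k=1)
```

In this sample, one prototype is abandoned after its second component, one is never selected because of the completion judgment, and only two distances are calculated in the whole space.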
According to another aspect, the present invention which achieves the object relates to a data processing apparatus comprising: a database storing a set of data of a vector format; pre-processing means for calculating a square of a norm of each data piece in the database and forming a list of data arranged in an order of a value of each component of the vector, for each component; input means for inputting test data of the vector format and operating a metric tensor upon the test data; component selecting means for sequentially selecting each component of the vector format; data selecting means for sequentially selecting data in an ascending order of a difference value between the data and the test data from the list, for each component of the vector format; similarity calculating means for calculating a similarity in a whole space between the data selected by the data selecting means and the test data by using a square of a norm of the data; retrieving means for retrieving a predetermined number of data pieces in a descending order of the similarity calculated by the similarity calculating means; and similarity calculating control means for controlling whether the similarity calculating means is to calculate a similarity in the whole space, in accordance with a similarity in a partial space between the data selected by the data selecting means and the test data.
According to another aspect, the present invention which achieves the object relates to a data processing method comprising: a list forming step of forming a list of data in a database storing a set of data of a vector format, for each component of a vector, the data in the list being arranged in an order of a value of each component; an input step of inputting test data of a vector format; a component selecting step of sequentially selecting each component of the vector format; a data selecting step of sequentially selecting data in an ascending order of a difference value between the data and the test data from the list, for each component of the vector format; a distance calculating step of calculating a distance in a whole space between the data selected at the data selecting step and the test data; a retrieving step of retrieving a predetermined number of data pieces in an ascending order of a distance calculated at the distance calculating step; a completion judging step of judging, from a difference of a component value between one data piece selected at the data selecting step and the test data, whether data selection at the data selecting step is to be continued or terminated; and a distance calculating control step of controlling whether the distance calculating step is to calculate a distance in the whole space, in accordance with a distance in a partial space between the data selected at the data selecting step and the test data.
According to another aspect, the present invention which achieves the object relates to a data processing method comprising: a pre-processing step of calculating a square of a norm of each data piece in a database storing a set of data of a vector format and forming a list of data arranged in an order of a value of each component of the vector, for each component; an input step of inputting test data of the vector format and operating a metric tensor upon the test data; a component selecting step of sequentially selecting each component of the vector format; a data selecting step of sequentially selecting data in an ascending order of a difference value between the data and the test data from the list, for each component of the vector format; a similarity calculating step of calculating a similarity in a whole space between the data selected at the data selecting step and the test data by using a square of a norm of the data; a retrieving step of retrieving a predetermined number of data pieces in a descending order of the similarity calculated at the similarity calculating step; and a similarity calculating control step of controlling whether the similarity calculating step is to calculate a similarity in the whole space, in accordance with a similarity in a partial space between the data selected at the data selecting step and the test data.
According to a further aspect, the present invention which achieves the object relates to a computer-readable storage medium storing a program for controlling a computer to perform data processing, the program comprising codes for causing the computer to perform: a list forming step of forming a list of data in a database storing a set of data of a vector format, for each component of a vector, the data in the list being arranged in an order of a value of each component; an input step of inputting test data of a vector format; a component selecting step of sequentially selecting each component of the vector format; a data selecting step of sequentially selecting data in an ascending order of a difference value between the data and the test data from the list, for each component of the vector format; a distance calculating step of calculating a distance in a whole space between the data selected at the data selecting step and the test data; a retrieving step of retrieving a predetermined number of data pieces in an ascending order of a distance calculated at the distance calculating step; a completion judging step of judging, from a difference of a component value between one data piece selected at the data selecting step and the test data, whether data selection at the data selecting step is to be continued or terminated; and a distance calculating control step of controlling whether the distance calculating step is to calculate a distance in the whole space, in accordance with a distance in a partial space between the data selected at the data selecting step and the test data.
According to a further aspect, the present invention which achieves the object relates to a computer-readable storage medium storing a program for controlling a computer to perform data processing, the program comprising codes for causing the computer to perform: a pre-processing step of calculating a square of a norm of each data piece in a database storing a set of data of a vector format and forming a list of data arranged in an order of a value of each component of the vector, for each component; an input step of inputting test data of the vector format and operating a metric tensor upon the test data; a component selecting step of sequentially selecting each component of the vector format; a data selecting step of sequentially selecting data in an ascending order of a difference value between the data and the test data from the list, for each component of the vector format; a similarity calculating step of calculating a similarity in a whole space between the data selected at the data selecting step and the test data by using a square of a norm of the data; a retrieving step of retrieving a predetermined number of data pieces in a descending order of the similarity calculated at the similarity calculating step; and a similarity calculating control step of controlling whether the similarity calculating step is to calculate a similarity in the whole space, in accordance with a similarity in a partial space between the data selected at the data selecting step and the test data.
Other objectives and advantages besides those discussed above shall be apparent to those skilled in the art from the description of preferred embodiments of the invention which follows. In the description, reference is made to accompanying drawings, which form a part thereof, and which illustrate an example of the invention. Such example, however, is not exhaustive of the various embodiments of the invention, and therefore reference is made to the claims which follow the description for determining the scope of the invention.