1. Field of the Invention
The present invention relates to a data processing apparatus for calculating similarity among vectors and a method therefor. More particularly, the present invention relates to a data processing apparatus for outputting a predetermined number of data elements on the basis of similarity with an inquiry vector, and a method therefor.
2. Description of the Related Art
Distance is conventionally used as a measure of similarity among vectors. A common problem is, given a predetermined key vector, to extract the L vectors closest to the key vector from a predetermined set of vectors. In such a problem, if the key vector is compared with every vector in the set, the computational complexity reaches O(MN), where M is the dimension of the vectors and N is the number of vectors in the set.
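The exhaustive comparison described above can be sketched as follows. This is a minimal illustration, assuming squared Euclidean distance and a dictionary mapping vector IDs to component tuples; the function name `nearest_brute_force` is hypothetical and does not come from the cited art.

```python
import heapq

def nearest_brute_force(vectors, key, L):
    """Return the IDs of the L vectors closest to `key`.

    `vectors` maps an ID to a tuple of M components (assumed layout).
    Every one of the N vectors is compared with the key, and each
    comparison touches all M components, hence O(MN) distance work.
    """
    def sq_dist(v):
        # Squared Euclidean distance; the ordering matches true distance.
        return sum((a - b) ** 2 for a, b in zip(v, key))

    scored = ((sq_dist(v), vid) for vid, v in vectors.items())
    # Selecting the L smallest adds O(N log L) on top of the O(MN) scan.
    return [vid for _, vid in heapq.nsmallest(L, scored)]
```

The cost of this scan grows with both the dimension and the size of the set, which is what motivates the structured approaches discussed next.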
Accordingly, several high-speed algorithms for distance calculations have been proposed. The strategy common to these high-speed algorithms is to convert the data into a structured form in advance in order to reduce the computational complexity during distance calculations.
For example, in one method a sorting list is created for each axis on the basis of the component values of the vectors along that axis. When a vector serving as a key is given, the axes are ranked according to an appropriate priority, and the location of the component value of the key vector is specified in the sorting list of the axis having the highest priority. Starting from the vectors stored nearest to that location and proceeding in sequence, each vector is identified by its ID and its distance to the key vector is calculated. Distances to all the vectors must be calculated to obtain accurate results. However, if the ordering by the component value of the selected axis reflects the actual distance between the vectors well, satisfactory results can be expected with a smaller number of calculations.
In this method, only a number of calculations on the order of O(N log₂ N) for structuring the data and on the order of O(L log₂ N) for comparison computations with L vectors are required. In addition to this method, called a "projection method", there are methods using a k-d tree and derivative versions thereof, and the order of the computational complexity during pre-processing and retrieval is nearly the same.
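The projection method can be sketched as follows. This is an illustrative sketch only: the axis is passed in rather than chosen by a priority rule, and the `budget` parameter (how many nearby candidates are examined) is an assumption introduced here to stand for the "smaller number of calculations" mentioned above.

```python
import bisect

def build_sorting_lists(vectors):
    """Pre-processing: one list per axis, sorted by component value."""
    dims = len(next(iter(vectors.values())))
    return [sorted((v[axis], vid) for vid, v in vectors.items())
            for axis in range(dims)]

def projection_search(vectors, lists, key, axis, L, budget):
    """Examine up to `budget` vectors nearest to `key` along `axis`,
    then return the L of them closest to `key` by true distance."""
    lst = lists[axis]
    # Locate the key's component value in the sorted list (binary search).
    pos = bisect.bisect_left(lst, (key[axis],))
    lo, hi = pos - 1, pos
    candidates = []
    while len(candidates) < budget and (lo >= 0 or hi < len(lst)):
        # Step outward, always taking the neighbor closer along this axis.
        take_lo = hi >= len(lst) or (
            lo >= 0 and key[axis] - lst[lo][0] <= lst[hi][0] - key[axis])
        if take_lo:
            candidates.append(lst[lo][1])
            lo -= 1
        else:
            candidates.append(lst[hi][1])
            hi += 1

    def sq_dist(vid):
        return sum((a - b) ** 2 for a, b in zip(vectors[vid], key))

    return sorted(candidates, key=sq_dist)[:L]
```

With `budget` equal to the size of the set the result is exact; with a smaller budget the result is only as good as the chosen axis is at reflecting true distance.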
Although the above-described conventional technologies relate to distance calculations of vectors, there are cases in which a norm is effective as a measure for expressing similarity among vectors. For example, in "A Metadatabase System for Semantic Image Search by a Mathematical Model of Meaning", by Kiyoki Y., Kitagawa T., and Hayama T., in SIGMOD RECORD, Vol. 23, No. 4, December 1994 (hereinafter referred to as "Reference 1"), similarity with context vectors is calculated as described below. That is, a projection operator with respect to the representation space is generated on the basis of a context vector, and the norm of the vector in the subspace extracted by this projection operator is calculated, thereby defining the similarity with the context vector.
In "High-speed Algorithm for Semantic Image Search by a Mathematical Model of Meaning", by Miyahara, Kiyoki, and Kitagawa, in an Information Processing Society of Japan Research Report, Database System 113-41, Jul. 15, 1997 (hereinafter referred to as "Reference 2"), a high-speed algorithm for such similarity calculation has been proposed. This is a direct application of the projection method used in the above-mentioned distance calculation. That is, a sorting list with respect to each axis is created in advance. Then, when a context vector is given, the priority of each axis is determined on the basis of the component value of the context vector. Based on this priority of the axis (the priority of the sorting list) and the order in each list, similarity with the context vector is determined. In this method, the number of calculations for preprocessing is on the order of O(N log₂ N), and the number of comparison calculations is on the order of the number L of data items output as results.
However, the above-described conventional method of Reference 2 has the problems described below. These are described by referring to FIG. 2 which shows an example of a sorting list created by the conventional method of Reference 2.
In FIG. 2, each numeral indicates the ID number of a vector. Each row represents the sorting list for one axis. The nearer a row is to the top, the higher the priority of the corresponding axis, and within each list, the further to the left a vector is, the higher its priority.
In the method of Reference 2, the vector positioned at the highest order of the sorting list of the axis having the highest priority is first determined to have the highest similarity, that is, the vector of ID number 10 in FIG. 2. Next, the vector positioned at the second place of the same axis (i.e., the same row), that is, the vector of ID number 6 in FIG. 2, is assumed to have the second highest similarity. The vector having the third highest similarity is determined to be the vector positioned at the highest order of the sorting list of the axis having the second highest priority, that is, the vector of ID number 3 in the figure. As described above, in the method of Reference 2, since similarity is determined solely by position in the lists, there is a possibility that, for example, the order of ID number 6 and ID number 3 is reversed relative to their actual similarity.
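The possibility of such a reversal can be illustrated with a small sketch. The component values below are invented for illustration and are not taken from FIG. 2; the traversal rule (take `per_axis` entries from each list before moving to the next axis) is likewise an assumption, since the exact traversal of Reference 2 is not specified here. An inner product stands in for the actual similarity.

```python
def rank_by_list_position(vectors, context, L, per_axis=2):
    """Rank purely by axis priority and position in each sorting list."""
    # Assumed priority rule: larger context component means higher priority.
    axes = sorted(range(len(context)), key=lambda a: -context[a])
    result = []
    for axis in axes:
        ordered = sorted(vectors, key=lambda vid: -vectors[vid][axis])
        for vid in ordered[:per_axis]:
            if vid not in result:
                result.append(vid)
            if len(result) == L:
                return result
    return result

def rank_by_inner_product(vectors, context, L):
    """Reference ranking by the actual inner product with the context."""
    def ip(vid):
        return sum(a * b for a, b in zip(vectors[vid], context))
    return sorted(vectors, key=lambda vid: -ip(vid))[:L]

# Hypothetical two-dimensional data reusing the ID numbers from the text.
vectors = {10: (0.9, 0.5), 6: (0.8, 0.0), 3: (0.5, 0.9)}
context = (1.0, 0.9)
by_position = rank_by_list_position(vectors, context, 3)
by_similarity = rank_by_inner_product(vectors, context, 3)
```

With these invented values, the list-position ranking places ID 6 ahead of ID 3, whereas the inner product places ID 3 ahead of ID 6, because ID 3 has a large component on the second axis that the first list does not see.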
Accordingly, it is an object of the present invention to provide a data processing apparatus and method capable of creating, at a high speed, a part of a sorting list of a vector data set ordered by similarity with a given vector.
It is another object of the present invention to provide a data processing apparatus and method capable of retrieving a vector similar to an inquiry vector from a database at a high speed.
According to one aspect, the present invention which achieves these objectives relates to a data processing apparatus for extracting a predetermined number of data elements having a high similarity based on an inner product with an inquiry vector from a set of data in a vector form, the data processing apparatus comprising: a database for storing a set of data in a vector form; list creation means for creating a list of data such that the data of the database is arranged in a descending order of the intensity of each one component of a vector, respectively; input means for inputting an inquiry vector; score calculation means for adding with respect to each data, for all the components, a score based on a numerical value given in a descending order to the place of the list for each of the components, and the component of the inquiry vector corresponding to the component of the data with regard to the inner product; and output means for outputting the predetermined number of data elements on the basis of the score.
According to another aspect, the present invention which achieves these objectives relates to a data processing method for extracting a predetermined number of data elements having a high similarity based on an inner product with an inquiry vector from a set of data in a vector form stored in a database, the data processing method comprising: a list creation step of creating, for each component, a list of data in which data of the database is arranged in a descending order of the intensity of one component of a vector; an input step of inputting an inquiry vector; a score calculation step of calculating with respect to each data, for all components, a score based on a numerical value given in a descending order to the place of a list for each of the components, and the component of the inquiry vector corresponding to the component of the data with regard to the inner product; and an output step of outputting the predetermined number of data elements on the basis of the score.
According to still another aspect, the present invention which achieves these objectives relates to a computer-readable storage medium storing a data processing program for controlling a computer to extract a predetermined number of data elements having a high similarity based on an inner product with an inquiry vector from a set of data in a vector form stored in a database, the program comprising codes for causing the computer to perform: a list creation step of creating, for each component, a list of data in which data of the database is arranged in a descending order of the intensity of one component of a vector; an input step of inputting an inquiry vector; a score calculation step of calculating with respect to each data, for all components, a score based on the numerical value given in a descending order to the place of a list for each of the components, and the component of the inquiry vector corresponding to the component of the data with regard to the inquiry vector; and an output step of outputting the predetermined number of data elements on the basis of the score.
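The list creation, score calculation, and output steps recited in the aspects above can be sketched roughly as follows. This is a loose illustration under stated assumptions, not the embodiment itself: the numerical value given to each place is assumed to be N for the first place down to 1 for the last, and the score contribution is assumed to be that value multiplied by the corresponding inquiry vector component. The function names are hypothetical.

```python
def create_lists(vectors):
    """List creation step: for each component, IDs in descending
    order of that component's value."""
    dims = len(next(iter(vectors.values())))
    return [sorted(vectors, key=lambda vid, a=axis: -vectors[vid][a])
            for axis in range(dims)]

def top_by_score(lists, inquiry, L):
    """Score calculation and output steps (assumed scoring rule)."""
    scores = {}
    for axis, ordered in enumerate(lists):
        n = len(ordered)
        for place, vid in enumerate(ordered):
            # Place value descends from n to 1; weight by the inquiry
            # vector's component on the same axis (assumption).
            scores[vid] = scores.get(vid, 0.0) + inquiry[axis] * (n - place)
    # Output the predetermined number L of data elements by score.
    return sorted(scores, key=lambda vid: -scores[vid])[:L]
```

Because the score accumulates contributions from every component, a vector that ranks moderately on several heavily weighted axes can outrank one that ranks highly on a single axis. This sketch visits every entry of every list for clarity and therefore does not itself show any speed advantage.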
Other objectives and advantages besides those discussed above shall be apparent to those skilled in the art from the description of a preferred embodiment of the invention which follows. In the description, reference is made to accompanying drawings, which form a part thereof, and which illustrate an example of the invention. Such example, however, is not exhaustive of the various embodiments of the invention, and therefore reference is made to the claims which follow the description for determining the scope of the invention.