This invention relates to a data processing apparatus and method for calculating the degree of similarity between vectors and, more particularly, to a data processing apparatus and method for outputting a prescribed number of items of data based upon degree of similarity with a query vector.
In general, distance is used as a means of calculating the degree of similarity between vectors. When a certain key vector has been given, a problem often dealt with is the extraction of L-number of vectors close to the key vector from a set of predetermined vectors. If a comparison operation with respect to all vectors in the set is performed, the amount of computation is on the order of O(MN), where M represents the number of vector dimensions and N the number of elements in the set of vectors.
Accordingly, several high-speed algorithms for distance computation have been proposed. The strategy common to these high-speed algorithms is to structuralize the data in advance and reduce the complexity of computation when distance is calculated.
For example, a method of creating sorting lists based upon the values of components along the respective axes of vectors involves sorting axes by an appropriate priority when a key vector has been given and specifying where the component values of the key vector are located in the sorting list of the axis situated at the highest ranking position. Then, on the basis of the vector IDs, the distances between the vectors and the key vector are calculated in order starting from a vector that has been stored at a nearby position. In order to obtain accurate results, the distance between all vectors and the key vector must be calculated. However, if the order relating to the component values of the selected axes approximately reflects the actual distances between the vectors, then satisfactory results can be attained with a small amount of calculation.
According to this method, an amount of calculation on the order of O(Nlog.sub.2 N) suffices for preprocessing (structuralization of the data) and on the order of O(Llog.sub.2 N) for the comparison with L-number of vectors. Though there are a K-d tree and its derivative versions in addition to this method, which is referred to as a projection method, the orders of the amounts of calculation for the preprocessing and the comparison operation are the same.
The prior art described above concerns the calculation of distances between vectors. However, there are cases where a norm is effective as a gauge for expressing the degree of similarity between vectors. For example, in "A Metadatabase System for Semantic Image Search by a Mathematical Model of Meaning", by Y. Kiyoki, T. Kitagawa and T. Hayama, SIGMOD RECORD, Vol. 23, No. 4 (December 1994) (referred to as "Reference 1" below), the degree of similarity with a context vector is calculated as follows: A projection operator with respect to representation space is generated based upon a context vector, and the norm of a vector in subspace, which has been extracted by this projection operator, is calculated, thereby defining degree of similarity with the context vector.
Further, in "A High-speed Algorithm for a Semantic Image Search by a Mathematical Model of Meaning", by Miyahara, Shimizu and Kitagawa, Research Report of the Information Processing Society of Japan, Database System 113-41 (Jul. 15, 1997) (referred to as "Reference 2" below), a high-speed calculation algorithm for such calculation of degree of similarity is proposed. This is a direct application of the projection method in the calculation of distance. More specifically, sorting lists concerning each of the axes are created in advance. When a context vector is given, the priorities of the axes are decided based upon the component values of the context vector. The degree of similarity with the context vector is decided based upon the priorities of the sorting lists and the ranking in each list. With this method, the amount of calculation in preprocessing is on the order of O(Nlog.sub.2 N), and the comparison calculation is on the order of the number L of items of data output.
The conventional method of Reference 2 involves certain problems. This will be described in conjunction with FIG. 2, which is a diagram illustrating an example of a sorting list created according to the conventional method of Reference 2.
In FIG. 2, each member represents vector ID. Each row forms a sorting list and rows are sorted from the top to the bottom in accordance with priority of axes. Also, in each sorting list, vector IDs are listed from the left side to the right side in accordance with its rank.
According to the method of Reference 2, first it is judged that a vector situated at the top of a sorting list of an axis for which the priority is highest has the highest degree of similarity. In FIG. 2, this vector has the ID number 10. Next, the vector situated at the second rank of the same axis, namely the vector whose ID number is 6 in FIG. 2, is judged to be the vector having the second highest degree of similarity. The vector having the third highest degree of similarity is decided upon as being the vector situated at the top of the sorting list of the axis having the second highest priority. This vector has the ID number 3 in FIG. 2. Thus, with the method of Reference 2, degrees of similarity are decided in the order of position in the lists. As a result, there is a possibility that the order of ID number 6 and ID number 3 will be the reverse of that of the actual degrees of similarity.