In recent years, as databases of multimedia information such as text, images, and sound have been built and POS systems and the like have spread, techniques for efficiently executing search, classification, tendency analysis, and the like on a vector database have been intensively researched and developed for computer systems such as multimedia database systems and data mining systems. Such a vector database is a collection of several hundred thousand to several million pieces of vector data, each having several tens to several hundreds of dimensions.
For example, for a newspaper article database in which a large number of pieces of newspaper article data are accumulated, a dictionary of W words is used to extract an appearance frequency fk of each word k in the dictionary from each newspaper article, and each newspaper article is represented as a set of an identification number i and a W-dimensional real vector (f1, f2, . . . , fW). This vector is converted by principal component analysis, and the N (N<W) principal components are obtained and used as vector data. An inner product of the vector data corresponding to a designated newspaper article and the vector corresponding to each other newspaper article in the database is calculated, and the newspaper article whose vector gives the largest inner product is obtained, so that high-precision similar article search is possible. U.S. Pat. No. 4,839,853 discloses a document searching method in which such vector data is used.
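The similar article search described above can be sketched as follows. This is a minimal illustration only: the dictionary, the article texts, and the function names are hypothetical, and the dimensionality-reduction step is omitted, so the raw W-dimensional frequency vectors are compared directly.

```python
from collections import Counter


def term_frequency_vector(text, dictionary):
    """Return (f1, ..., fW): how often each dictionary word appears in text."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in dictionary]


def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))


def most_similar(query_id, vectors):
    """Return the id of the other article with the largest inner product."""
    query = vectors[query_id]
    candidates = (i for i in vectors if i != query_id)
    return max(candidates, key=lambda i: inner_product(query, vectors[i]))


# Hypothetical toy data: a 5-word dictionary and four short "articles".
dictionary = ["election", "vote", "market", "stock", "rain"]
articles = {
    1: "election vote vote election result",
    2: "stock market falls as market reacts",
    3: "vote count continues after election",
    4: "rain expected rain tomorrow",
}
vectors = {i: term_frequency_vector(t, dictionary) for i, t in articles.items()}
print(most_similar(1, vectors))
```

This exhaustive scan computes one inner product per stored article, which is the cost that an index is meant to avoid for large databases.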
Moreover, for a photograph database in which a large number of pieces of photograph image data are accumulated, each piece of photograph data is subjected to a two-dimensional Fourier transform, the N main Fourier components fk are extracted as the vector data, and each piece of photograph data is represented by a set of a photograph number i and an N-dimensional real vector (f1, f2, . . . , fN). A distance (size of a difference between two vectors) between the vector data corresponding to a designated photograph and the vector corresponding to each other piece of photograph data in the database is calculated, and the photograph data whose vector gives the smallest distance is obtained, so that high-precision similar photograph search is possible. Furthermore, for example, several pieces of typical photograph data belonging to each of different categories such as “portrait”, “landscape photograph”, and “close-up photography of a flower” are presented as classification conditions, an average characteristic vector of each category is calculated, and each photograph data vector is assigned the category whose average characteristic vector is at the shortest distance from it, so that the remaining photograph data can automatically be classified into the aforementioned three categories.
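The category assignment described above amounts to a nearest-average-vector classification, which can be sketched as follows. The three-dimensional feature vectors and all names are hypothetical stand-ins for the Fourier-component vectors; the feature extraction itself is not shown.

```python
import math


def distance(u, v):
    """Euclidean distance: the size of the difference between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def classify(vector, centroids):
    """Return the category whose average vector is closest to `vector`."""
    return min(centroids, key=lambda name: distance(vector, centroids[name]))


# Hypothetical typical examples presented as classification conditions.
examples = {
    "portrait": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "landscape": [[0.1, 0.9, 0.2], [0.2, 0.8, 0.1]],
    "flower close-up": [[0.0, 0.2, 0.9], [0.1, 0.1, 0.8]],
}
centroids = {name: mean_vector(vs) for name, vs in examples.items()}
print(classify([0.7, 0.2, 0.1], centroids))
```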
Since such uses require an efficient similarity searching method for remarkably high-dimensional vectors of several tens to several hundreds of dimensions, various methods have been researched. For example, a high-dimensional vector index preparing method and similarity searching method using a multidimensional search tree (SR-tree) are disclosed in “The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries”, Proceedings of the SIGMOD '97, ACM (1997) by Norio Katayama and Shin'ichi Satoh. Moreover, a high-dimensional vector index preparing method and similarity searching method based on Voronoi division are disclosed in “Near Neighbor Search in Large Metric Spaces”, Proceedings of the VLDB '95, Morgan Kaufmann Publishers (1995) by Sergey Brin. Furthermore, a high-dimensional vector index preparing method and similarity searching method based on a data partitioning technique called the “pyramid technique” are disclosed in “The Pyramid-Technique: Towards Breaking the Curse of Dimensionality”, Proceedings of the SIGMOD '98, ACM (1998) by Stefan Berchtold, Christian Böhm, and Hans-Peter Kriegel.
However, each of these conventional vector index preparing methods and similar vector searching methods fails to satisfy at least one of the following five conditions, and therefore cannot be applied to a broad range of applications.
1) High-speed search is possible even when the vector is of several hundreds of dimensions.
2) During similarity searching, either of two similarity measures, the distance between the vectors or the vector inner product, can be selected.
3) A similarity search of the type “obtain the L most similar vectors” can be performed. Furthermore, even when L is relatively large (several tens to several hundreds), search processing is not excessively delayed.
4) A similarity search range such as “inner product of 0.6 or more” can be designated.
5) A calculation amount required for index preparation is in a practical range (i.e., the index can be prepared in a time proportional to the vector data amount n, or in a time proportional to n*log(n)).
Concretely, the method using the SR-tree does not satisfy the above 1) and 2), the method based on Voronoi division does not satisfy 2) and 5), and the method using the pyramid technique does not satisfy 2) and 3).
A vector index preparing method, a similar vector searching method, and apparatuses for the methods according to the present invention solve these problems of the conventional techniques. A high-dimensional vector is decomposed into a plurality of partial vectors, and the direction and size of each partial vector are represented and recorded by a set of a belonging region number defined by a center vector, an angle (declination) formed with the center vector, and a norm division indicating the norm. Therefore, the search object range of the vector index can precisely be limited for any query vector. Moreover, by accumulating the difference between a partial inner product lower limit value (upper limit value of a partial square distance) and the actual partial inner product (partial square distance), the search result can be determined efficiently by a branch-and-bound technique. Therefore, a vector index preparing method and a similar vector searching method are provided which satisfy all of the above 1) to 5) and which can be applied to a broad range of applications.
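The representation of each partial vector by a region number, a declination division, and a norm division can be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed procedure: the center vectors and the division boundaries are fixed by hand here (the text prepares division tables by tabulating the data distributions), the zero-vector convention is arbitrary, and all function and constant names are hypothetical.

```python
import math
from bisect import bisect_left


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    return math.sqrt(dot(v, v))


def split(vector, part_dim):
    """Decompose a vector into consecutive partial vectors of part_dim components."""
    return [vector[i:i + part_dim] for i in range(0, len(vector), part_dim)]


def encode_partial(p, centers, angle_bounds, norm_bounds):
    """Return (region number, declination division, norm division) for p."""
    length = norm(p)
    if length == 0.0:
        return (0, 0, 0)  # assumed convention: a zero vector has no direction
    cosines = [dot(p, c) / (length * norm(c)) for c in centers]
    # The belonging region is the one whose center vector is closest in direction.
    region = max(range(len(centers)), key=cosines.__getitem__)
    angle = math.acos(max(-1.0, min(1.0, cosines[region])))
    # Division k covers values in (bounds[k-1], bounds[k]].
    return (region, bisect_left(angle_bounds, angle), bisect_left(norm_bounds, length))


def encode_vector(vector, part_dim, centers, angle_bounds, norm_bounds):
    return [encode_partial(p, centers, angle_bounds, norm_bounds)
            for p in split(vector, part_dim)]


# Hypothetical tables for two-dimensional partial vectors.
CENTERS = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
ANGLE_BOUNDS = [math.pi / 16, math.pi / 8, math.pi / 4]
NORM_BOUNDS = [1.0, 2.5, 5.0]

print(encode_vector([3.0, 0.1, 0.0, 2.0], 2, CENTERS, ANGLE_BOUNDS, NORM_BOUNDS))
```

Because each partial vector is reduced to three small integers, a query can be compared against division boundaries rather than against every stored component, which is what allows the search object range to be limited.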
To solve the aforementioned problem, according to a first aspect of the present invention, there are provided a vector index preparing method and apparatus comprising: means for calculating a partial vector; means for tabulating a norm distribution and preparing a norm division table; means for calculating a region number; means for tabulating a declination distribution and preparing a declination division table; means for calculating a norm division number; means for calculating a declination division number; means for calculating index data; and means for constituting an index. Thereby, even when the vector is of several hundreds of dimensions, a high-speed search is possible with respect to a vector database whose direction and norm distributions are not known in advance. During similarity searching, either of two similarity measures, the distance between vectors or the vector inner product, can be selected. A similarity search of the type “obtain the L most similar vectors” can be performed. Furthermore, even when L is relatively large (several tens to several hundreds), search processing is not excessively delayed. A similarity search range such as “inner product of 0.6 or more” can be designated. Additionally, the calculation amount required for index preparation is in a practical range. Such a vector index can be prepared efficiently.
Moreover, in addition to the first aspect, the vector index preparing method and apparatus according to a second aspect of the present invention further comprise means for calculating a component division number. Thereby, in addition to the effect of the first aspect, the calculation error caused by quantization of a component is minimized and the size of the vector index to be prepared can remarkably be reduced.
Furthermore, according to a third aspect of the present invention, there are provided a similar vector searching method and apparatus comprising: means for calculating a partial query condition; means for preparing a search object range; means for searching an index; means for calculating an inner product difference upper limit; and means for determining a similarity search result. An accumulated value of a partial inner product difference is calculated and used as a clue to the similarity search. Thereby, even when the vector is of several hundreds of dimensions, a high-speed search is possible with respect to a vector database. A similarity search of the type “obtain the L most similar vectors” can be performed. Furthermore, even when L is relatively large (several tens to several hundreds), search processing is not excessively delayed. A similarity search range such as “inner product of 0.6 or more” can be designated. Additionally, a similar vector search using the inner product as the similarity measure can be performed efficiently.
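The idea of using accumulated partial inner products to discard candidates early can be illustrated with a generic bounding scheme. The sketch below is not the procedure of this aspect: it swaps in a simple Cauchy-Schwarz bound (each partial inner product is at most the product of the partial norms), scans the raw vectors rather than an index, and uses hypothetical names throughout.

```python
import heapq
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    return math.sqrt(dot(v, v))


def split(vector, part_dim):
    return [vector[i:i + part_dim] for i in range(0, len(vector), part_dim)]


def top_l_inner_product(query, database, l, part_dim):
    """Return the l (score, id) pairs with the largest inner product, best first."""
    q_parts = split(query, part_dim)
    q_norms = [norm(p) for p in q_parts]
    best = []  # min-heap holding the current top-l; best[0] is the l-th best
    for vid, vec in database.items():
        parts = split(vec, part_dim)
        # Upper limit of each partial inner product (Cauchy-Schwarz).
        bounds = [qn * norm(p) for qn, p in zip(q_norms, parts)]
        score, pruned = 0.0, False
        for k, (qp, p) in enumerate(zip(q_parts, parts)):
            score += dot(qp, p)
            remaining = sum(bounds[k + 1:])
            # Abandon once even the most optimistic total cannot reach the top-l.
            if len(best) == l and score + remaining < best[0][0] - 1e-12:
                pruned = True
                break
        if pruned:
            continue
        if len(best) < l:
            heapq.heappush(best, (score, vid))
        elif score > best[0][0]:
            heapq.heapreplace(best, (score, vid))
    return sorted(best, reverse=True)


# Hypothetical toy database of four-dimensional vectors.
database = {
    1: [1.0, 0.0, 0.0, 0.0],
    2: [0.5, 0.5, 0.5, 0.5],
    3: [0.0, 0.0, 1.0, 1.0],
    4: [-1.0, 0.0, 0.0, 0.0],
}
print(top_l_inner_product([2.0, 0.0, 1.0, 0.0], database, l=2, part_dim=2))
```

In this toy run, vectors 3 and 4 are abandoned after their first partial vector, because their best possible totals already fall below the current second-best score. In a real index the partial norms would be stored in advance rather than recomputed per query.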
Moreover, according to a fourth aspect of the present invention, there are provided a similar vector searching method and apparatus comprising: means for calculating a partial query condition; means for preparing a search object range; means for searching an index; means for calculating a square distance difference upper limit; and means for determining a similarity search result. An accumulated value of a partial square distance difference is calculated and used as a clue to the similarity search. Thereby, even when the vector is of several hundreds of dimensions, a high-speed search is possible with respect to the vector database. A similarity search of the type “obtain the L most similar vectors” can be performed. Furthermore, even when L is relatively large (several tens to several hundreds), search processing is not excessively delayed. A similarity search range such as “distance of 0.8 or less” can be designated. Additionally, a similar vector search using the distance as the similarity measure can be performed efficiently.
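For the distance measure, a range search with early abandoning can be sketched as follows. Since every partial square distance is non-negative, the accumulated value never decreases, so a candidate can be dropped as soon as it exceeds the square of the designated range. This is a generic illustration over raw vectors with hypothetical names, not the indexed procedure of this aspect.

```python
import math


def split(vector, part_dim):
    return [vector[i:i + part_dim] for i in range(0, len(vector), part_dim)]


def within_distance(query, database, radius, part_dim):
    """Return (distance, id) pairs for all vectors within `radius`, nearest first."""
    limit = radius * radius
    q_parts = split(query, part_dim)
    hits = []
    for vid, vec in database.items():
        total = 0.0
        for qp, p in zip(q_parts, split(vec, part_dim)):
            # Accumulate the partial square distance of this partial vector.
            total += sum((a - b) ** 2 for a, b in zip(qp, p))
            if total > limit:
                break  # already outside the range; skip the remaining parts
        else:
            hits.append((math.sqrt(total), vid))
    return sorted(hits)


# Hypothetical toy database of four-dimensional vectors.
database = {
    1: [0.0, 0.0, 0.0, 0.0],
    2: [0.5, 0.0, 0.0, 0.5],
    3: [1.0, 1.0, 1.0, 1.0],
    4: [0.3, 0.3, 0.3, 0.3],
}
print(within_distance([0.0, 0.0, 0.0, 0.0], database, radius=0.8, part_dim=2))
```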