Systems that process media data such as image data and speech data have come into wide use in recent years. Demand for such systems is growing for several reasons, for example, the increased processing capability of computers, the development of analysis techniques, and the demand for safety in monitoring systems and the like. Such a media data analysis system executes processes such as extraction of a region including a person from media data and conversion of speech data into text data.
Search for similar data is regarded as a challenging function for a system that processes media data. A media data analysis system needs to retrieve data similar to query data from a large amount of stored data, in use cases such as search for a similar image and determination of the similarity of speech. Similar data mentioned herein is defined for each data type, and the method for determining similarity and the similarity threshold vary accordingly.
A tree-structure index, which is generally used for lower dimensions, may cause a problem when used to search for similar data in high-dimensional data like media data. A tree-structure index such as a KD-Tree is commonly used as a multidimensional index. A tree-structure index enables search in logarithmic time, which is effective enough when the number of dimensions is not high. However, in a case where a tree-structure index is used for high-dimensional data, there arises a problem that the divided regions become sparse and efficient search cannot be performed. This problem is pointed out in Non-Patent Document 3, for example.
Further, Non-Patent Document 1 proposes an MLR index, which is an index having a ring structure, for similarity search in high-dimensional data. This MLR index can be used in a metric space defined by data points and a distance function. Use of this index structure makes it possible to efficiently find data similar to a query point with a certain probability.
In the MLR index, each data point has a ring structure that is layered according to distance. The rings are layered so that their radii increase exponentially from the center. Each ring contains (k+1) ring points as shown in FIG. 1 of Non-Patent Document 1. Among the ring points, k ring points are reference points to be compared and/or searched. Search for a similar point is performed by comparing distances between the reference points and a query point.
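The exponential layering of the rings can be illustrated by the following minimal sketch. The function name ring_number and the parameters alpha (radius of the innermost ring) and base (growth factor of the radii) are assumptions introduced only for illustration and are not taken from Non-Patent Document 1. The sketch assumes that ring 0 covers distances up to alpha and that ring i covers distances greater than alpha * base**(i-1) and up to alpha * base**i.

```python
def ring_number(distance, alpha=1.0, base=2.0):
    """Return the number of the ring whose distance band contains `distance`.

    Assumed layout (hypothetical parameters): ring 0 covers [0, alpha],
    ring i covers (alpha * base**(i-1), alpha * base**i].
    """
    if distance < 0:
        raise ValueError("a distance cannot be negative")
    if alpha <= 0 or base <= 1:
        raise ValueError("alpha must be positive and base must exceed 1")
    ring = 0
    outer_radius = alpha
    # Grow the outer radius exponentially until it covers the distance.
    while distance > outer_radius:
        outer_radius *= base
        ring += 1
    return ring
```

Because the radii grow exponentially, the number of rings needed to cover a given distance grows only logarithmically with that distance.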
The one remaining ring point other than the reference points is a spare point. Unlike the reference points, the spare point is not a target for comparison at the time of search, but it is a candidate when the reference points are selected from among the ring points. Such a structure enables search for a point similar to a query point in logarithmic time.
Search with a ring-structure index is performed in the following manner. First, a data point is selected at random, and the ring-structure index held by that data point is selected. Next, the distance between a query point and the data point is calculated, and the ring including the query point is selected based on the distance. Then, the distances between the query point and the respective reference points within the selected ring are calculated, and the reference point that is nearest to the query point is selected as a nearest neighbor reference point. In a case where the distance between the nearest neighbor reference point and the query point is more than the distance between the data point in the center and the query point, the data point is returned as a similar point. On the other hand, in a case where the distance between the nearest neighbor reference point and the query point is within a predetermined distance, the reference point is returned as a similar point. In a case other than the above cases, the nearest neighbor reference point is selected as the next data point, and the search process is continued. Because the distance to the query point becomes shorter at every step, the search converges as long as the number of data points is finite.
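The search procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Non-Patent Document 1 or 2: the names ring_number, build_index, and search, the parameters alpha, base, and k, the use of Euclidean distance, and the toy index builder that simply keeps the first k points found in each ring are all hypothetical. In addition, the sketch returns the current data point when the distances are equal, not only when the nearest neighbor reference point is farther, so that the loop is guaranteed to terminate.

```python
import math
import random


def ring_number(distance, alpha=1.0, base=2.0):
    """Ring 0 covers [0, alpha]; ring i covers (alpha*base**(i-1), alpha*base**i]."""
    ring, outer_radius = 0, alpha
    while distance > outer_radius:
        outer_radius *= base
        ring += 1
    return ring


def build_index(points, k=3):
    """Toy builder (assumption): keep the first k points that fall into each ring."""
    index = {}
    for pid, p in points.items():
        rings = {}
        for oid, o in points.items():
            if oid != pid:
                members = rings.setdefault(ring_number(math.dist(p, o)), [])
                if len(members) < k:
                    members.append(oid)
        index[pid] = rings
    return index


def search(points, index, query, threshold, start=None, rng=random):
    """Return the id of a data point regarded as similar to `query`."""
    # Step 1: start from a randomly selected data point unless one is given.
    current = start if start is not None else rng.choice(list(points))
    while True:
        # Step 2: select the ring of the current point that contains the query.
        d_current = math.dist(points[current], query)
        refs = index[current].get(ring_number(d_current), [])
        if not refs:
            return current  # no reference point to move to
        # Step 3: find the nearest neighbor reference point in that ring.
        nearest = min(refs, key=lambda r: math.dist(points[r], query))
        d_nearest = math.dist(points[nearest], query)
        # Step 4: no reference point is closer than the current data point.
        if d_nearest >= d_current:
            return current
        # Step 5: the reference point is within the predetermined distance.
        if d_nearest <= threshold:
            return nearest
        # Step 6: otherwise continue the search from the reference point.
        current = nearest
```

For example, with points = {"a": (0.0, 0.0), "b": (4.0, 0.0), "c": (8.0, 0.0), "d": (9.0, 0.0)} and index = build_index(points), the call search(points, index, (9.2, 0.0), 0.5, start="a") moves from "a" to the reference point "d" in the outermost ring of "a" and returns "d". Every step that does not return strictly reduces the distance to the query point, which is why the loop terminates for a finite set of data points.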
For creation of the ring-structure index used for this search, the method for selecting the k reference points contained in a ring is important. For example, in a case where the method for selecting reference points is inappropriate and a reference point that is sufficiently near a query point cannot be found, a similar point cannot be found. Therefore, it is necessary to select k reference points from the (k+1) ring points so that the distance between any query point and its nearest neighbor reference point becomes small. For realizing this, in Non-Patent Document 2, reference points are selected so that, defining “dij” as the distance between a reference point i and a reference point j, the volume of the k-dimensional polyhedron composed of the k k-dimensional points [di1, di2, . . . , dik] and the origin becomes maximum. The structure of this index is based on the structure used in Non-Patent Document 2, and methods for searching an index and constructing an index are described in detail in Non-Patent Document 2.
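The volume-based selection described above can be sketched as follows. This is an illustrative sketch under stated assumptions and not the procedure of Non-Patent Document 2 itself: the names determinant, simplex_volume, and select_reference_points are hypothetical, Euclidean distance is assumed as the distance function, and the candidate subsets are simply compared exhaustively. The sketch relies on the fact that the volume of the polyhedron spanned by the origin and k points in k dimensions equals the absolute value of the determinant of the k-by-k matrix of those points divided by k!.

```python
import math
from itertools import combinations


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [[float(v) for v in row] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            return 0.0  # singular matrix
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for c in range(col, n):
                m[row][c] -= factor * m[col][c]
    return det


def simplex_volume(candidates):
    """Volume of the polyhedron formed by the origin and the rows [di1, ..., dik]."""
    k = len(candidates)
    # Row i holds the distances dij from candidate i to every candidate j.
    rows = [[math.dist(p, q) for q in candidates] for p in candidates]
    return abs(determinant(rows)) / math.factorial(k)


def select_reference_points(ring_points, k):
    """Pick the k ring points whose distance rows span the largest volume."""
    return list(max(combinations(ring_points, k), key=simplex_volume))
```

With (k+1) ring points there are only (k+1) subsets of size k to compare, and the one ring point left out of the selected subset corresponds to the spare point. For the ring points (0, 0), (0.1, 0), (10, 0), and (0, 10) with k = 3, the sketch keeps (0, 0), (10, 0), and (0, 10) and leaves out (0.1, 0), which nearly duplicates (0, 0).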
On the other hand, the amount of data to be searched has been continuously increasing in recent years, and expansion of databases by scale out is required. The amount of media data is large. Therefore, in the case of intending to process a large amount of data by expanding the capability of a single server, namely, by scale up, either no server exists that can process the data or, even if such a server exists, it is extremely expensive. Accordingly, it is necessary to process a large amount of data by making a plurality of servers operate in parallel, namely, by scale out.
Non-Patent Document 1: Rahul Malik, Sangkyum Kim, Xin Jin, Chandrasekar Ramachandran, Jiawei Han, Indranil Gupta, and Klara Nahrstedt, “MLR-index: An Index Structure for Fast and Scalable Similarity Search in High Dimensions,” SSDBM 2009 Proceedings of the 21st International Conference on Scientific and Statistical Database Management, Springer-Verlag Berlin, Heidelberg, Jun. 2-4, 2009, pp. 167-184
Non-Patent Document 2: Bernard Wong, Aleksandrs Slivkins, and Emin Gün Sirer, “Meridian: A Lightweight Network Location Service without Virtual Coordinates,” SIGCOMM '05 Proceedings of the 2005 conference on Applications, technologies, architectures, and protocols for computer communications, ACM New York, N.Y., USA, Aug. 22-26, 2005, pp. 85-96
Non-Patent Document 3: Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft, “When is nearest neighbor meaningful?,” ICDT '99 Proceedings of the 7th International Conference on Database Theory, Springer-Verlag London, UK, Jan. 10-12, 1999, Lecture Notes in Computer Science, Volume 1540, pp. 217-235
However, storing the index suggested in Non-Patent Document 1 as it is across a plurality of servers in a distributed manner causes a problem that communication between the servers increases and the efficiency of data search decreases. This is because, in a case where the data is stored in a distributed manner as it is and the reference points are held in different servers, communication between servers occurs each time a reference point is traced during search, and the cost of communication increases.