1. Field of the Invention
The present invention relates to information similarity retrieval and more specifically to a system and method for retrieving points or a collection of points close to a point specified in a multidimensional space using a distance index.
2. Description of the Related Art
In the field of computers, a process called similarity retrieval is frequently performed. Similarity retrieval refers to a process of searching for information that is similar to or matches certain given information. One example is, when looking for a handbag, to search for pictures of similar handbags by presenting a picture of the desired handbag. Another example is, when a fingerprint is presented, to search for a person whose fingerprint matches or is similar to it, i.e., to search for a suspicious person.
In general, the similarity retrieval is performed by extracting two or more features (e.g., colors and shapes) from information (in the above examples, picture or fingerprint). That is, the features are represented as points in a multidimensional space. The distance between points is defined to conform to similarity, that is, so that the higher the similarity, the shorter the distance is. Similarity retrieval of a certain specified item of information is frequently replaced by a process of determining a point or a set of points close to a specified point in multidimensional space. The process of searching for places geographically close to each other just falls under this category. We shall refer to this, too, as similarity retrieval.
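As a minimal sketch of this representation, assuming the Euclidean distance and made-up three-component feature vectors (neither is prescribed by the description above, and the names `euclidean_distance`, `query`, `similar_item` and `dissimilar_item` are hypothetical), the following Python fragment shows an item of higher similarity lying at a shorter distance from the specified point:

```python
import math

def euclidean_distance(p, q):
    """Return the Euclidean distance between two points of equal dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical feature vectors (illustrative values only, not from the text).
query = (0.80, 0.10, 0.30)
similar_item = (0.75, 0.15, 0.30)
dissimilar_item = (0.10, 0.90, 0.60)

# The more similar item is the one at the shorter distance from the query.
d_similar = euclidean_distance(query, similar_item)
d_dissimilar = euclidean_distance(query, dissimilar_item)
```

Any other distance function could be substituted, provided that it decreases as similarity increases.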
The multidimensional space is also referred to as a vector space. Spaces having a considerable number of dimensions are also used. In the similarity retrieval, each dimension of the space represents a corresponding one of the features (feature parameters) of the information. A typical example is an n-dimensional Euclidean space (R^n, where R is the set of real numbers and x^y represents the y-th power of x), which includes {0, 1}^n and N^n (N is the set of natural numbers). {0, 1} is used to represent the presence or absence of a feature. N is used when each feature is represented by a number, such as a count.
As long as the distance is defined, the dimensions need not all represent the same kind of set. For example, as in the case of a space in which R, {0, 1} and N are mixed, each dimension may represent a different kind of set. The most commonly used space, however, is one in which every dimension represents the same kind of set. A point within the multidimensional space is also referred to as a vector; however, in the description which follows, we shall use the term “point” for ease of understanding.
In the similarity retrieval there are two representative requirements: one is to obtain the k information items most similar to a certain specified item of information (in terms of distance, the k points closest to a certain specified point), and the other is to obtain the information items having similarities above a certain value (in terms of distance, the points within a certain distance).
In the fingerprint example, the first requirement is to search for the k most suspicious persons, and the second requirement is to search for all persons whose degree of suspicion, expressed as a distance, falls within a certain threshold. In the former case, there is a possibility that persons who are not suspicious may also be retrieved. In the latter case, the retrieval result is empty when there is no suspicious person.
The similarity retrieval has been used in various media and has found extensive applications. Some examples of applications of the similarity retrieval are provided below.
(1) Image: Retrieval of images similar to a specific image. For example, given a specific image that shows the sky, images that are likely to show the sky are retrieved.
(2) Voice: Retrieval of voices similar to a specific voice. For example, the person who spoke is identified from an uttered voice.
(3) Text: Retrieval of text that contains a specific keyword.
(4) Character: Recognition of handwritten characters. That is, a determination is made as to which character a handwritten character is the most similar to.
(5) Map: Retrieval of tourist spots near a specific station.
As for the similarity retrieval problem, various approaches have been proposed heretofore. The most straightforward approach is, when a point is designated in a multidimensional space, to determine the distance from every point in the space to the designated point, sort the points according to that distance, and determine the k points closest to the designated point or the points within a specific distance. Because all of this computation is carried out at retrieval time, this approach is a dynamic method.
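The straightforward approach described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the Euclidean distance is assumed, the sample points are made up, and the function names (`distance`, `sorted_by_distance`, `k_nearest`, `within_distance`) are hypothetical.

```python
import math

def distance(p, q):
    """Euclidean distance (an assumption; any distance function would do)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def sorted_by_distance(points, query):
    """Compute the distance from every point to the query and sort by it."""
    return sorted(((distance(p, query), p) for p in points), key=lambda t: t[0])

def k_nearest(points, query, k):
    """First requirement: the k points closest to the query."""
    return [p for _, p in sorted_by_distance(points, query)[:k]]

def within_distance(points, query, radius):
    """Second requirement: all points within a given distance of the query."""
    return [p for d, p in sorted_by_distance(points, query) if d <= radius]

# Made-up sample data for illustration.
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (5.0, 5.0), (9.0, 9.0)]
query = (1.0, 2.0)

nearest_two = k_nearest(points, query, 2)           # always returns k points
close_points = within_distance(points, query, 0.5)  # may be empty
```

In this sample, `k_nearest` returns two points regardless of how far away they are, whereas `within_distance` returns an empty list because no point lies within 0.5 of the query. Every query examines every point, which is the cost discussed next.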
However, this dynamic method suffers from limitations as the number of objects to be retrieved increases. For example, when the objects to be retrieved comprise a million items of information, loading their feature parameters into main storage, calculating their distances and sorting them according to distance involve an overwhelming number of data input/output operations. For this reason, many static methods have been considered, in which an index that allows for high-speed similarity retrieval is created before the similarity retrieval is performed. Two typical examples are described below.
Each of those methods involves creating a hierarchically structured index and dividing a multidimensional space into hierarchically related regions to restrict the range of retrieval, thereby increasing the speed of retrieval.
(1) R-tree
This method is a natural extension of the B-tree, which is the well-known means for indexing one-dimensionally ordered data. A set of vectors (points) is represented by the minimum rectangular parallelepiped that encompasses it, which is referred to as an MBR (Minimum Bounding Rectangle), and a hierarchical structure like that of a B-tree is created based on such MBRs.
The R-tree is a height-balanced tree (every leaf is at the same depth) and, like the B-tree, has the excellent property of allowing access to each element with the same input/output count. It also has excellent dynamic characteristics: update processing does not require a large amount of time and, since the tree remains balanced, does not seriously degrade retrieval performance.
(2) Quadtree
This method involves dividing a space regularly according to a predetermined ratio. The space is thereby divided into nonoverlapping, independent subregions. Each subregion that contains vectors is divided recursively in the same manner, so that the space is indexed. The method is also used for image coding.
Further, various improved versions of the R-tree and quadtree have been proposed (Volker Gaede et al., “Multidimensional Access Methods”, ACM Computing Surveys, Vol. 30, No. 2, pp. 170-231, June 1998).
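The two building blocks described above, the MBR used by the R-tree and the recursive regular division used by the quadtree, can be sketched in Python as follows. This is a simplified illustration and not an implementation of either index: the R-tree hierarchy and its balancing are omitted, the quadtree is assumed to be two-dimensional with each axis halved at every level, and the names `minimum_bounding_rectangle`, `build_quadtree`, `leaf_points`, `capacity` and `max_depth` are hypothetical.

```python
def minimum_bounding_rectangle(points):
    """Return the (lower, upper) corners of the smallest axis-aligned box
    that encloses all of the given points."""
    lower = tuple(min(c) for c in zip(*points))
    upper = tuple(max(c) for c in zip(*points))
    return lower, upper

def build_quadtree(points, x0, y0, size, capacity=1, max_depth=16):
    """Index the square [x0, x0+size) x [y0, y0+size).

    A region holding more than `capacity` points is divided into four
    equal, nonoverlapping quadrants, and each quadrant is indexed
    recursively. `max_depth` guards against endless division when
    points coincide.
    """
    inside = [(x, y) for x, y in points
              if x0 <= x < x0 + size and y0 <= y < y0 + size]
    if len(inside) <= capacity or max_depth == 0:
        return {"origin": (x0, y0), "size": size,
                "points": inside, "children": []}
    half = size / 2
    children = [
        build_quadtree(inside, x0 + dx * half, y0 + dy * half, half,
                       capacity, max_depth - 1)
        for dy in (0, 1) for dx in (0, 1)
    ]
    return {"origin": (x0, y0), "size": size,
            "points": [], "children": children}

def leaf_points(node):
    """Collect the points stored in the leaves below a node."""
    if not node["children"]:
        return list(node["points"])
    return [p for child in node["children"] for p in leaf_points(child)]

# Made-up sample data for illustration.
points = [(1.0, 1.0), (6.0, 2.0), (7.0, 7.0), (6.5, 6.5)]
mbr = minimum_bounding_rectangle(points)
tree = build_quadtree(points, 0.0, 0.0, 8.0)
```

In this sample, the two points near (7, 7) force the upper-right quadrant to be divided further, while the sparsely populated quadrants remain undivided.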
Also, some indexes using the distances from a certain reference point have been proposed. Two examples of such indexes are listed below.
(1) Distance Index
This method involves selecting a point from a set of object points to be retrieved and indexing according to the distance from that point (W. A. Burkhard et al., “Some Approaches to Best-Match File Searching”, Communications of the ACM, Vol. 16, No. 4, pp. 230-236, 1973). The distance is handled as an integer.
(2) Hierarchical Distance Index
This method provides a hierarchically structured distance index (T. Bozkaya et al., “Distance-based Indexing for High-dimensional Metric Spaces”, Proceedings of ACM SIGMOD, pp. 357-368, 1997). In general, a plurality of reference points are prepared.
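The general idea of indexing by distance from a reference point can be sketched in Python as follows. This is not the method of either cited paper: it assumes a single reference point, real-valued Euclidean distances rather than integers, a sorted list standing in for a one-dimensional ordered index, and candidate filtering by the triangle inequality (a point within distance r of the query must have a reference distance within r of the query's reference distance). The names `DistanceIndex` and `range_query` are hypothetical.

```python
import bisect
import math

def distance(p, q):
    """Euclidean distance (an assumption; any metric would do)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

class DistanceIndex:
    """Points ordered by their distance from one reference point."""

    def __init__(self, points, reference):
        self.reference = reference
        entries = sorted((distance(p, reference), p) for p in points)
        self.keys = [d for d, _ in entries]      # one-dimensional sort keys
        self.points = [p for _, p in entries]

    def range_query(self, query, radius):
        """Return (points within radius of query, number of candidates)."""
        dq = distance(query, self.reference)
        # Triangle inequality: only keys in [dq - radius, dq + radius]
        # can belong to points within `radius` of the query.
        lo = bisect.bisect_left(self.keys, dq - radius)
        hi = bisect.bisect_right(self.keys, dq + radius)
        candidates = self.points[lo:hi]
        hits = [p for p in candidates if distance(p, query) <= radius]
        return hits, len(candidates)

# Made-up sample data; the reference point is taken from the point set.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0), (6.0, 8.0), (0.0, 10.0)]
index = DistanceIndex(points, reference=points[0])
hits, examined = index.range_query((0.0, 9.0), 2.0)
```

In this sample, only two of the five points fall in the candidate key range, and one of them, (6.0, 8.0), is discarded after its actual distance to the query is computed. Candidates of this kind are one reason a plurality of reference points may be used.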
However, the conventional similarity search has the following problems:
(1) System Simplicity
Extensions to the specifications of SQL (Structured Query Language) have made database systems, particularly relational databases, more complex (Surajit Chaudhuri et al., “Rethinking Database System Architecture: Towards a Self-tuning RISC-style Database System”, Proceedings of the 26th International Conference on Very Large Databases, 2000). The functions of database systems have been increased and the optimization thereof has been made complex, so that maintenance, management and performance prediction are becoming difficult. In addition, the management cost and maintenance cost have increased. For this reason, the demand has increased for simplifying the database systems.
Conventionally, the B-tree has been used as the database indexing method. Even with the B-tree alone, the optimization for join and select processing has increased in complexity. Adding a multidimensional index, which is more complex than the B-tree, alongside the B-tree will make this complexity still more severe.
It is therefore desirable to create a multidimensional index through a technique, such as a B-tree, which has already been put to practical use.
(2) High-speed Performance
In many cases the similarity retrieval involves searching through very many objects, requiring high-speed performance. Even if the aforementioned simplicity has been achieved, a time-consuming system will not be put to practical use. Accordingly, it is required to achieve both simplicity and high-speed performance.
(3) Space Efficiency
In the case of a multidimensional index, the amount of disk space required increases with an increasing number of dimensions. A requirement for large disk space increases the data input/output count. Since the data input/output operation is a heavy process, the overall performance is affected. For this reason, it is desired that the required capacity be as small as possible.
(4) Adaptability to High Dimensions
It is not uncommon for the similarity retrieval to involve tens or hundreds of dimensions. In a sense it is natural and unavoidable that as the number of dimensions increases, more time and more space are required. It is therefore desired to minimize these problems and to allow practical application to high dimensions.