The problem of information retrieval from databases is becoming more and more important as our world moves further into the information age. Not only are databases becoming larger and more ubiquitous, they are being used to store more complex data types than in the past. This might include such data types as textual documents, images, audio clips, and multimedia files. Often such items can be characterized by vectors of relatively large dimensionality. For example, a set of 200 by 200 pixel images could be viewed as vectors in a 40,000-dimensional space. In other situations, complex data types may not be easily mapped into a vector space, yet can still be characterized as points in a "large metric space", in which distances between pairs of data items can nevertheless be computed. In either case, it is desirable to have methods that can quickly search for stored data points which in some sense are good matches for a search query. For example, the query might be an image, and the goal of search might then be to find similar images in the database. Such a search is often called a "near neighbor" search, in that we are seeking stored data items which are near the query point, in terms of some distance metric.
There are well-known methods for performing near neighbor searches efficiently when the data dimensionality is small, for example three dimensions. Typically such methods store the data in some type of tree structure, such that a search can proceed by following a path from the tree's root to a "leaf node" which (hopefully) best matches the query. FIG. 1 illustrates such a structure with a two-dimensional set of data points, lettered a to r. In the figure, the data has been hierarchically clustered into three levels, as indicated by the thickness of the circle representing each point. At the highest level are the data points g, k, and m, which we will suppose were randomly chosen out of the entire data set. These three points are used to partition the rest of the data into three regions, such that any given point is associated with the high-level point it is closest to. The thick lines indicate this separation.
At the next level of the hierarchy, two points have been randomly chosen within each high-level region, and are indicated by medium-thickness circles. For example, e and c were chosen in the "g region". The rest of the data points are then again split up, this time according to which second-level point they are closest to. The thin lines indicate these sub-regions. Similarly, we could form more levels, as would be done in a typical application, which would have many more data points.
FIG. 2 shows how the data set of FIG. 1 can be viewed as a tree, containing a set of "nodes" connected by directed "links". In this case the tree's root node does not correspond to any actual data item, but will be used conceptually as the starting point for any search. More generally, prior art methods sometimes use trees in which no nodes except the leaf nodes correspond to actual retrievable data points. Such differences are not relevant for present purposes, however. Also, some of the tree's nodes contain more than two child nodes; this is done here in order to limit the number of tree levels for explanatory purposes. Prior art methods may in general have any number of children for a given node, but commonly have two or fewer.
FIGS. 2 and 3 illustrate with dotted lines how a typical search might be conducted within the data set and search tree of FIGS. 1 and 2. In FIG. 3, an additional point, labeled Q, is shown to indicate a hypothetical query point. The goal of search is then to find the data point which is the nearest neighbor of Q. More generally, we might be interested in multiple nearest neighbors, or simply "near" neighbors, that is, approximately nearest neighbors. Starting at the root node, the search first compares Q to each of the first-level nodes. Out of these three, node m is the closest to Q, so it is chosen as the current search node. The child nodes of m are then checked, of which node p is closest to Q, so node p becomes the current node. The children of p are then checked, but neither of them is closer to Q than is p, so the search terminates, with node p as the result.
Note that the result of this particular search, node p, is not actually the nearest neighbor to the query Q. Such a possibility must be considered with virtually any non-exhaustive search method, and can be a problem whenever the query is not identical to any data point. If only an approximately nearest neighbor is required, this may not be a problem. Otherwise, the search must be modified to check additional paths. In the example of FIGS. 1 through 3, it is apparent that a search beginning at node k is necessary in order to find the true nearest neighbor of Q, which is node o. Prior art methods which are tree-based typically use some form of "backtracking" in order to deal with this problem. In particular, they typically use distance information at each node to decide which branches need to be searched, and which don't, depending on how likely each branch is to contain the true nearest neighbor(s). This often takes the form of backtracking, because a reasonable approach is to first do a fast search for an exact match, which only needs to follow one root-leaf path, and then subsequently back up and search other branches if an exact match is not found.
When data dimensionality is large, for example tens of dimensions or more, prior art search methods become less efficient. The same is true for non-vector data items having comparable levels of complexity. The most common approach by far in such situations is still a tree-structured database. A problem with this approach is that the number of tree nodes which must be searched increases severely, until in the limit of very large dimensionality, the search is no faster than simple exhaustive search by direct comparison of the query to every database item. There appear to be at least two reasons for this. The first reason is that for any reasonable data set, the search space becomes mostly empty as the dimensionality grows very large. Put differently, the average distance between a data item and its nearest neighbor grows large and, importantly, becomes nearly the same as the average distance between two randomly chosen points. Another way of saying this is that the distribution of interpoint distances becomes very narrow with increasing dimensionality. Because of this, measuring the distance between a given point and the query provides very little information as to which other points will be closest to the query. This is a well-known problem, and is often called the "dimensionality curse".
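The narrowing of the interpoint distance distribution can be observed numerically. The following is a minimal sketch, assuming points drawn uniformly at random from a unit hypercube and a Euclidean metric (neither assumption comes from the text above); the function name `relative_spread` and its parameters are illustrative only. It reports the ratio of the standard deviation to the mean of all pairwise distances, which shrinks as the dimensionality grows.

```python
import math
import random


def relative_spread(dim, n_points=100, seed=0):
    """Return std/mean of all pairwise Euclidean distances.

    Points are drawn uniformly from the unit hypercube (an assumption
    made for illustration, not a property of any particular database).
    """
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [
        math.dist(pts[i], pts[j])
        for i in range(n_points)
        for j in range(i + 1, n_points)
    ]
    mean = sum(dists) / len(dists)
    var = sum((d - mean) ** 2 for d in dists) / len(dists)
    return math.sqrt(var) / mean


# The ratio falls as the dimensionality rises: distances concentrate
# around their mean, so one distance says little about the others.
for dim in (2, 20, 200):
    print(dim, round(relative_spread(dim), 3))
```

A small ratio means that nearly every pair of points is about equally far apart, which is the condition under which a single distance measurement carries little information about where the nearest neighbor lies.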
However, while the dimensionality curse seems to be an insurmountable problem, there is an additional problem with tree structures, which has not been addressed by the prior art. In particular, conventional tree structures are too rigid to allow a fast search, even when the dimensionality curse does not apply. A tree only divides up the search space in one way, and any given data item is associated with only one branch of the tree. This rigid division implies that there will be fixed boundaries between sub-regions (as shown in FIG. 1). Moreover, whenever a query point falls near such a boundary, the search procedure will typically need to check the sub-regions on both sides of the boundary. This, of course, makes the search less efficient for such query points. Furthermore, as the effective dimensionality of the space increases, the probability that a query will be near a boundary (and indeed, near many boundaries) increases dramatically. Because of this, a typical tree search in a large-dimensional space is required to check many branches. This problem is a direct result of the use of a strict tree structure, and occurs even when the space is not "mostly empty" as described above. I will call this the "rigid hierarchy problem", to distinguish it from the more general dimensionality curse problem which results from empty space.
There are many variations of tree search methods in the prior art. For example, "vantage point trees" structure the data by placing an item in one branch if it is within a critical distance from its parent item, and another branch if it is beyond that distance. So-called "generalized hyperplane trees" are similar, but place an item in one of two branches, according to which of two reference points it is closer to. More sophisticated methods, such as Brin's GNATs, allow an arbitrary number of such reference points at each branch. (See "Near Neighbor Search in Large Metric Spaces", by Sergey Brin, Stanford University, Department of Computer Science, 1995, pp. 1-11). However, one thing these methods all have in common is that they use a strict tree structure of some sort, in which there is only one path from a given leaf to the tree's root. Because of this, they all suffer from the rigid hierarchy problem just described.
A further disadvantage of these prior art methods is that a near neighbor search is not significantly faster when the query is a known data point than when it is a random point. In other words, even when the query is known to be identical to a stored data point, and the location of that stored point is known, a search for near neighbors (other than the stored point itself) must proceed essentially as if the query were an arbitrary, unknown point, in particular by starting a new search at the tree's root. A consequence of this is that user feedback cannot be easily used to speed up subsequent searches. For example, a user might flag a particular search result item as being highly relevant to his/her needs, and request other items which are similar to it. Prior art tree methods would not be able to perform such a constrained search significantly faster than an entirely new search. More generally, they cannot easily use previous search results to speed up a subsequent search, even if the new query is similar to the previous query. This can be viewed as another aspect of the rigid hierarchy problem, in that it results from the use of a strict, inflexible partitioning of the database.
So-called "semantic networks" are a type of link-based data representation, which are not limited to a tree shape, but rather can be arbitrary graphs (i.e., "network shaped"). However, semantic networks have not been applied to the general problem of nearest neighbor search. Rather, they are typically used to find "interesting" relations between specified existing nodes of the network, in the form of paths through the network which connect the nodes.
Furthermore, a semantic network search requires specification of one or more stored nodes, as opposed to an arbitrary new item description (as in the general near neighbor search problem). For example, if stored nodes representing "penguin" and "ostrich" were specified, a semantic network search might return a reference to a "bird" node, representing an interesting implicit relation between the specified nodes. However, even a search such as this (which would not normally be used to find near neighbors in any case) depends on the "query" nodes already being stored. Thus a search which specified the prehistoric bird "archaeopteryx" would produce no result, unless a node and corresponding link relations had already been stored for the "archaeopteryx" concept.
Finally, it should be noted that certain structures and methods have been proposed for large-dimensionality searching which are specialized to solve one aspect of the problem at the expense of ignoring other aspects. For example, the OPT-Trees mentioned by Brin (as referenced above) allow for a very small number of distance computations during search, but at the expense of having to traverse multiple lists of N pre-computed distances (N being the number of data points). Because N is often large in applications of interest, such a method, which scales at least linearly in N (as does exhaustive search), is typically not practical for such applications. Similarly, other methods might speed up search at the expense of requiring storage of the order N*N or worse. Again, such methods, while interesting and possibly useful for small databases, do not scale well to large databases and thus have limited applicability.
It has not been generally recognized in the prior art that the problems of dimensionality and rigid hierarchy are separable. Rather, it seems that when search performance in prior art methods gets worse with increasing dimensionality, researchers attribute this only to the dimensionality curse. Since the dimensionality curse is probably unavoidable, they conclude that greatly improved database searching methods are not possible. Since the rigid hierarchy problem is separate, though, there may nonetheless be methods which significantly improve upon the prior art, even within the bounds imposed by the problems of the dimensionality curse.
The present invention solves the aforementioned needs. It recognizes that the rigid hierarchy problem is separate from the dimensionality curse problem. It provides a computer-implemented system and method for allowing fast near neighbor searches in databases that represent large metric spaces.
The present invention provides a search method, which finds a near neighbor to a query with fewer distance computations, on average, than comparable prior art tree structured methods. The search method can make use of previous search results to speed up subsequent searches on similar queries. The search time in the present search method scales better than linearly with the number of data items. The storage requirements for the search method are linear in the number of data items.
The present invention allows fast near neighbor searches in large metric space databases where the data elements in the database are high dimensional and each data element represents a point in a large metric space. Given a query item, which also represents a point in the large metric space, the invention finds one or more data items in the database which are near neighbors of the query item. The invention first preprocesses a set of data items, by computing distances between pairs of items and storing links between pairs which are "near" one another. In general, a given item will link to more than one other item, but will link to a small number of items, relative to the total number of items. The set of all links can be viewed as imposing a network structure on the database.
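The preprocessing stage can be illustrated with a minimal sketch. It assumes that "near" means "among the k closest other items" and that every pair of items is compared by brute force; the text above fixes neither choice, and the names `build_links`, `k`, and `dist` are hypothetical.

```python
import math


def build_links(items, k=3, dist=math.dist):
    """Map each item index to the indices of its k nearest other items.

    Brute-force sketch: every pair is compared, which is only practical
    for small sets. The choice of k-nearest as the meaning of "near" is
    an assumption made for illustration.
    """
    links = {}
    for i, a in enumerate(items):
        # Sort all other items by distance to item i (ties broken by index).
        others = sorted((dist(a, b), j) for j, b in enumerate(items) if j != i)
        links[i] = [j for _, j in others[:k]]
    return links


# Four points on a line; the last one is far from the rest.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
links = build_links(points, k=2)
print(links)
```

Each item ends up with a small, fixed number of outgoing links, so the storage for the link structure in this sketch grows linearly with the number of items.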
Search of the database proceeds by following links from item to item, in particular preferentially following links to items which are nearest the query Q. In one embodiment, the search terminates upon reaching an item R which is closer to Q than are all the items to which R links.
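The link-following search with this termination rule can be sketched as follows. This is an illustration under stated assumptions rather than a definitive implementation: the link table is written out by hand, the metric is Euclidean, and the names `greedy_search`, `items`, `links`, and `start` are hypothetical.

```python
import math


def greedy_search(items, links, query, start=0, dist=math.dist):
    """Follow links toward the query until no linked item is closer.

    Returns (index, distance) of the item R at which the search stops,
    i.e. an item that is at least as close to the query as every item
    it links to.
    """
    current = start
    current_d = dist(items[current], query)
    while True:
        best, best_d = current, current_d
        for j in links[current]:
            d = dist(items[j], query)
            if d < best_d:
                best, best_d = j, d
        if best == current:
            # No linked item is strictly closer: terminate here.
            return current, current_d
        current, current_d = best, best_d


# Hypothetical database: five points on a line, each linked to its
# immediate neighbors.
items = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
links = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

result, distance = greedy_search(items, links, query=(3.2, 0.0), start=0)
print(result, distance)
```

Because the search may start at any stored item, a previous result can be passed as `start` for a similar follow-up query, which is one way the earlier search can shorten the later one.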
In one embodiment, the search process is hierarchical, although a hierarchical search process is not required. In a preprocessing stage, a subset of the data items is selected, and links are created between near neighbors within the subset; these links are in addition to the links created within the database as a whole. At search time, a "coarse" search is first done within the selected subset, using only the links within the subset. This may be viewed as searching the highest level of the hierarchy, which contains fewer data items than the entire database, and serves to "narrow" the search. The method then conducts a "fine" search within the entire database, using the entire set of links, starting at the item which was the result of the coarse search. Such a two-level search process can readily be expanded to more levels, to accommodate very large databases. The hierarchical search method is less susceptible to "local minimum" problems than would be a "flat", single-level search.
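A two-level search of this kind can be sketched as below. The subset, both link tables, and the names `follow_links` and `two_level_search` are assumptions made for illustration; in particular the subset here is chosen by hand rather than by any selection rule from the text.

```python
import math


def follow_links(items, links, query, start):
    """Greedy step: move to the linked item nearest the query until
    no linked item is strictly closer than the current one."""
    current = start
    while True:
        best = min(
            links[current],
            key=lambda j: math.dist(items[j], query),
            default=current,
        )
        if math.dist(items[best], query) >= math.dist(items[current], query):
            return current
        current = best


def two_level_search(items, coarse_links, fine_links, query, start):
    """Coarse search within the subset, then fine search over all items,
    starting from the coarse result."""
    coarse_result = follow_links(items, coarse_links, query, start)
    return follow_links(items, fine_links, query, coarse_result)


# Hypothetical database: nine points on a line at x = 0..8.
items = [(float(x), 0.0) for x in range(9)]
# Fine links: each item links to its immediate neighbors.
fine_links = {i: [j for j in (i - 1, i + 1) if 0 <= j < 9] for i in range(9)}
# Coarse links: only the hand-picked subset {0, 4, 8} is linked.
coarse_links = {0: [4], 4: [0, 8], 8: [4]}

result = two_level_search(items, coarse_links, fine_links, (6.9, 0.0), start=0)
print(result)
```

In this example the coarse stage reaches item 8 after two hops, and the fine stage needs only one further hop, whereas a flat search from item 0 would have walked through every intermediate item.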
The present invention comprises a computer implemented method for performing near neighbor database searches. It uses a database, which represents a large metric space having data items, where each data item in the database represents a point in the large metric space. The method selects data items to form a subset. It designates a data item in the subset as a current data item, and computes a distance between the current data item and at least one other data item in the subset. Using the computed distances, at least one near neighbor data item for the current data item is found, and a link is created from the current data item to each near neighbor data item. The steps of designating a data item and using the computed distances are repeated for all data items within the subset. At least one data item from the subset is selected to form a current search set. For all data items in the current search set, a distance is computed between each data item and a query. A distance is then computed between the query and a near neighbor data item linked by a data item in the current search set, and this linked near neighbor data item is added to the current search set if an addition criterion is met. The steps of computing a distance and adding the linked near neighbor data item are repeated until a search criterion is met. The search criterion may be met when a specified amount of processing has occurred, when the item in the current search set closest to the query is within a user-specified distance from the query, or when the item in the current search set closest to the query is within a default distance from the query.
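The search-set procedure can be sketched as follows. The text leaves the addition criterion open, so this sketch assumes one: a linked item is added when it is closer to the query than the item that links to it. The processing budget is counted in distance computations, and the names `search_set_search`, `max_computations`, and `good_enough` are hypothetical.

```python
import math


def search_set_search(items, links, query, seeds,
                      max_computations=50, good_enough=0.0):
    """Grow a search set along links until a search criterion is met.

    Search criteria used here: the budget of distance computations is
    spent, or the closest item found is within `good_enough` of the query.
    Addition criterion (an assumption): the linked item is closer to the
    query than the item linking to it.
    """
    dists = {i: math.dist(items[i], query) for i in seeds}  # current search set
    computations = len(dists)
    frontier = list(seeds)
    while (frontier and computations < max_computations
           and min(dists.values()) > good_enough):
        i = frontier.pop(0)
        for j in links[i]:
            if j in dists:
                continue
            d = math.dist(items[j], query)
            computations += 1
            if d < dists[i]:  # assumed addition criterion
                dists[j] = d
                frontier.append(j)
    best = min(dists, key=dists.get)
    return best, dists[best]


# Hypothetical database: six points on a line, linked to immediate neighbors.
items = [(float(x), 0.0) for x in range(6)]
links = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}

best, best_d = search_set_search(items, links, (4.2, 0.0), seeds=[0])
print(best, best_d)
```

Raising `good_enough` trades accuracy for speed: with a threshold of 1.5 the same query stops at item 3 instead of continuing to item 4.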
In one embodiment, the invention comprises (a) using a database which represents a large metric space having data items, where each data item in the database represents a point in the large metric space, (b) designating all the data items in the database as an initial current subset, (c) selecting data items from the current subset to form a new current subset, (d) designating a data item in the current subset as a current data item, computing a distance between the current data item and at least one other data item in the current subset, and using the computed distances, finding at least one near neighbor data item for the current data item and creating a link from the current data item to each near neighbor data item, and (e) repeating step (d) until each data item within the current subset has been designated as the current data item. Steps (c) through (e) are repeated at least once.
The method further comprises (a) setting the current subset level to a selected subset, (b) selecting at least one data item from the current subset level to form a current search set, (c) for all data items in the current search set, computing a distance between each data item and a query, (d) computing a distance between the query and a near neighbor data item linked by a data item in the current search set and (e) adding this linked near neighbor data item to the current search set if an addition criterion is met. Steps (d) and (e) are repeated until a search criterion is met.
In an alternative embodiment, the present invention comprises a computer implemented method for finding a data item nearby a query point comprising (a) using a database with data items, where each data item in the database represents a point in a large metric space, (b) creating links between data items that are near neighbors, (c) selecting a data item to be the current search item, (d) computing a distance between the current search item and the query point, (e) computing a near neighbor distance between the query point and a near neighbor item linked by the current search item, (f) if the near neighbor distance is less than the distance between the current search item and the query point, selecting the near neighbor item as the current search item, and (g) repeating steps (e) and (f) until a search criterion is met. The search criterion may be met when the smallest distance between any near neighbor items and the query point is greater than the distance between the current search item and the query point. When the search criterion is met, the current search item is selected as a result and the result is an approximately nearest neighbor of the query.
In an alternative embodiment, the present invention comprises a computer implemented method for performing near neighbor database searches comprising (a) using a database which represents a large metric space having data items, where each data item in the database represents a point in the large metric space, (b) computing the distance between data items, (c) storing a link, which is a pointer to another data item, for data items that are near one another; (d) designating an item to be a current search item, (e) computing a distance between the current search item and a query point in the large metric space, (f) computing a distance between the query point and linked items to which the current search item links, (g) if the distance from the query point to the linked item is less than the distance between the query point and the current search item, designating the linked item as the current search item and repeating steps (f) and (g), and (h) if the distance from the query point to the linked item is greater than the distance between the query point and the current search item, designating the current search item as a search result.
The present invention comprises a data structure for storing database items and their links to near neighbor database items comprising a table containing an entry for each of a plurality of data items from a database, each entry comprising the data item and a set of pointers to the near neighbor database items for each subset of data items within the database.
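One possible layout for such a table is sketched below. It assumes that items are identified by their position in the table, that pointers are stored as table indices, and that subsets are numbered by level; the names `Entry`, `links_by_level`, and `neighbors` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Entry:
    """One table row: a data item plus its near neighbor pointers,
    kept separately for each subset (level) the item belongs to."""
    item: Tuple[float, ...]
    links_by_level: Dict[int, List[int]] = field(default_factory=dict)


# Hypothetical five-item database on a line. Level 0 is the whole
# database; level 1 is a hand-picked subset containing items 0, 2, and 4.
table = [Entry(item=(float(x), 0.0)) for x in range(5)]
for i in range(5):
    table[i].links_by_level[0] = [j for j in (i - 1, i + 1) if 0 <= j < 5]
table[0].links_by_level[1] = [2]
table[2].links_by_level[1] = [0, 4]
table[4].links_by_level[1] = [2]


def neighbors(index, level):
    """Return the pointers of one item within one subset (empty if the
    item is not a member of that subset)."""
    return table[index].links_by_level.get(level, [])


print(neighbors(2, 0), neighbors(2, 1), neighbors(1, 1))
```

Keeping the per-level pointer sets inside a single entry means a search can switch from a coarse level to a finer one at the same item without a second lookup.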
The computer implemented methods are embodied in software programs that may be stored on a computer-readable medium.