The present invention generally relates to techniques for querying large databases and, more particularly, to query techniques which employ a user/computer interactive process including visual feedback in order to find the best responses to a given query.
In recent years, large databases have become very common because of the ability to store large amounts of information with the current advances in hardware technology. Such databases are often queried by users in order to find specific patterns of behavior. Finding the closest matching object is important for a number of applications. Examples include similarity searches in geometric databases, multimedia databases, and data mining applications such as fraud detection, information retrieval, among numerous other domains. Many of these domains contain applications in which the dimensionality of the representation is very high. For example, a typical feature extraction operation on an image will result in hundreds of dimensions. By way of another example, a data set which is inherently dimensionally high may be a demographic database in which the dimensions comprise information such as the name, age, salary, and other features which characterize a person.
Similarity retrieval problems are reasonably well solved for low dimensional applications for which efficient index structures have been proposed. A wide variety of multidimensional indexes have been proposed which work well for low dimensional data. For a comprehensive overview, see, e.g., V. Gaede et al., xe2x80x9cMultidimensional Access Methods,xe2x80x9d ACM Computing Surveys, vol. 30, no. 2, pp. 170-231, 1998, the disclosure of which is incorporated by reference herein. These structures can support a wide range of queries such as point queries, range queries, or similarity queries to a predefined target.
However, many empirical studies have shown that traditional indexing methods fail in high dimensional spaces, see, e.g., K. Beyer et al., xe2x80x9cWhen is Nearest Neighbors Meaningful?,xe2x80x9d Proceedings of the ICDT Conference, 1999; K. -I. Lin et al., xe2x80x9cThe TV-tree: An Index Structure for High Dimensional Data,xe2x80x9d VLDB Journal, vol. 3, no. 4, pp. 517-542, 1992; N. Katayama et al., xe2x80x9cThe SR-Tree: An Index Structure for High Dimensional Nearest Neighbor Queries,xe2x80x9d Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 369-380, 1997; R. Weber et al., xe2x80x9cA Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces,xe2x80x9d VLDB, pp. 194-205, 1998; and S. Berchtold et al., xe2x80x9cThe X-Tree: An Index Structure for High-Dimensional Data,xe2x80x9d Proceedings of the International Conference on Very Large Databases (VLDB""96), Bombay, India, pp. 28-39, 1996, the disclosures of which are incorporated by reference herein. In such cases, almost the entire index is accessed by a single query. In fact, the use of most indexes may be less efficient and/or effective than a sequential scan because of the simplicity of the latter. Furthermore, as recent results in the above-referenced K. Beyer et al. article show, it may often be difficult to determine the meaningfulness of the nearest neighbor for a query point. In fact, in many cases, it has been shown that the distance measurements to the target are not very meaningful with the use of traditional distance functions.
The present invention provides techniques for finding query responses from database queries using an interactive process between a user (e.g., a person entering a query to a database) and a computer system (e.g., a computing system upon which the database resides or which has access to the database). The interactive process comprises providing the user with one or more visual perspectives as feedback on the distribution of points in the database. These visual perspectives may be considered by the user in order for the user to provide feedback to the computer system. The computer system may then use the user-provided feedback to determine the best response to the query.
In one illustrative aspect of the invention, a method for resolving queries in large databases is provided which uses an interactive process. More specifically, in response to a user query to a database, the method comprises providing a user with different views of the data which the user can use in order to vote on the effectiveness of different sets of points which match the query the user entered. These responses may then be aggregated to determine the true sets of responses corresponding to the original user intent.
The conventional fully-automated systems discussed above are incomplete in their characterization of the data in terms of a single best distance function. However, the present invention realizes that different projections can provide different views of the data, all of which may be quite informative to a human in understanding the relationship between a query point and the rest of the data in a database. Furthermore, the present invention realizes that it is often easier for a human to make judgments on the viability of clusters of different shapes and sizes in a given view, or whether a view yields any useful information at all. That is, a computer cannot match the visual insight and intuition of a human in distinguishing useful patterns from the data. On the other hand, a human needs computational support in analyzing large volumes of high dimensional data in order to find the relevant characteristics of the data which relate the xe2x80x9cquery pointxe2x80x9d to the database. It is to be appreciated that, in one example, the query point may refer to a data record specified by a user that is used to locate patterns in a database that are closely related to the data record.
Accordingly, in accordance with an illustrative embodiment of the present invention, a computer provides an iterative visual feedback loop in which carefully chosen views of the data are presented to a user to illustrate the different clusters in the data. Once these views have been determined, then kernel density estimation may be used to provide a visual profile of the distribution of the data. These visual profiles are used to determine the clusters.
It is evident to those skilled in the art that the above method may be used in order to determine either the closest patterns to a given query point or the different clusters in the data. This can be done by finding the patterns close to multiple query points which are chosen randomly. These different patterns form the clusters in the data.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.