In information or document retrieval (referred to herein as “IR”), a query is applied to a set of documents, such as books or articles, to retrieve relevant documents. Automated ranking of query results is common in information retrieval. In contrast, database queries return unordered sets of tuples, or require a ranking function to be explicitly specified by the user, e.g., using an ORDER BY clause. It is often impossible for an exploratory user (e.g., a data analyst, or a customer browsing a product catalog) to cope with large unordered result sets. Such a user may also be unable to specify an explicit ranking function that is appropriate for the application.
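The contrast between an unordered result set and an explicitly ranked one can be illustrated with a minimal sketch. It assumes an in-memory SQLite database; the `products` table, its columns, and its rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical product catalog held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("lamp", 30.0), ("desk", 120.0), ("chair", 45.0)],
)

# Without ORDER BY, the order of the returned tuples is not guaranteed.
unordered = conn.execute(
    "SELECT name FROM products WHERE price < 200"
).fetchall()

# With ORDER BY, the user must name the ranking criterion explicitly.
ranked = conn.execute(
    "SELECT name FROM products WHERE price < 200 ORDER BY price ASC"
).fetchall()

conn.close()
```

Here the user had to know in advance that price is the appropriate ranking criterion, which is the burden described above.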
Extracting similarity functions for ranking has been investigated in areas outside database systems. For example, similarity functions have been used for ranking in document or information retrieval. One similarity function used in information retrieval is cosine similarity. In cosine similarity for information retrieval, a document is modeled as a vector of words and the similarity between two documents is defined as the normalized dot product of their vectors. Cosine similarity for information retrieval has been enhanced by term frequency-inverse document frequency (TF-IDF) normalization techniques, which assign different importance to words based on the frequencies of their occurrences within the document collection.
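Cosine similarity with TF-IDF weighting can be sketched as follows. This is a minimal illustration, not the formulation of any particular IR system: it assumes whitespace tokenization, raw term counts, and an unsmoothed `log(n / df)` inverse document frequency. The function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of word -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed weighting: raw term count times log(n / df).
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors


def cosine_similarity(u, v):
    """Dot product of two sparse vectors divided by the product of their norms."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "ranking database query results",
    "ranking query results in a database",
    "cooking pasta at home",
]
vecs = tf_idf_vectors(docs)
```

Under these assumptions, the first two documents share several weighted words and score higher against each other than either does against the third, with which they share none.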
Research in web search engines has also influenced ranking techniques. In particular, web search engines rank pages by analyzing link structures and network topologies in addition to page content.
Existing systems for ranking database queries typically require additional external information, such as user input or training data. Systems referred to as MARS and FALCON employ content-based techniques for retrieval in multimedia databases. In both systems, the user can specify one or more positive examples of objects, and the system attempts to retrieve similar objects through an iterative process of relevance feedback from the user. FALCON differs from MARS in that it generalizes to any metric distance function between objects, while MARS relies on vector spaces. Both systems are primarily designed for numeric multimedia databases, and learn similarity concepts that are used for ranking through relevance-feedback from the user.
D. Wilson and T. Martinez, Improved Heterogeneous Distance Functions, Journal of AI Research, 1997 proposes distance functions for heterogeneous data (both categorical and numerical). The methods disclosed by Wilson and Martinez are mostly useful for classification applications and require the data to be accompanied by class labels.
Ranking is an important component in collaborative filtering research, especially in the design of recommender systems. In collaborative filtering, the objective is to predict the utility of items in a database to a particular user based on a database of user preferences. These methods require training data containing queries and their ranked results.
W. Cohen, Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity, SIGMOD, 1998 discloses a query language that introduces a similarity operator for textual attributes. The Cohen paper also uses inverse document frequency ideas from information retrieval in a non-ranking application.
Some research has been done on clustering categorical databases based on co-occurrence analysis. The idea of co-occurrence is that two values of a categorical attribute are deemed similar if they often co-occur with the same values of other attributes.
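The co-occurrence idea can be sketched as follows. This is a simplified illustration rather than the method of any cited clustering work: it assumes each record is a dictionary of categorical attributes, and it compares two values by the Jaccard overlap of the attribute-value pairs they appear with. The function names and the sample car records are hypothetical.

```python
from collections import defaultdict


def cooccurrence_profiles(rows, attribute):
    """Map each value of `attribute` to the set of (other attribute, value)
    pairs that appear with it in at least one row."""
    profiles = defaultdict(set)
    for row in rows:
        value = row[attribute]
        for other_attr, other_value in row.items():
            if other_attr != attribute:
                profiles[value].add((other_attr, other_value))
    return dict(profiles)


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical records used only for illustration.
rows = [
    {"make": "Honda", "type": "sedan", "price": "low"},
    {"make": "Toyota", "type": "sedan", "price": "low"},
    {"make": "Honda", "type": "hatchback", "price": "low"},
    {"make": "Toyota", "type": "hatchback", "price": "low"},
    {"make": "Porsche", "type": "coupe", "price": "high"},
]
profiles = cooccurrence_profiles(rows, "make")
honda_toyota = jaccard(profiles["Honda"], profiles["Toyota"])
honda_porsche = jaccard(profiles["Honda"], profiles["Porsche"])
```

In this sample, "Honda" and "Toyota" co-occur with the same types and prices and are therefore deemed similar, while "Honda" and "Porsche" share no co-occurring values.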
Top-K techniques exist that, given an explicit similarity (or distance) function that satisfies certain monotonic properties, efficiently retrieve the top-K tuples from a database.
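One well-known family of such techniques is the threshold style of algorithm, sketched below in simplified in-memory form. The text above does not name a specific algorithm, so this choice is an assumption made for illustration. The sketch further assumes that the scoring function is the sum of per-attribute scores (which is monotonic) and that higher scores are better. The function name and sample data are hypothetical.

```python
import heapq


def threshold_top_k(scores, k):
    """Return the k best (total, object_id) pairs, best first.

    `scores` maps object_id -> tuple of per-attribute scores.
    The aggregate is the sum of the per-attribute scores.
    """
    if not scores or k <= 0:
        return []
    m = len(next(iter(scores.values())))
    # One list per attribute, sorted by descending score (sorted access).
    sorted_lists = [
        sorted(scores, key=lambda obj, i=i: scores[obj][i], reverse=True)
        for i in range(m)
    ]
    seen = set()
    top = []  # min-heap holding the best k (total, object_id) pairs so far
    for depth in range(len(scores)):
        last = []
        for i in range(m):
            obj = sorted_lists[i][depth]
            last.append(scores[obj][i])
            if obj not in seen:
                seen.add(obj)
                total = sum(scores[obj])  # random access to all attributes
                if len(top) < k:
                    heapq.heappush(top, (total, obj))
                elif total > top[0][0]:
                    heapq.heapreplace(top, (total, obj))
        # No unseen object can score above the sum of the last-seen scores.
        threshold = sum(last)
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(top, reverse=True)


# Hypothetical per-attribute scores used only for illustration.
scores = {"a": (9, 8), "b": (8, 9), "c": (3, 2), "d": (1, 4)}
best = threshold_top_k(scores, 2)
```

Because the aggregate is monotonic, the scan can stop as soon as the k-th best total seen so far is at least the threshold, without examining every tuple. In this sample the scan stops after two rows of sorted access.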
There is a need for a system that automatically extracts an appropriate similarity function from a database, ranks records by relevance to a given query and returns the relevant records in a ranked order.