A popular aspect of the query model in Information Retrieval is the ranking of query results. However, the Boolean query model used in database systems does not support the ranking of query results. For example, a selection query on an SQL database returns all tuples that satisfy the conditions in the query. Furthermore, there are two scenarios that are not handled gracefully by an SQL system: (a) the Empty-Answers Problem, in which the query is too selective and the answer may be empty; and (b) the Many-Answers Problem, in which the query is not selective enough and too many tuples may be in the answer. In both cases, it is desirable to rank the database tuples by their degree of “relevance” to the query (though the user may not have explicitly specified how) and return only the top-K matches. The difference between the two scenarios is that in the empty-answers case, the returned tuples only approximately match the query conditions, whereas in the many-answers case they are a subset of the tuples that match the query conditions.
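The difference between the two scenarios can be illustrated with a minimal sketch. The function name `top_k`, the dictionary-based tuple layout, the scoring rule (the fraction of query conditions a tuple satisfies), and the caller-supplied `preference` function are all assumptions made for illustration only; they are not a ranking function taken from this description.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def top_k(rows: List[Row], conditions: Row, k: int,
          preference: Callable[[Row], float] = lambda row: 0.0) -> List[Row]:
    """Return at most k rows for a conjunctive equality query (illustrative only)."""

    def matched_fraction(row: Row) -> float:
        # Fraction of the query conditions that this row satisfies.
        if not conditions:
            return 1.0
        hits = sum(1 for attr, val in conditions.items() if row.get(attr) == val)
        return hits / len(conditions)

    exact = [row for row in rows if matched_fraction(row) == 1.0]
    if exact:
        # Many-answers case: the result is a subset of the exact matches,
        # ordered by the hypothetical preference score.
        return sorted(exact, key=preference, reverse=True)[:k]
    # Empty-answers case: no tuple matches exactly, so return the tuples
    # that approximately match, ordered by how many conditions they satisfy.
    return sorted(rows, key=matched_fraction, reverse=True)[:k]
```

In this sketch the many-answers branch never returns a tuple that fails a query condition, while the empty-answers branch necessarily does, which mirrors the distinction drawn above.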
Automated ranking of database query results is beneficial in applications such as customers browsing product catalogs. Consider, for example, a potential home buyer searching for homes in a realtor's home-search database. The Empty-Answers Problem is illustrated by a very selective query such as “City=Seattle and Price=cheap and Pool=yes and Location=waterfront”, which may yield very few or no results. In this case, ranking the query results may not be particularly important. On the other hand, the Many-Answers Problem is illustrated by a query such as “City=Seattle and Location=waterfront”. This query is not very selective and may yield too many tuples in the results. Accordingly, a query model that could rank the database tuples by their degree of “relevance” to the query and return only the top-K matches would provide significant benefit. Currently, however, there are no query models available for structured databases that adequately address the Many-Answers Problem.
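The two example queries can be run against a toy table to show how a Boolean selection behaves. This is a sketch only: the table name `homes`, its columns, and the five sample rows are hypothetical, and an in-memory SQLite database stands in for the realtor's database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE homes (id INTEGER PRIMARY KEY, city TEXT, price TEXT, "
    "pool TEXT, location TEXT)"
)
# Hypothetical sample rows; a real catalog would hold many thousands.
conn.executemany(
    "INSERT INTO homes (city, price, pool, location) VALUES (?, ?, ?, ?)",
    [
        ("Seattle", "expensive", "yes", "waterfront"),
        ("Seattle", "expensive", "no", "waterfront"),
        ("Seattle", "moderate", "no", "waterfront"),
        ("Seattle", "cheap", "no", "downtown"),
        ("Bellevue", "cheap", "yes", "suburb"),
    ],
)

# Very selective query: no row satisfies every condition.
selective = conn.execute(
    "SELECT id FROM homes WHERE city = 'Seattle' AND price = 'cheap' "
    "AND pool = 'yes' AND location = 'waterfront'"
).fetchall()

# Less selective query: every matching row is returned, in no ranked order.
broad = conn.execute(
    "SELECT id FROM homes WHERE city = 'Seattle' AND location = 'waterfront'"
).fetchall()

print(len(selective))  # 0
print(len(broad))      # 3
```

The selective query returns nothing even though several rows satisfy most of its conditions, and the broad query returns every match with no indication of which is most relevant. On a table of realistic size, the second result set would be far too large to browse.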
Ranking functions have been investigated in areas outside database research, such as in Information Retrieval (IR). The vector space model and probabilistic information retrieval (PIR) models are very successful in practice. However, such models are adapted for retrieving information in text data environments and do not necessarily benefit a structured data environment such as a database. The structured data environment of a database includes, for example, columns that signify groupings of attribute values, something which is not available in text data. Additionally, most ranking functions in Information Retrieval assume some form of independence between data values, because deriving associations/dependencies between data values is notoriously hard due to the huge size of the term space in text data. Ranking is also an important component in collaborative filtering research.
Previous database research includes some work on the automatic extraction of similarity/ranking functions from a database. Early work considered vague/imprecise similarity-based querying of databases. Various methods have also been proposed for integrating databases and information retrieval systems. Some prior methods employ relevance-feedback techniques for learning similarity in multimedia and relational databases. Other methods use keyword-based retrieval systems over databases. However, previous methods have various disadvantages, including, for example, reliance on training data derived from queries that require user attention, the use of ad-hoc techniques loosely based on the vector-space model, and a failure to account for associations and dependencies between data values that exist in structured data environments.
Accordingly, a need exists for an improved way to rank database query results that takes advantage of probabilistic information retrieval (PIR) and accounts for associations and dependencies between data values that exist in structured data environments.