Many data sources contain data entities that may be ordered according to a variety of attributes associated with the entities. Each such ordering effectively ranks the entities according to the values in an attribute domain. These values may reflect various quantities of interest for the entities, such as physical characteristics, quality, reliability, or credibility, to name a few. Attributes of this kind are referred to as rank attributes. The domain of a rank attribute depends on its semantics. For example, the domain could consist of either categorical values (e.g., service can be excellent, fair, or poor) or numerical values (e.g., an interval of continuous values). The existence of rank attributes along with data entities leads to enhanced functionality and query processing capabilities.
Typically, users specify their preferences toward specific attributes. Preferences are expressed in the form of numerical weights assigned to rank attributes. Query processors incorporate functions that weight attribute values by user preference, deriving a score for each individual entity. Several techniques have been developed to perform query processing with the goal of identifying results that optimize such functions. A typical example is a query that seeks to quickly identify the k data entities that yield the best scores among all entities in the database. At an abstract level, such queries can be considered generalized forms of selection queries.
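As a minimal sketch of the scoring described above, the following Python fragment weights each rank attribute by a user preference and returns the k best-scoring entities by examining every entity. The names `score`, `top_k`, and `hotels`, the sample values, and the choice of a weighted sum as the scoring function are illustrative assumptions and are not taken from any particular query processor.

```python
import heapq

def score(values, weights):
    """Weighted sum of rank attribute values (assumed scoring function)."""
    return sum(w * v for w, v in zip(weights, values))

def top_k(entities, weights, k):
    """Return the names of the k highest-scoring entities, best first.

    This baseline scores every entity, so its cost grows with the
    size of the data set.
    """
    return heapq.nlargest(
        k, entities, key=lambda name: score(entities[name], weights)
    )

# Hypothetical data: each entity has two rank attributes, (service, value),
# already mapped to numbers (e.g., poor=1 ... excellent=5).
hotels = {
    "Alpha": (4, 3),
    "Beta": (2, 5),
    "Gamma": (5, 1),
    "Delta": (3, 3),
}

# The user cares three times as much about service as about value.
best = top_k(hotels, weights=(3, 1), k=2)
print(best)
```

With these assumed weights the scores are 15, 11, 16, and 12 for Alpha, Beta, Gamma, and Delta respectively, so the query returns Gamma followed by Alpha.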
Several prior art techniques disclose a framework for preference-based query processing. Such works consider realizations of a specific instance of this framework, namely top-k selection queries, that is, quickly identifying the k tuples that optimize scores assigned by monotone linear scoring functions over a variety of rank attributes and user-specified preferences. Most of these techniques for answering top-k selection queries, however, are not based on indexing. Instead, they are directed toward optimizing the number of tuples examined in order to identify the answer under various cost models of interest. Such optimizations include minimizing the number of tuples read sequentially from the input or minimizing the number of random disk accesses.
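To illustrate what limiting the number of tuples examined can look like, the sketch below uses a well-known threshold-style scan over per-attribute sorted lists (commonly called the Threshold Algorithm). It is offered only as an example of the general idea and is not claimed to be the specific prior art technique referred to above. The function name `threshold_top_k`, the in-memory dictionary, and the sample data are assumptions; weights are assumed to be non-negative so that the linear scoring function is monotone.

```python
import heapq

def threshold_top_k(entities, weights, k):
    """Top-k under a monotone linear score using sorted access.

    entities: dict mapping name -> tuple of numeric rank attribute values.
    weights:  non-negative weights, one per attribute (assumption).
    Returns (list of (score, name) best first, number of entities scored).
    """
    m = len(weights)
    # One list of names per attribute, ordered by descending attribute value.
    sorted_lists = [
        sorted(entities, key=lambda name, i=i: entities[name][i], reverse=True)
        for i in range(m)
    ]
    seen = set()
    top = []       # min-heap holding at most k (score, name) pairs
    examined = 0   # distinct entities whose full score was computed
    for depth in range(len(entities)):
        last_values = []
        for i in range(m):
            name = sorted_lists[i][depth]
            last_values.append(entities[name][i])
            if name not in seen:
                seen.add(name)
                examined += 1
                s = sum(w * v for w, v in zip(weights, entities[name]))
                if len(top) < k:
                    heapq.heappush(top, (s, name))
                elif s > top[0][0]:
                    heapq.heapreplace(top, (s, name))
        # No unseen entity can score above this bound, by monotonicity.
        threshold = sum(w * v for w, v in zip(weights, last_values))
        if len(top) >= k and top[0][0] >= threshold:
            break
    return sorted(top, reverse=True), examined

# Hypothetical data set with two rank attributes per entity.
data = {"a": (9, 8), "b": (8, 9), "c": (5, 5),
        "d": (2, 3), "e": (1, 1), "f": (3, 2)}
result, examined = threshold_top_k(data, weights=(2, 1), k=2)
print(result, examined)
```

On this sample the scan stops after the second position of each list, having scored only two of the six entities. On less favorable data the same loop can run to the end of every list and score every entity.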
However, the few available techniques that do disclose indexing for answering top-k selection queries do not provide performance guarantees; in the worst case, the entire data set has to be examined in order to identify the correct answer to a top-k selection query.