In many data processing and analysis applications, especially those involving large amounts of data, top-k ranking queries are often used to obtain only the k most relevant data tuples for inspection, with relevance represented as a score computed by a scoring function. There are many existing techniques for answering such ranking queries in the context of deterministic relational databases, in which each data tuple is an ordered sequence of deterministic attribute values. A typical deterministic relational database employs a deterministic relation to encode a set of tuples, each having the same attributes, to yield a single data set instantiation, with each tuple representing a particular deterministic occurrence of an ordered sequence of the attribute values. A top-k query of such a deterministic relational database returns the k tuples having the top scores in the single data set instantiation, based on a specified scoring function that evaluates the ordered sequence of attribute values to determine a single score for each tuple.
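By way of illustration only, a deterministic top-k query of the kind described above might be sketched as follows. The relation contents, the attribute layout, and the names `relation`, `score`, and `top_k` are hypothetical and chosen solely for this example; the scoring function shown (a sum of two numeric attributes) is likewise an assumption, since any function mapping a tuple to a single score could be used.

```python
import heapq

# Hypothetical deterministic relation: each tuple is an ordered sequence of
# attribute values (an identifier followed by two numeric attributes).
relation = [
    ("t1", 3.0, 10.0),
    ("t2", 7.0, 2.0),
    ("t3", 5.0, 5.0),
    ("t4", 1.0, 1.0),
]


def score(tup):
    """Illustrative scoring function: the sum of the two numeric attributes."""
    return tup[1] + tup[2]


def top_k(rel, k, scoring_fn):
    """Return the k tuples with the highest scores, highest first."""
    return heapq.nlargest(k, rel, key=scoring_fn)


# Each tuple has exactly one score, so the ranking is unambiguous.
result = top_k(relation, 2, score)
```

Because the relation yields a single data set instantiation, every tuple maps to exactly one score and the k highest-scoring tuples are well defined.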
A probabilistic database uses an uncertainty relation to encode the set of tuples into multiple possible non-deterministic data set instantiations due to the randomness associated with each tuple. Accordingly, each tuple may exhibit different scores, having respective different likelihoods, for some or all of the different possible non-deterministic data set instantiations realized by the uncertainty relation. Because each tuple can be associated with multiple different scores having respective different likelihoods, conventional top-k query techniques that rank tuples assuming a single score per tuple are generally not applicable in a probabilistic database setting.
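The difficulty can be seen in a small sketch, offered here only as an illustration under stated assumptions. Each tuple is assumed to carry a discrete set of (score, likelihood) pairs, and the tuples are assumed to be mutually independent so that the likelihood of a data set instantiation is the product of the chosen likelihoods; neither assumption is required by the discussion above. The names `uncertain_relation`, `possible_worlds`, and `top_k_in_world` are hypothetical.

```python
from itertools import product

# Hypothetical uncertainty relation: each tuple maps to a list of
# (score, likelihood) pairs whose likelihoods sum to 1.
uncertain_relation = {
    "t1": [(10.0, 0.5), (2.0, 0.5)],
    "t2": [(6.0, 1.0)],
    "t3": [(8.0, 0.25), (4.0, 0.75)],
}


def possible_worlds(rel):
    """Yield (scores, likelihood) for every possible data set instantiation.

    Assumes tuples are independent, so likelihoods multiply.
    """
    names = list(rel)
    for combo in product(*(rel[name] for name in names)):
        likelihood = 1.0
        scores = {}
        for name, (s, p) in zip(names, combo):
            scores[name] = s
            likelihood *= p
        yield scores, likelihood


def top_k_in_world(scores, k):
    """Conventional top-k within one instantiation (one score per tuple)."""
    return tuple(sorted(scores, key=scores.get, reverse=True)[:k])


# Accumulate, per tuple, the total likelihood of the instantiations in which
# that tuple has the highest score.
top1_probability = {}
for scores, likelihood in possible_worlds(uncertain_relation):
    winner = top_k_in_world(scores, 1)[0]
    top1_probability[winner] = top1_probability.get(winner, 0.0) + likelihood
```

In this hypothetical data, the highest-scoring tuple differs from one instantiation to another, so a conventional top-k query applied to any single instantiation does not by itself yield one answer for the relation as a whole.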