Databases and Information Retrieval have taken two philosophically different approaches to queries. In databases, SQL queries have a rich structure and precise semantics, which makes it possible for users to formulate complex queries and for systems to apply complex optimizations. Yet, users need to have a relatively detailed knowledge of the database in order to formulate queries. For example, a single misspelling of a constant in the WHERE clause of a query results in an empty set of answers, frustrating casual users. By contrast, a query in Information Retrieval (IR) is just a set of keywords and is easy for casual users to formulate. IR queries offer two important features that are missing in databases: the results are ranked, and the matches may be uncertain, i.e., the answer may include documents that do not match all the keywords in the query. While several proposals exist for extending SQL with uncertain matches and ranked results, they are either restricted to a single table, or, when they handle join queries, adopt ad-hoc semantics.
To illustrate the point, consider the following structurally rich query, asking for an actor whose name is like “Kevin” and whose first “successful” movie appeared in 1995:
SELECT A.name
FROM Actor A, Film F, Casts C
WHERE C.filmid = F.filmid
  and C.actorid = A.actorid
  and A.name ≈ “Kevin”
  and F.year ≈ 1995
  and F.rating ≈ “high”

SELECT MIN(F.year)
FROM Film F, Casts C
WHERE C.filmid = F.filmid
  and C.actorid = A.actorid
  and F.rating ≈ “high”
The three ≈ operators indicate that the predicates are intended as uncertain matches. Techniques such as edit distances, ontology-based distances, IDF-similarity, and QF-similarity can be applied to a single table, to rank all Actor tuples (according to how well they match the first uncertain predicate) and to rank all Film tuples. It is unclear, however, how to rank the entire query, which is considered complex because it includes a nested query (i.e., the second section, wherein a result must be selected with regard to the film year). To date, no system combines structurally rich SQL queries with uncertain predicates and ranked results. No conventional approach is able to effectively determine accurate probability results for queries that include joins, nested subqueries, aggregates, group-by, and existential/universal quantifiers.
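The single-table case described above can be illustrated with a minimal Python sketch. It ranks the tuples of one table against one uncertain predicate using a normalized edit distance. The table contents, the `similarity` normalization, and the function names are assumptions made for illustration; they are not taken from any particular system.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic program, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to a score in [0, 1]; 1.0 means identical (case-insensitive)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


# Hypothetical Actor rows: (actorid, name).
ACTORS = [(1, "Kevin"), (2, "Kevn"), (3, "Kelvin"), (4, "Keanu"), (5, "Meryl")]


def rank_actors(pattern: str, actors=ACTORS):
    """Return (score, actorid, name) for every row, best match first."""
    scored = [(similarity(name, pattern), actorid, name) for actorid, name in actors]
    return sorted(scored, key=lambda row: (-row[0], row[1]))


if __name__ == "__main__":
    for score, actorid, name in rank_actors("Kevin"):
        print(f"{score:.2f}  {actorid}  {name}")
```

A score of this kind orders the rows of one table only. It does not by itself say how scores from several uncertain predicates should be combined across joins or a nested query, which is the gap noted above.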
This problem has been addressed in the past by employing a database in which each tuple has an associated probability, which represents the probability that the tuple actually belongs to the database. Examples of probabilistic relational databases are shown below. However, the results using such databases with the conventional approach are often incorrect, as demonstrated below. When queries are evaluated over a probabilistic database, the system should preferably compute a traditional query answer, as well as a probability for each tuple in the answer. The answer tuples might then be sorted according to this latter probability, and presented to the user. Users would then be able to inspect the top results returned, e.g., up to 20-40 answers, which should represent the most relevant answers to the query.
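The idea of returning a probability with each answer tuple and sorting by it can be sketched in Python as follows. The sketch assumes that tuples are independent of one another and computes each answer's probability by brute-force enumeration of every possible database instance. Both the independence assumption and the enumeration are illustrative choices, not a method stated in this text, and the table contents and probabilities are hypothetical. The enumeration is exponential in the number of uncertain tuples, so it is suitable only for very small examples.

```python
from itertools import product

# Hypothetical probabilistic tables: each tuple carries the probability that it
# belongs to the database. Values are illustrative only.
ACTOR = {("a1", "Kevin"): 0.9, ("a2", "Kevn"): 0.5}
FILM = {("f1", 1995): 0.8, ("f2", 1995): 0.5}
CASTS = [("a1", "f1"), ("a1", "f2"), ("a2", "f2")]  # treated as certain


def query(actors, films):
    """Ordinary (deterministic) query: names of actors cast in a 1995 film."""
    film_ids = {fid for fid, year in films if year == 1995}
    actor_names = dict(actors)
    return {actor_names[aid] for aid, fid in CASTS
            if aid in actor_names and fid in film_ids}


def answer_probabilities():
    """Sum the probability of every possible instance in which an answer appears.

    Assumes tuples are independent, so an instance's probability is the product
    of p for each tuple present and (1 - p) for each tuple absent.
    """
    uncertain = [("actor", t, p) for t, p in ACTOR.items()] + \
                [("film", t, p) for t, p in FILM.items()]
    totals = {}
    for present in product([True, False], repeat=len(uncertain)):
        world_p = 1.0
        actors, films = [], []
        for (kind, tup, p), keep in zip(uncertain, present):
            world_p *= p if keep else 1.0 - p
            if keep:
                (actors if kind == "actor" else films).append(tup)
        for name in query(actors, films):
            totals[name] = totals.get(name, 0.0) + world_p
    # Present answers sorted by decreasing probability.
    return sorted(totals.items(), key=lambda kv: -kv[1])


if __name__ == "__main__":
    for name, p in answer_probabilities():
        print(f"{p:.2f}  {name}")
```

With these illustrative values, "Kevin" is an answer whenever tuple a1 is present and at least one of f1 or f2 is present, giving 0.9 × (1 − 0.2 × 0.5) = 0.81, while "Kevn" requires both a2 and f2, giving 0.5 × 0.5 = 0.25.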
Adding probabilities to relational databases is known in the prior art. However, the prior art does not explain how probabilities added to a database can be made applicable to a wide range of applications, such as queries with uncertain predicates, queries over two databases for which there are fuzzy object matches, and queries over integrated data that violate some global constraints. Nor does the prior art provide an efficient approach to computing probabilistic answers to queries.