1. Field of Invention
This invention relates to relational databases. More specifically, this invention is directed to a system and method for obtaining complete and correct answers from databases that are incomplete and/or partially incorrect.
2. Description of Related Art
A database is usually assumed to be complete and correct. For example, in a relational database it is assumed that the extension of every relation contains all the tuples that are required to be in the relation. However, there are situations in which database query interfaces attempt to access databases that are only partially complete (i.e., some tuples are missing) and/or contain partially incorrect information. For example, database query interfaces that provide access to multiple heterogeneous information databases often encounter incomplete databases. For instance, a database query interface may have access to a university repository that has a database of publications authored by some, but not all, of the faculty and students of that university. By contrast, the same interface may have access to the database of the Library of Congress, which has all of the books published in the United States in the past few decades.
If a database is only partially complete, the database query interface should reconsider the meaning of an answer provided by the database to a given query. For queries that do not contain negation, the answers obtained from the database are guaranteed to be a subset of the answers that would have been obtained if the database were complete. However, the database query interface must still determine whether the answer is complete. In addition, when the database query interface issues a query that contains negation to the incomplete database, the resulting answer may not be correct.
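The two effects described above can be illustrated with a small sketch. It is not part of the described system; the relations Faculty and Publication, their contents, and the helper `run` are hypothetical and chosen only to echo the university repository example. The sketch evaluates one query without negation and one query with negation against a complete and an incomplete copy of the same database.

```python
import sqlite3

def run(publications):
    """Load a tiny hypothetical database and return (positive, negated) answers."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Faculty (name TEXT)")
    con.execute("CREATE TABLE Publication (author TEXT, title TEXT)")
    con.executemany("INSERT INTO Faculty VALUES (?)",
                    [("Ann",), ("Bob",), ("Cy",)])
    con.executemany("INSERT INTO Publication VALUES (?, ?)", publications)
    # Query without negation: faculty members who have a publication.
    positive = {row[0] for row in con.execute(
        "SELECT DISTINCT f.name FROM Faculty f, Publication p "
        "WHERE f.name = p.author")}
    # Query with negation: faculty members who have no publication.
    negated = {row[0] for row in con.execute(
        "SELECT f.name FROM Faculty f WHERE NOT EXISTS "
        "(SELECT * FROM Publication p WHERE p.author = f.name)")}
    con.close()
    return positive, negated

# Hypothetical data: the incomplete copy is missing Bob's publication.
COMPLETE = [("Ann", "Paper A"), ("Bob", "Paper B")]
INCOMPLETE = [("Ann", "Paper A")]

pos_full, neg_full = run(COMPLETE)
pos_part, neg_part = run(INCOMPLETE)

# Without negation, the incomplete answer is a subset of the complete one.
print(pos_part <= pos_full)
# With negation, the incomplete database yields an answer that is wrong.
print(neg_part - neg_full)
```

In this instance the negation-free query merely loses an answer, whereas the query with negation reports a faculty member who does in fact have a publication.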
As an illustrative example, consider a database query interface that has access to several online databases with information about movies. The relation schema contains the following relations:
Movie (TITLE, DIRECTOR, YEAR)
Show (TITLE, THEATER, HOUR)
Oscar (TITLE, YEAR)
The relation "Movie" contains tuples describing the title, director and year of production of movies. The relation "Show" describes the movies playing in New York City. Specifically, the relation Show contains tuples that describe the title of the movie, the theater in which the movie is playing and at what hour the movie plays. The relation "Oscar" contains a tuple for each movie that won an Oscar™ award, giving the title of the movie and the year in which it won the award.
Assume that the relations Show and Oscar are known to be complete, and that the relation Movie is complete only from the year 1960 (i.e., it may be missing movies from earlier years). Further assume that the interface has issued the following query Q1 that asks for the pair (movie, director) for movies currently playing in New York City:
(Q1): SELECT m.TITLE, m.DIRECTOR
      FROM Movie m, Show s
      WHERE m.TITLE=s.TITLE
The answer to this query may be incomplete. Intuitively, the answer is incomplete because if some of the missing tuples are inserted into the relation Movie, the answer to the query may change. By contrast, assume that the interface issues the following query Q2 that asks for directors whose movies have won Oscars™ since 1965:
(Q2): SELECT m.DIRECTOR
      FROM Movie m, Oscar o
      WHERE m.TITLE=o.TITLE AND
            m.YEAR=o.YEAR AND
            o.YEAR>=1965
The answer to this query is complete, even though the relation Movie is not complete. This is because only Movie tuples whose third argument (YEAR) is 1965 or greater can be joined with tuples from the relation Oscar to yield an answer to the query. Because Movie is complete from 1960 onward, it is complete for that part of the relation, and the answer to the query is therefore complete. Furthermore, because the answer to Q2 is guaranteed to be complete, the answer to the following query is also guaranteed to be complete:
(Q3): SELECT m.DIRECTOR
      FROM Movie m, Oscar o
      WHERE m.TITLE=o.TITLE AND
            NOT EXISTS (SELECT * FROM Movie m1, Oscar o1
                        WHERE m.DIRECTOR=m1.DIRECTOR AND
                              m1.TITLE=o1.TITLE AND
                              o1.YEAR>=1965)
The query Q3 asks for directors who have won Oscars™ but who have not won any additional Oscars™ since 1965.
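The behavior of the first two queries in the movie example can be checked on a concrete instance. The following is a minimal sketch, not the method of the invention: the movie titles, directors, theaters and the helper `answers` are hypothetical, and running it on one instance illustrates the argument but does not prove completeness in general. It evaluates the query over Movie and Show and the query over Movie and Oscar restricted to 1965 or later, first on a Movie relation that is missing a pre-1960 tuple and then after that tuple is inserted.

```python
import sqlite3

# Query for (movie, director) pairs of movies currently playing.
Q1 = """SELECT m.TITLE, m.DIRECTOR
        FROM Movie m, Show s
        WHERE m.TITLE = s.TITLE"""

# Query for directors whose movies have won Oscars since 1965.
Q2 = """SELECT m.DIRECTOR
        FROM Movie m, Oscar o
        WHERE m.TITLE = o.TITLE AND m.YEAR = o.YEAR AND o.YEAR >= 1965"""

# Hypothetical data. Show and Oscar are assumed complete.
SHOW = [("Film A", "Theater 1", "20:00"), ("Film C", "Theater 2", "21:00")]
OSCAR = [("Film A", 1970), ("Film C", 1955)]
# Movie as stored: complete from 1960 onward, missing a 1955 movie.
MOVIE_STORED = [("Film A", "Dir X", 1970), ("Film B", "Dir Y", 1985)]
MOVIE_MISSING = [("Film C", "Dir Z", 1955)]

def answers(movie_rows):
    """Return the answers to Q1 and Q2 for the given Movie extension."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Movie (TITLE TEXT, DIRECTOR TEXT, YEAR INTEGER)")
    con.execute("CREATE TABLE Show (TITLE TEXT, THEATER TEXT, HOUR TEXT)")
    con.execute("CREATE TABLE Oscar (TITLE TEXT, YEAR INTEGER)")
    con.executemany("INSERT INTO Movie VALUES (?, ?, ?)", movie_rows)
    con.executemany("INSERT INTO Show VALUES (?, ?, ?)", SHOW)
    con.executemany("INSERT INTO Oscar VALUES (?, ?)", OSCAR)
    q1 = set(con.execute(Q1).fetchall())
    q2 = set(con.execute(Q2).fetchall())
    con.close()
    return q1, q2

q1_stored, q2_stored = answers(MOVIE_STORED)
q1_full, q2_full = answers(MOVIE_STORED + MOVIE_MISSING)

# The first query gains an answer once the missing tuple is inserted.
print(q1_full - q1_stored)
# The second query is unaffected by tuples from before 1960.
print(q2_stored == q2_full)
```

Inserting the pre-1960 tuple changes the answer to the first query but cannot change the answer to the second, because a Movie tuple with YEAR below 1960 can never satisfy the condition that its year equals an Oscar year of 1965 or later.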
An approach for determining answer-correctness and answer-completeness in the presence of incorrect or incomplete databases is described in Motro et al., "Integrity = Validity + Completeness," ACM Transactions on Database Systems, Vol. 14, No. 4, December 1989, pages 480-502. This approach is based on describing the complete or valid parts of the database as "views." Given a query Q, if the query can be rewritten using the complete or valid views, then the answer is complete. Motro describes a method that finds "rewritings" of the queries using the views. However, the method is not guaranteed to find a rewriting if one exists.
A complete method for finding rewritings of queries using views is described in Levy et al., "Answering Queries Using Views," Proceedings of the 14th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, San Jose, Calif., 1995 (Levy 1). However, finding a rewriting of the query using views has not been shown to be a necessary condition for answer-completeness. In addition, the method described in Motro does not address the problem of determining answer-completeness with respect to a specific database instance (i.e., whether the answer to the query is complete given the current state of the database).
Another method for determining answer-completeness is described in Etzioni et al., "Tractable Closed World Reasoning With Updates," Proceedings of KR-94, 1994. Etzioni determined that answer-completeness is closed under conjunction and partial instantiation of queries, and used these determinations as the basis for the method. However, the method disclosed in Etzioni is not guaranteed to always detect answer-completeness. In addition, the method only allows a very restricted class of local completeness statements. In particular, the method disclosed in Etzioni does not allow local completeness statements that contain existential variables, or that contain interpreted predicates.
Yet another method for determining answer-completeness is disclosed in Elkan, "Independence of Logic Database Queries and Updates," Proceedings of the 9th ACM Symposium on Principles of Database Systems, 1990, pages 154-160. However, this method applies only to queries with no self-joins (i.e., at most one occurrence of every relation in the query).