Data-intensive applications such as decision support and e-commerce rely on being able to integrate data from various sources. To accomplish this task, a data transformation query is created between a data source and target. A variety of issues have to be addressed in identifying such a query. The data is often inconsistent owing to factors such as data entry errors and missing information. These inconsistencies must be removed before the data can be loaded and used for analysis. This is further compounded by the presence of mismatches between the source and the target schemas that need to be reconciled. As a result, the problem of data integration is widely recognized to be a significant challenge.
The space of reasonable transformation queries between data source and target can be enormous, and it is very difficult for users to consider and even conceive of all possible options. This is aggravated by the fact that a user may not understand the source data fully. As a result, users need to try different queries iteratively until a satisfactory result is obtained. Previously published work has thus identified the need for interactive tools that help users understand the impact of a transformation query.
In such an interactive environment, it is natural to reason about the difference between queries. Even a small change to a transformation query, such as changing a join column, relaxing an equi-join to a join that exploits string similarities (also known as similarity joins), changing the thresholds for similarity comparison or adding an extra join can have substantial impact on the results of the query. It is therefore very natural to ask whether such a change produces tuples that are expected at the result but were previously absent or suppresses tuples that were erroneously generated.
Of course, the difference between queries can be computed in SQL (Structure Language Query), using the EXCEPT, EXCEPT ALL or MINUS clause. However, the performance of this approach is highly inadequate, especially when the two queries are closely related to each other. In particular, assume there are two queries Q1 and Q2. Conventionally, the difference is computed by executing both Q1 and Q2 and then determining the difference.