The World Wide Web (“the Web”) contains a significant volume of structured data in various domains such as finance, technology, entertainment, and travel. Typically, this data exists in Web databases, hypertext markup language (“HTML”) tables, HTML lists, and the like. Advances in data integration technologies have made it possible to query such data. For example, a vertical search engine accepts queries on the schema it provides, retrieves answers from various sources, and returns the union of the answers.
Different Web sources often provide information for the same data item. However, since dirty and erroneous information exists on the Web, data retrieved from different sources is often in conflict. For example, in data retrieved from different websites there may be different addresses for the same restaurant, different business hours for the same supermarket at the same location, different closing quotes for the same stock on the same day, and so on. In addition, the Web has made it convenient to copy data between sources, so inaccurate data can quickly propagate to other sources. Integration systems that merely take the union of the answers from various sources can thus return conflicting answers, leaving the difficult decision of which answers are correct to end users.
Recently, a variety of data fusion techniques have been proposed to resolve conflicts among different sources and create a consistent and clean set of data. Data fusion techniques aim to discover the true values that reflect the real world. To achieve this goal, these techniques not only consider the number of providers for each value, but also reward values from trustworthy sources and discount votes from copiers. Such techniques are designed for offline data aggregation. However, aggregating all information on the Web and applying fusion offline is infeasible because of the sheer volume of Web data and the frequency with which it is updated. Conversely, running the whole fusion process at query time can be quite time-consuming, making it inappropriate for query answering at runtime.
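The voting idea behind the data fusion techniques described above (counting providers, rewarding trustworthy sources, and discounting copiers) can be illustrated with a minimal sketch. This is plain weighted voting written only for illustration, not any specific published fusion technique. The function name `fuse`, the source names, the trust weights, and the copier discount factors are all hypothetical assumptions, and in practice such weights would have to be estimated rather than supplied by hand.

```python
from collections import defaultdict


def fuse(claims, trust, copy_discount):
    """Pick the value with the highest weighted vote.

    claims: list of (source, value) pairs for one data item.
    trust: source -> assumed trustworthiness in [0, 1] (hypothetical weights).
    copy_discount: source -> factor in [0, 1]; 1.0 means the source is
        treated as independent, smaller values discount a suspected copier.
    """
    scores = defaultdict(float)
    for source, value in claims:
        # A vote counts for more if the source is trusted and independent.
        weight = trust.get(source, 0.5) * copy_discount.get(source, 1.0)
        scores[value] += weight
    best = max(scores, key=scores.get)
    return best, dict(scores)


# Illustrative data: three low-trust sources agree, two of them suspected
# copiers, while one high-trust source disagrees.
claims = [
    ("siteA", "9am-9pm"),
    ("siteB", "9am-9pm"),
    ("siteC", "9am-9pm"),
    ("siteD", "8am-10pm"),
]
trust = {"siteA": 0.3, "siteB": 0.3, "siteC": 0.3, "siteD": 0.9}
copy_discount = {"siteB": 0.2, "siteC": 0.2}

best, scores = fuse(claims, trust, copy_discount)
print(best, scores)
```

With these made-up weights, a simple majority count would favor the value reported by three sources, whereas the weighted vote favors the single trusted source, because two of the three agreeing votes are discounted as copies.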