Data integration is a key challenge faced by many data scientists, spreadsheet end-users, and others. Despite several recent advances in techniques that make it easier for users to perform data analysis, the inferences obtained from an analysis are only as good as the information diversity of the underlying data. A lot of rich and useful data is now available on the web, but typically in some semi-structured format. This gives rise to many interesting integration tasks that require combining spreadsheets with the rich data available on the web. Currently, however, users have to either carry out these tasks manually or write complex scripts to find the desired web sources and scrape the required data from those sources. Unfortunately, many users (e.g., data scientists and spreadsheet end-users) come from diverse backgrounds and lack the programming knowledge needed to write such complicated scripts.
There are three main challenges in programmatically joining web data with relational data in a table. First, most websites do not provide a direct interface for obtaining the complete tabular data, so a user needs to formulate logic for reaching the webpage that corresponds to each table row. Second, after reaching the desired webpage, the user needs to write extraction logic that depends on the underlying DOM structure of the webpage to retrieve the relevant data. Third, the common data key used to join the two sources (spreadsheet and webpage) might be in different formats, which requires writing additional logic to transform the data before performing the join.
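The three challenges can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions and not a method described in this document: the `example.com` URL pattern, the `price` class name, the sample company names, and all function names are hypothetical, and an inline string stands in for a fetched page so that the example runs without network access.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def page_url(row):
    """Challenge 1: build the per-row page URL (hypothetical URL pattern)."""
    return "https://example.com/search?" + urlencode({"q": row["company"]})


def extract_price(page):
    """Challenge 2: extraction logic tied to the page's DOM structure.

    Assumes a well-formed page in which the value sits in a
    <span class="price"> element (a hypothetical layout).
    """
    node = ET.fromstring(page).find(".//span[@class='price']")
    return node.text if node is not None else None


def normalize_key(name):
    """Challenge 3: bring join keys from both sources into one format."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


# A spreadsheet row and a stand-in for the page reached for that row.
row = {"company": "Acme Corp."}
page = ("<html><body><h1>ACME CORP</h1>"
        "<span class='price'>42.10</span></body></html>")

url = page_url(row)
price = extract_price(page)
# The spreadsheet key and the page heading differ in format but normalize equally.
keys_match = normalize_key(row["company"]) == normalize_key("ACME CORP")
```

Even in this toy form, each step embeds site-specific knowledge (the URL pattern, the DOM location of the value, the key format), which is the kind of logic a user would otherwise have to write by hand for every source.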
Existing approaches have explored two different strategies for dealing with the first challenge of finding the webpages that contain the required data. A first existing approach is fully automatic: it takes a few input-output examples and searches a huge database of web forms for forms that can transform the given example inputs into the corresponding outputs. However, many websites do not expose web forms and do not have the data in a single table. Other existing approaches are programming by demonstration (PBD) systems, which rely on users to demonstrate, for a few examples, how to navigate to the desired webpages. Although PBD systems can handle a broader range of webpages than the first approach, they place an additional burden on users, who must perform exact demonstrations to reach each webpage; this requirement has been shown to be problematic for users.
The existing approaches and systems also assume that the desired data in each webpage can be retrieved using simple extraction logic. For instance, the first existing approach mentioned above uses an absolute XPath program to extract the data. However, this amounts to a very strong assumption that the different webpages share nearly the same structure. On the other hand, there are efficient systems for wrapper induction, a technique for learning robust and generalizable extraction programs from a few labeled examples. However, most of these techniques have only been applied to extracting huge amounts of data from the web. Applying wrapper induction to data integration tasks poses new challenges, because the data that needs to be extracted can be conditioned on the input data in the table.
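The fragility of position-based extraction, and the idea of extraction conditioned on the input row, can be sketched as follows. This is a hedged illustration rather than any of the systems discussed above: the three inline pages, the path expressions, and the helper names are hypothetical. Python's `xml.etree.ElementTree` supports only a subset of XPath, so a positional path from the root element stands in for an absolute XPath.

```python
import xml.etree.ElementTree as ET

# Two hypothetical pages from the same site; PAGE_B has an extra <div>.
PAGE_A = ("<html><body><div>Header</div>"
          "<div><span class='price'>10</span></div></body></html>")
PAGE_B = ("<html><body><div>Header</div><div>Ad</div>"
          "<div><span class='price'>20</span></div></body></html>")

# A hypothetical page listing one value per year.
PAGE_C = ("<html><body><table>"
          "<tr><th>2019</th><td>1.5</td></tr>"
          "<tr><th>2020</th><td>1.8</td></tr>"
          "</table></body></html>")

POSITIONAL = "./body/div[2]/span"          # tied to the exact page layout
BY_ATTRIBUTE = ".//span[@class='price']"   # tied to a semantic attribute


def text_at(page, path):
    """Return the text of the first node matching path, or None."""
    node = ET.fromstring(page).find(path)
    return node.text if node is not None else None


def value_for_year(page, year):
    """Extraction conditioned on input data: pick the row labeled `year`."""
    return text_at(page, f".//tr[th='{year}']/td")
```

Here `text_at(PAGE_A, POSITIONAL)` finds the value, while the same path applied to `PAGE_B` returns `None` because the extra element shifts the target; `BY_ATTRIBUTE` matches on both pages. In `value_for_year`, the path itself depends on a value taken from the table row, which is the conditioning that distinguishes integration tasks from bulk extraction.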