Database systems are storing increasing amounts of valuable data. A database system can collect and store millions or even billions of new pieces of information every day. For example, a social networking website that is used by hundreds of millions of users on a daily basis may collect information regarding the time of each sign-in, the time of each sign-out, each web page visited, data entered on each web page, and so on. As another example, a provider of smartphone applications that are used by millions of users may collect the input (e.g., keystrokes) of each user who interacts with the applications and other application-specific data such as location of use, advertisements displayed, advertisements clicked on, and so on. As yet another example, a conglomerate may comprise many corporations that each maintain diverse databases to store information of the corporation such as sales databases, employee databases, customer databases, product databases, and so on.
These various database systems, or more generally data sources, may store data on diverse computer systems that are distributed throughout the world and use diverse query engines. For example, the provider of applications (e.g., for mobile devices) may store the data for each application on a different computing system at a different location. The data sources may store data in various forms such as tables of a relational database, files with comma-separated values (“CSVs”), spreadsheet files, fact tables of triples (i.e., subject, predicate, object), eXtensible Markup Language (“XML”) files, and so on. These data sources also provide different query engines, each of which may be most appropriate for accessing the data of its own data source. For example, the query engines may employ Structured Query Language (“SQL”), SPARQL Protocol and RDF Query Language (“SPARQL”), XML Query Language (“XQuery”), application-specific application programming interfaces (“APIs”), and so on.
Data scientists are often tasked with extracting knowledge or insights from these data sources. For example, a provider of applications may want to maximize its advertising revenue resulting from advertisements displayed by the applications. A data scientist can help the provider by determining which types of advertisements are most effective for which types of users. Many tools are available to help a data scientist extract knowledge. These tools include machine learning tools, pattern recognition tools, statistical modeling tools, and so on. To use these tools, a data scientist needs to extract the data of interest from the various data sources and combine it. It would be very time-consuming and expensive for a data scientist to develop queries to extract data from each of these data sources, which may use very different query engines and may be at geographically separated locations.
Federated database systems (also referred to as “federation engines”) have been developed to assist a data scientist in extracting and combining data from such data sources. A federated database system provides a common query engine that employs a common query language for extracting data from data sources. For example, the common query language may be standard SQL. To use the federated database system, a data scientist inputs target queries in the common query language. A target query specifies the data sources from which data is to be extracted and various criteria that the extracted data must satisfy. To process a target query, a federated database system generates, for each data source, a query in the query language of that data source, sends the queries to the data sources, receives the query results, and combines the query results to generate the query results for the target query.
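The processing of a target query described above can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in chosen for illustration and is not taken from any particular federated database system: one data source is an in-memory SQLite table queried with SQL, the other is a CSV text filtered in Python, and the names `make_sql_source`, `CSV_DATA`, `query_sql_source`, `query_csv_source`, and `run_target_query` are assumptions.

```python
import csv
import io
import sqlite3

# Hypothetical data source 2: sales records held as CSV text.
CSV_DATA = "region,amount\nnorth,75\nsouth,300\n"


def make_sql_source():
    """Hypothetical data source 1: a relational table queried with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("east", 100), ("west", 250)],
    )
    return conn


def query_sql_source(conn, min_amount):
    """Express the criterion in the source's own language (SQL)."""
    cursor = conn.execute(
        "SELECT region, amount FROM sales WHERE amount >= ?",
        (min_amount,),
    )
    return [(region, amount) for region, amount in cursor]


def query_csv_source(text, min_amount):
    """Express the same criterion as a row filter over CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        (row["region"], int(row["amount"]))
        for row in reader
        if int(row["amount"]) >= min_amount
    ]


def run_target_query(min_amount):
    """Send one query per source, then combine the partial results."""
    conn = make_sql_source()
    try:
        partial_results = [
            query_sql_source(conn, min_amount),
            query_csv_source(CSV_DATA, min_amount),
        ]
    finally:
        conn.close()
    # Combine: a simple union of rows, sorted for a stable order.
    combined = [row for part in partial_results for row in part]
    return sorted(combined)


print(run_target_query(100))
```

In this sketch the combining step is a plain union of rows. A real federation engine would typically also need to perform joins, aggregations, and type reconciliation across the data sources.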
A federated database system could simply extract all the data from each data source, store the data locally, and execute the target query against the locally stored data. Such an approach, however, has several problems. One problem is that it can be very time-consuming and expensive to extract all the data, transmit the extracted data for local storage, and store the data locally. Another problem is that data that is stored locally may quickly become out of date unless a complicated and expensive update process is employed.
To help reduce the amount of data that needs to be extracted and transmitted from each data source, a federated database system may push some of the query processing down to the various data sources. For example, if a target query includes an expression specifying that only data after a certain date is needed, the federated database system may generate a query for each data source that specifies that only data after that date is to be extracted. This reduces the amount of data that needs to be transmitted from the data sources to the federated database system. Often, however, the query processing cannot be pushed down to the data sources because of incompatibilities between the common query language and the query languages of the data sources. As a result, vast amounts of data may still need to be transmitted from the data sources to the federated database system.
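The effect of pushing a date criterion down to a data source can be sketched as follows. This is only an illustration under stated assumptions: the data source is a hypothetical in-memory SQLite table named `events`, dates are stored as ISO 8601 text so that string comparison matches date order, and the functions `make_source`, `fetch_without_pushdown`, and `fetch_with_pushdown` are invented for this example.

```python
import sqlite3


def make_source():
    """Hypothetical data source holding one event per row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id TEXT, event_day TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [
            ("ann", "2023-01-05"),
            ("bob", "2023-06-10"),
            ("cy", "2023-09-01"),
            ("dee", "2022-12-31"),
        ],
    )
    return conn


def fetch_without_pushdown(conn, cutoff):
    """Transmit every row, then filter at the federated system."""
    transmitted = conn.execute(
        "SELECT user_id, event_day FROM events"
    ).fetchall()
    kept = [row for row in transmitted if row[1] > cutoff]
    return len(transmitted), sorted(kept)


def fetch_with_pushdown(conn, cutoff):
    """Let the data source apply the date criterion before sending."""
    transmitted = conn.execute(
        "SELECT user_id, event_day FROM events WHERE event_day > ?",
        (cutoff,),
    ).fetchall()
    return len(transmitted), sorted(transmitted)


conn = make_source()
rows_sent_all, result_all = fetch_without_pushdown(conn, "2023-03-01")
rows_sent_push, result_push = fetch_with_pushdown(conn, "2023-03-01")
conn.close()

print(rows_sent_all, rows_sent_push)   # rows transmitted in each case
print(result_all == result_push)       # both approaches give the same answer
```

Both functions return the same rows, but the pushed-down version transmits only the rows that satisfy the criterion. When the criterion cannot be expressed in the query language of the data source, only the first approach is available.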