ETL jobs are designed to extract data from one or more database or non-database systems (called source systems), transform the data that is extracted based on business logic, and load the transformed data to one or more database or non-database systems (called target systems). The term “database tables” is used herein to represent database tables and non-database data that may be “mapped” to “sets of rows”. For example, a Comma Separated Values (CSV) file may be mapped to a table, where each comma separated field in a line in the CSV file is mapped to a table column value, and the whole line is mapped to a row in the mapped table.
A traditional ETL tool executes the ETL jobs on a local machine (where a machine is a computing system) that may access one more local or remote source systems. The source and target databases may not reside on the same physical machines as the ETL tool (i.e., on the local machine). While running the ETL job, the ETL tool connects to the database that may be hosted on a remote machine over a network and extracts the data to the local machine, transforms the data locally, and loads the transformed data to the target database that is located on another remote machine in the network.
The data that the ETL job processes has been growing rapidly. In a large enterprise, there may be hundreds of such ETL jobs running in parallel. When the source or target data is present on some other location (e.g., a cloud system), it may take a large amount of time to fetch the data to the local machine and then load the transformed data back to the cloud. This increases the completion time of the ETL job and also increases the load on the local machine executing the ETL job. This may result in, for example, network timeout, ETL jobs crashing because of lack of memory, an ETL job hanging or running forever, network congestion, etc.