Extract, Transform, Load (ETL) refers to a process in database usage, and more specifically in data warehousing, that is performed by an ETL tool. The process includes extracting data from an outside source, transforming the data to fit operational needs, and loading the transformed data into an end target (e.g., a database or data warehouse). Typically, an ETL tool reads data from a source system, such as a database, transforms the data, and stores frequently used data in what is called a dataset. An ETL process typically consists of numerous ETL jobs, which the ETL tool sequences together. A dataset is typically created by one of the ETL jobs and is used by the remaining jobs in the sequence. Even after a particular dataset has been transformed, additional ETL jobs may still request that dataset. It is not uncommon for a dataset to be large, usually consisting of gigabytes (GB) of data. To obtain a particular data element in a dataset, the ETL tool typically has to search through an extensive quantity of data to locate that element. Searching through the dataset in this way can be time consuming and can delay any further requests for datasets.
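The cost of locating a single data element can be illustrated with a minimal sketch. It assumes an in-memory list of records standing in for a dataset; the names `extract`, `transform`, and `find_element`, and the record fields `id` and `name`, are hypothetical and do not correspond to any particular ETL tool.

```python
# Hypothetical sketch: one ETL job builds a dataset, a later job looks up
# one element by scanning the dataset record by record.

def extract(source_rows):
    """Extract step: read every row from the (stand-in) source system."""
    return list(source_rows)

def transform(rows):
    """Transform step: normalize the name field of each row."""
    return [{"id": r["id"], "name": r["name"].strip().upper()} for r in rows]

def find_element(dataset, key):
    """Linear scan for the record with the given id.

    Returns the record (or None) and the number of records examined.
    """
    examined = 0
    for record in dataset:
        examined += 1
        if record["id"] == key:
            return record, examined
    return None, examined

# Stand-in source data; a real dataset may hold gigabytes of records.
source = [{"id": i, "name": f" customer{i} "} for i in range(1000)]
dataset = transform(extract(source))

# A downstream job requesting the last element examines every record.
record, examined = find_element(dataset, 999)
```

In this sketch the number of records examined grows with the size of the dataset, which is the delay described above.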
Reading data from a large dataset can be similarly inefficient if the dataset is needed for multiple ETL jobs. For example, the ETL tool may request a particular dataset multiple times, and it has to scan through the data in the dataset each time. Furthermore, datasets are typically partitioned when they are stored, so that a single dataset may be stored in multiple file locations. The ETL tool has to access every one of those file locations to obtain the partitioned dataset, and it does so every time the partitioned dataset is requested for an ETL process.
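The repeated access to partition files can be sketched as follows. This is an illustration under stated assumptions only: the JSON-lines file layout, the `part-N.jsonl` file names, the `PartitionedDataset` class, and the `run_demo` helper are all hypothetical, and a real ETL tool may store partitions quite differently.

```python
import json
import tempfile
from pathlib import Path

def write_partitions(records, directory, num_partitions):
    """Spread records across several files (assumed layout: id modulo N)."""
    paths = [directory / f"part-{i}.jsonl" for i in range(num_partitions)]
    handles = [p.open("w") for p in paths]
    try:
        for rec in records:
            handles[rec["id"] % num_partitions].write(json.dumps(rec) + "\n")
    finally:
        for handle in handles:
            handle.close()
    return paths

class PartitionedDataset:
    """Hypothetical dataset that re-reads every partition on each request."""

    def __init__(self, paths):
        self.paths = paths
        self.files_opened = 0  # counts partition files opened so far

    def read_all(self):
        records = []
        for path in self.paths:
            self.files_opened += 1
            with path.open() as fh:
                records.extend(json.loads(line) for line in fh)
        return records

def run_demo(num_jobs=3, num_partitions=4):
    """Simulate several ETL jobs requesting the same partitioned dataset."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = write_partitions(
            [{"id": i} for i in range(20)], Path(tmp), num_partitions
        )
        dataset = PartitionedDataset(paths)
        rows = []
        for _ in range(num_jobs):
            rows = dataset.read_all()
        return dataset.files_opened, len(rows)
```

With the default arguments, three jobs reading a dataset split across four files open twelve files in total, although the data itself never changed between requests.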