When performing Extraction-Load-Transform (ETL) from a large dataset (Source) into a warehouse or data-mart application (Destination), it may take from days to months to complete the operation, depending on the size and complexity of the data and the transformation being performed. Additionally, once the data is loaded, it needs to be synchronized (Refreshed) with the Source data when it changes.
Traditionally, however, a Destination would be required to wait for the initial ETL to finish before the Refresh occurs, causing obvious efficiency problems. The initial ETL would start with the oldest data in the system and process until it has reached the most current data in the Source system. The reason to do this was technology-bound: When processing a dataset, previous data was required before updates to that data could be processed. A Refresh could not be started until all previous data was processed because any updates would be lost and technology would not know what to do with the partial information.
Consequently, this limitation would result in significant loss of time for a data-warehouse application waiting for the historical data to be loaded. In fact, systems could not be declared “operational” until data was loaded because the most important and useful information, namely the most current data, was processed last. Customers would not like this, but they know that there was no choice, since that is how ETL traditionally was done.
ETL processes can be quite complex, and significant operational problems can occur with improperly designed ETL systems. The range of data values or data quality in an operational system may be outside the expectations of designers at the time validation and transformation rules are specified. Data-profiling of a Source during data analysis is recommended to identify the data conditions that will need to be managed by transform rules specifications. This will lead to an amendment of validation rules explicitly and implicitly implemented in the ETL process.
Data warehouses are typically fed asynchronously by a variety of Sources which all serve a different purpose, resulting in, for example, different reference data. ETL is a key process to bring heterogeneous and asynchronous Source extracts to a homogeneous environment.
The scalability of an ETL system across the lifetime of its usage needs to be established during analysis. This includes understanding the volumes of data that will have to be processed within service level agreements (SLAs). The time available to extract from Source systems may change, which may mean the same amount of data may have to be processed in less time. Some ETL systems have to scale to process terabytes of data to update data warehouses with tens of terabytes of data. Increasing volumes of data may require designs that can scale from daily batch to intra-day micro-batch to integration with message queues or real-time change data capture (CDC) for continuous transformation and update.
To meet these challenges, a recent development in ETL software is the implementation of parallel processing. This has enabled a number of methods to improve overall performance of ETL processes when dealing with large volumes of data. There are 3 main types of parallelisms as implemented in ETL applications:                Data—By splitting a single sequential file into smaller data files to provide parallel access.        Pipeline—Allowing the simultaneous running of several components on the same data stream. An example would be looking up a value on record 1 at the same time as adding together two fields on record 2.        Component—The simultaneous running of multiple processes on different data streams in the same job. Sorting one input file while performing a de-duplication on another file would be an example of component parallelism.        
All three types of parallelism are usually combined in a single job. However, an additional difficulty is making sure that the data being uploaded is relatively consistent. Since multiple-Source databases all have different update cycles (for example, some may be updated every few minutes, while others may take days or weeks), an ETL system may be required to hold back certain data until all Sources are synchronized. Likewise, where a data warehouse may have to be reconciled to the contents in a Source system or with the general ledger, establishing synchronization and reconciliation points is necessary.
It should be noted that nowhere in the above-discussed parallelism solutions is a mention of being able to process starting in the middle of a data feed, processing forward and backward, or of optimizing how data is queried from the Source systems, which are concepts that would ETL significantly better.