ETL is a general tem used to describe a process in data warehousing (or populating databases) that involves extracting data from sources, transforming it to fit the business needs, and ultimately loading it into the database.
The first part of an ETL process is to extract the data from the source systems. Many database populating projects consolidate data from many different sources. Each separate source may also use a different data organization structure and/or format. Such data sources may include relational databases and flat files, but may also include non-relational database structures. The extraction step extracts and parses the data into a format for the transform process.
The transform process typically applies rules and/or functions to the extracted data, to prepare the data to be loaded. The requirements for the transformation process are a function of the extracted data and the ultimate form of the data for the database. Such transformation rules and functions include selecting certain fields, translating coded symbols, encoding new symbols, calculating values, merging sources, etc.
The load process loads the transformed data into the database. Depending on the requirements of the organization, this process ranges widely. Some databases merely overwrite old information with new data. More complex systems can maintain a history and audit trail of all changes to the database.
An ETL process can be very complex, with significant operational problems. One such problem is scalability. By its nature, the ETL process may be performed periodically, with increasing amounts of data and data sources to process. Further, the transformation rules and functions may be continuously modified to handle data differently based on the changing needs of the business.
Another problem with running an ETL process is that it can require a significant amount of processing time, and can easily involve errors and other unforeseen conditions that can halt the process. Typically, a process stage can not be restarted, but must be run again from the beginning. This can significantly lengthen the time to perform the complete ETL process.
Still another problem with the ETL process is that it requires a large amount of effort to design and build the ETL system for a particular domain. A mid-range data warehouse project can have 200-300 separate jobs. There are few automation tools or techniques to assist the developers, who must design, build, run, and maintain such a complex system.