The present invention relates to the field of controllers and methods for controlling Extract-Transform-Load (ETL) processors.
In the field of very large data storage repositories, such as data warehouses, there is frequently a need to take data from a plurality of sources, often under the control of heterogeneous data storage systems, and to aggregate the data in such a way as to make it capable of coherent processing. The need for aggregation of data from such a plurality of data sources has given rise to a number of systems designed to perform the tasks of extracting, transforming and loading the data.
Before a repository of data can be effectively used as a source of truly usable information, it is usually created or updated using many sources. Most often, the data that is accumulated (and later used for update of the repository) is of a different format residing on an external system than what is ultimately needed in the repository. The process of acquiring this data and converting it into useful, compatible and accurate data is often labelled ETL (for Extraction, Transformation, and Load).
Extraction is the task of acquiring the data (in whatever format might be possible) from the source systems. This can be as simple as dumping a flat file from a database or spreadsheet, or as sophisticated as setting up relationships with external systems that then supervise the transportation of data to the target system.
Transformation is often more than just converting data formats (although this is a critical step in getting the data to the target system). Data from external systems may contain incompatible or incorrect information, depending on the checks and balances that were in effect on the external system. Part of the transformation step is to “cleanse” or “reject” the data that does not conform. Common techniques used as part of this step include character examination (for example, reject numeric value fields that contain characters) and range checking (reject values outside of an acceptable range). Rejected records are usually deposited in a separate file and are then processed by a more sophisticated tool or manually to correct the problems. The values are then rolled into the transformed set.
Load is the stage in which the captured and transformed data is deposited into the new data store (warehouse, mart, etc.). For SQL-compatible database systems, this process can be accomplished with SQL commands (IMPORT), utilities (LOAD), or integrated tools. Additionally, the total ETL process can be accomplished via third party applications, often decreasing or eliminating the need for custom programming. The ETL process can be as simple as transferring some data from one table to another on the same system. It can also be as complex as taking data from an entirely different system that is thousands of miles away and rearranging and reformatting it to fit a very different system.
At its very simplest level an ETL (Extract Transform Load) job is a process that reads data from one source (such as a database), transforms it (for example, remove trailing spaces), and finally writes it to a target (such as a file). An ETL job design consists of one or more stages, each stage performing a discrete function such as read from database, sort data, merge data etc. The data read from, or written to, stages may be represented as links that join the stages together. ETL job designs can vary from the simple, consisting of only a handful of stages, to the complex where the number of stages can exceed one hundred.
An ETL job design is typically constructed by the user (an ‘ETL developer’) dragging and dropping stages onto a graphical canvas and then linking their input and outputs together. The stages chosen, the way they are joined together, and the values of properties set will together satisfy the high level requirements for that job. Currently ETL developers need to be extremely knowledgeable about the ETL application and know exactly what stages they should use to achieve this requirement. This becomes a barrier for customers who want to get their developers up and running quickly with their ETL applications. Even for developers who are proficient with the application, it can be hard to remember exactly what stage can be linked to other stages and in what circumstances such links are recommended or not. Such barriers to learning add cost to the process and introduce potentially significant opportunities for human error.