Extract, transform, and load (ETL) is a data warehousing process that involves extracting data from outside sources, transforming the data in accordance with particular business needs, and loading the data into a data warehouse. An ETL process typically begins with a user defining a data flow, which specifies the data transformation activities that extract data from, e.g., flat files or relational tables, transform the data, and load the data into a data warehouse, data mart, or staging table. A common operation defined in a data flow is a splitter operation. A splitter operation produces multiple output data sets from a single input data set according to specified Boolean conditions. Each output data set can then be further transformed prior to being loaded into a target table.
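The splitter operation described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the function name `split`, the sample `orders` rows, and the output names `eu_orders` and `large_orders` are all hypothetical. The sketch also assumes that conditions need not be mutually exclusive, so a row may land in several outputs or in none.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]
Condition = Callable[[Row], bool]


def split(rows: Iterable[Row], conditions: Dict[str, Condition]) -> Dict[str, List[Row]]:
    """Route each input row to every output whose Boolean condition it satisfies."""
    outputs: Dict[str, List[Row]] = {name: [] for name in conditions}
    for row in rows:
        for name, condition in conditions.items():
            if condition(row):
                outputs[name].append(row)
    return outputs


# Hypothetical input data set (stands in for a flat file or relational table).
orders = [
    {"id": 1, "region": "EU", "amount": 120.0},
    {"id": 2, "region": "US", "amount": 40.0},
    {"id": 3, "region": "EU", "amount": 15.0},
]

# One Boolean condition per output data set.
result = split(
    orders,
    {
        "eu_orders": lambda r: r["region"] == "EU",
        "large_orders": lambda r: r["amount"] >= 100.0,
    },
)
```

Here order 1 satisfies both conditions and appears in both outputs, while order 2 satisfies neither and is dropped. A real data flow might instead add a default output for unmatched rows.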
Conventional splitter implementations by ETL vendors fall into two general categories. The first category includes ETL vendors that implement a proprietary ETL engine (e.g., Informatica or IBM DataStage Server), in which splitter operations are handled by the ETL engine. The second category includes ETL vendors that use a database server for ETL processing (e.g., Oracle Warehouse Builder (OWB) or Microsoft SQL Server), in which splitter operations are handled by the database server using either row-based structured query language procedural language (SQL/PL) or a combination of SQL and PL. A splitter operation can also be implemented with proprietary SQL statements; Oracle, for example, uses multiple table insert statements to insert the multiple outputs of a splitter operation into multiple target tables. An advantage of using procedural language to handle a splitter operation is that procedural code can handle complex row-based data transformations: input data can be examined row by row, and complex conditions, as well as column-level transformations, can be applied to each row prior to routing the data to a target table.
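The row-based approach can be illustrated with a short sketch. It uses Python and the standard `sqlite3` module as a stand-in for a vendor procedural language, so it is not OWB-generated code. The table names (`staging_orders`, `eu_orders`, `other_orders`), the columns, and the function `route_rows` are hypothetical. Each row is examined individually, a column-level transformation is applied, and the row is then routed to a target table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE staging_orders (id INTEGER, region TEXT, amount REAL);
    CREATE TABLE eu_orders (id INTEGER, region TEXT, amount_cents INTEGER);
    CREATE TABLE other_orders (id INTEGER, region TEXT, amount_cents INTEGER);
    INSERT INTO staging_orders VALUES (1, ' eu', 120.0), (2, 'US', 40.5), (3, 'EU ', 15.0);
    """
)


def route_rows(conn: sqlite3.Connection) -> None:
    """Examine the input row by row, transform columns, and route each row."""
    rows = conn.execute(
        "SELECT id, region, amount FROM staging_orders ORDER BY id"
    ).fetchall()
    for order_id, region, amount in rows:
        # Column-level transformations applied to this single row.
        region = region.strip().upper()
        amount_cents = round(amount * 100)
        # Splitter condition: choose the target table for this row.
        target = "eu_orders" if region == "EU" else "other_orders"
        conn.execute(
            f"INSERT INTO {target} VALUES (?, ?, ?)",
            (order_id, region, amount_cents),
        )
    conn.commit()


route_rows(conn)
eu_rows = conn.execute("SELECT id, region, amount_cents FROM eu_orders ORDER BY id").fetchall()
other_rows = conn.execute("SELECT id, region, amount_cents FROM other_orders ORDER BY id").fetchall()
```

Because the loop body is ordinary procedural code, the routing condition and the transformations can be arbitrarily complex. The cost is one insert statement per input row.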
There are, however, a few drawbacks associated with using procedural language and row-based processing to implement a splitter operation. For example, the SQL/PL code generated by OWB is generally not portable across database servers from different vendors. Also, although row-based processing provides for the application of complex conditions and column-level transformations to each input row, row-based processing is not always efficient, especially when there are a large number of rows to process for a given ETL process.
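One rough way to see the efficiency concern is to count the statements each style issues. The sketch below is an assumption-laden illustration using Python's `sqlite3` module: the schema, the helper names (`make_db`, `split_row_based`, `split_set_based`, `count`), and the set-based variant used for contrast are hypothetical and not taken from any vendor's product. Statement count is only a proxy for cost, and actual performance depends on the database server.

```python
import sqlite3


def make_db(n_rows: int) -> sqlite3.Connection:
    """Create an in-memory database with a populated staging table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE staging_orders (id INTEGER, region TEXT);
        CREATE TABLE eu_orders (id INTEGER, region TEXT);
        CREATE TABLE other_orders (id INTEGER, region TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO staging_orders VALUES (?, ?)",
        [(i, "EU" if i % 2 == 0 else "US") for i in range(n_rows)],
    )
    return conn


def split_row_based(conn: sqlite3.Connection) -> int:
    """Route rows one at a time; returns the number of INSERT statements issued."""
    statements = 0
    rows = conn.execute("SELECT id, region FROM staging_orders").fetchall()
    for order_id, region in rows:
        target = "eu_orders" if region == "EU" else "other_orders"
        conn.execute(f"INSERT INTO {target} VALUES (?, ?)", (order_id, region))
        statements += 1
    return statements


def split_set_based(conn: sqlite3.Connection) -> int:
    """Route rows with one INSERT ... SELECT per output; returns statements issued."""
    conn.execute(
        "INSERT INTO eu_orders SELECT id, region FROM staging_orders WHERE region = 'EU'"
    )
    conn.execute(
        "INSERT INTO other_orders SELECT id, region FROM staging_orders WHERE region <> 'EU'"
    )
    return 2


def count(conn: sqlite3.Connection, table: str) -> int:
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


row_db, set_db = make_db(1000), make_db(1000)
row_statements = split_row_based(row_db)
set_statements = split_set_based(set_db)
```

Both variants produce the same split, but the row-based version issues one insert per input row, so its statement count grows with the size of the input, while the set-based version issues one statement per output regardless of row count.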