Extract, transform, and load (ETL) is a process in data warehousing that involves extracting data from outside sources, transforming the data in accordance with particular business needs, and loading the data into a data warehouse. An ETL process typically begins with a user defining a data flow, which specifies data transformation activities that extract data from, e.g., flat files or relational tables, transform the data, and load the data into a data warehouse, data mart, or staging table. A data flow, therefore, typically includes a sequence of operations modeled as data flowing from various types of sources, through various transformations, and finally ending in one or more targets, as described in the U.S. patent application entitled “Classification and Sequencing of Mixed Data Flows” incorporated by reference above. In the course of execution of a data flow, data sometimes needs to be exchanged or staged at intermediate points within the data flow. The staging of data typically includes saving the data temporarily either in a structured physical storage medium (such as a simple file) or in temporary or persistent database tables. In some cases, it may be optimal to save rows of data in the processing program's own memory, especially when large and fast caches are present in the system (such “staging” is often referred to as “caching”).
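By way of illustration only, the extract, transform, stage, and load sequence described above might be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation of any particular ETL product: the source data, the column names (`id`, `amount`, `cents`), the target table name `sales`, and all function names are hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import os
import sqlite3
import tempfile

# Hypothetical flat-file source; the column names are illustrative only.
SOURCE_CSV = "id,amount\n1,10.5\n2,20.0\n3,-4.0\n"


def extract(text):
    """Extract: read rows from a flat-file source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: keep positive amounts and convert them to integer cents."""
    return [
        {"id": int(r["id"]), "cents": round(float(r["amount"]) * 100)}
        for r in rows
        if float(r["amount"]) > 0
    ]


def stage(rows, path):
    """Stage: save the intermediate rows temporarily in a simple file."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "cents"])
        writer.writeheader()
        writer.writerows(rows)


def load(path, conn):
    """Load: read the staged rows and insert them into the target table."""
    conn.execute("CREATE TABLE sales (id INTEGER, cents INTEGER)")
    with open(path, newline="") as f:
        rows = [(int(r["id"]), int(r["cents"])) for r in csv.DictReader(f)]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)


def run_flow():
    """Run source -> transform -> staging file -> target, then query the target."""
    conn = sqlite3.connect(":memory:")  # assumed stand-in for a data warehouse
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        stage(transform(extract(SOURCE_CSV)), path)
        load(path, conn)
    finally:
        os.remove(path)  # the staged data is temporary
    return conn.execute("SELECT id, cents FROM sales ORDER BY id").fetchall()
```

In this sketch the staging file is an intermediate point in the flow: it is written after the transformation and read back before the load, and it is deleted once the target has been populated.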
ETL vendors conventionally support data exchange and staging internally, inside an ETL engine, in a proprietary fashion, especially if the ETL engine is running outside of a relational database. For example, the DataStage ETL engine permits users to build “stages” of operations (i.e., discrete steps in the transformation sequence) and physically move rows between different stage components in memory. (Note: the term “stage,” as used in the context of the DataStage engine, does not refer to the concept of saving rows to a physical medium, but rather to unique operational steps.) This method typically allows for some types of performance optimizations; however, the rows of data being moved between the different stages are usually in an internal format (stored in internal memory formats in buffer pools), and the only way a user can view the rows of data is to explicitly define a File Target (or a Table Target) in the data flow and force the rows of data to be saved into a file (or a table). That is, only the target of such a data flow can physically export the rows into a user-recognizable format.
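The general pattern described above, in which rows pass between operational steps in memory and become visible to the user only at a target, can be illustrated with a generic sketch. The sketch does not use or represent the DataStage engine or its interfaces; the function names, the sample rows, and the use of Python tuples as the "internal format" are all assumptions made for illustration.

```python
import csv
import io


def source_stage():
    """First operational step: produce rows in an in-memory internal format."""
    # Plain tuples stand in for an engine's internal row format (an assumption).
    for row in [(1, "alice"), (2, "bob")]:
        yield row


def upper_stage(rows):
    """Second operational step: transform rows as they move through memory."""
    for row_id, name in rows:
        yield (row_id, name.upper())


def file_target(rows, out):
    """Target: the only step that exports rows in a user-recognizable format."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "name"])
    writer.writerows(rows)


def run():
    """Chain the steps; intermediate rows are never written anywhere."""
    out = io.StringIO()  # stands in for a file opened by a File Target
    file_target(upper_stage(source_stage()), out)
    return out.getvalue()
```

Because the rows handed from `source_stage` to `upper_stage` exist only as in-memory objects, the sketch offers no way to inspect them short of adding another target to the flow.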
Accordingly, a common problem of conventional data exchange and staging techniques is that users are not able to specify staging points explicitly and directly in the middle of a data flow, but only at the end of a transformation sequence using target operators. Target operators typically do not serve as exchange operators, since target operators are destinations. For example, if a user needs to extract rows from a SQL (structured query language) table and pass the rows as input to another type of system that requires a file as input, then the user would have to represent such a process with a first job, consisting of a Table Source operation followed by a File Target or Export operation having a specific file name. The user would then have to schedule a second (separate) job to invoke an operation that uses the file as input.
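The two-job workaround described above might look roughly like the following sketch. It is offered only as an illustration under stated assumptions: the table name `customers`, the file name `customers_export.csv`, and the function names are hypothetical, an in-memory SQLite database stands in for the SQL source, and the two jobs are called back to back here although in practice they would be scheduled separately.

```python
import csv
import os
import sqlite3
import tempfile


def job_one(conn, export_path):
    """First job: a Table Source followed by a File Target with a specific name."""
    rows = conn.execute("SELECT id, name FROM customers ORDER BY id")
    with open(export_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "name"])
        writer.writerows(rows)


def job_two(import_path):
    """Second, separate job: a file-based system consumes the exported file."""
    with open(import_path, newline="") as f:
        return [r["name"] for r in csv.DictReader(f)]


def run_both_jobs():
    """Set up a sample table, then run the two jobs in sequence."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Raj")]
    )
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "customers_export.csv")
        job_one(conn, path)   # the first data flow ends at the file target
        return job_two(path)  # a separate job must be invoked to use the file
```

The only link between the two jobs is the agreed file name, which is why the file cannot act as an exchange point inside a single data flow in this arrangement.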