Data integration applications are commonly used to integrate data, e.g., due to mergers and acquisitions among businesses. Data integration can occur in other circumstances, such as when merging data from two different departments (or divisions) in a company, or integrating new data application software with a legacy application. Typical data integration applications are those types of applications in which data from multiple sources on varied data systems and repositories needs to be processed, combined, or otherwise transformed into data that is then loaded into multiple targets, again residing on different data systems and repositories. For example, in some cases, legacy data in text files or spreadsheets is best processed by a specialized extract, transform, and load (ETL) engine. Legacy data is any file, database, or software asset (such as a web service or business application) that has been previously deployed. Two kinds of ETL engines include a relational database engine (e.g., an Oracle or Microsoft Server engine) that can execute structured query language (SQL) statements, and proprietary engine (e.g., from Ascential) that can execute propriety language statements.
In an ETL environment in which a database server acts as an ETL engine, data integration typically begins with a user describing a data flow of an ETL process using a UI (user interface) tool. A data flow represents a logical transformation and flow of data. A code generation system generates code from the data flow, which generated code is then sent to the database server for execution. There are generally two basic requirements involved in code translation and generation—1) generation of correct code that reflects the semantics of an operator graph corresponding to the data flow, and 2) generation of optimized code that has the best performance (e.g., in terms of time and space efficiency) for the underlying ETL engine. Conventional SQL code generation systems, however, typically rely on the underlying ETL engine (or database server) to perform optimization and, therefore, such conventional code generation systems typically only directly translate an operator graph into SQL code. The quality of SQL code generated through such a direct translation approach can vary greatly and, therefore, users are required to be experts in order to verify and take advantage of the generated code. Further, while conventional ETL engines can be optimized for ETL processing, conventional ETL engines can typically only handle single-statement code, and fall short on generating an optimized plan for a much larger amount of multiple-statement code that is typically generated by a code generation system.