Enterprises and organizations may have information spread across many different databases in diverse locations. These databases may be memory areas within the same data store, or may reside on separate data store units. For example, a manufacturing enterprise may have customer contact information in a sales department database; accounting information (e.g., invoicing, accounts receivable, payment, credit, etc.) in another database; and manufacturing department information (e.g., bill of parts, vendor, assembly instructions, etc.) in yet another database. Alternatively, several departments may each keep customer information in their own databases, but the information may be organized differently in each database (by name, by account number, by phone number, first name first, last name first, etc.). An information integration system may process the information in the various databases with a flow that gathers the information from these databases and relocates it to a common repository referred to as a data warehouse.
An information integration flow may be a series of instructions responsible for extracting data from data sources, transforming the data, and finally loading the data into a central data warehouse. Conventional ETL (extract, transform, load) tools may implement their own design and representation methods for capturing the semantics (e.g., relationships between the various data elements) and the functionality of information integration flows.
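For illustration only, the extract, transform, and load stages described above might be sketched as follows. The source tables, field names, and the `normalize_name` helper are hypothetical and are not taken from any particular ETL tool; the sketch simply reconciles customer names that are listed differently in two sources before loading them into a single in-memory "warehouse".

```python
from typing import Dict, List

# Hypothetical source records; field names and values are illustrative only.
SALES_DB = [{"name": "Ada Lovelace", "phone": "555-0100"}]
ACCOUNTING_DB = [{"customer": "Lovelace, Ada", "balance_due": 120.0}]


def extract() -> Dict[str, List[dict]]:
    """Extract: read the raw rows from each source."""
    return {"sales": SALES_DB, "accounting": ACCOUNTING_DB}


def normalize_name(raw: str) -> str:
    """Convert 'Last, First' to 'First Last'; leave other forms unchanged."""
    if "," in raw:
        last, first = [part.strip() for part in raw.split(",", 1)]
        return f"{first} {last}"
    return raw.strip()


def transform(sources: Dict[str, List[dict]]) -> List[dict]:
    """Transform: merge rows that refer to the same customer under one key."""
    merged: Dict[str, dict] = {}
    for row in sources["sales"]:
        key = normalize_name(row["name"])
        merged.setdefault(key, {"name": key})["phone"] = row["phone"]
    for row in sources["accounting"]:
        key = normalize_name(row["customer"])
        merged.setdefault(key, {"name": key})["balance_due"] = row["balance_due"]
    return list(merged.values())


def load(rows: List[dict], warehouse: List[dict]) -> None:
    """Load: append the integrated rows to the target repository."""
    warehouse.extend(rows)


def run_flow() -> List[dict]:
    """Run the three stages in sequence and return the populated warehouse."""
    warehouse: List[dict] = []
    load(transform(extract()), warehouse)
    return warehouse
```

In this sketch, `run_flow()` yields a single integrated customer record even though the two sources list the customer's name in different orders.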
These ETL tools may use a graphical user interface to represent these flows. For example, an information integration flow may be represented by a directed acyclic graph (DAG). A DAG is a graph-based representation that may model several different kinds of structure in, for instance, computer science. The nodes (vertices) of the DAG may represent processing elements or steps (e.g., transformations, operations, tasks, activities), and the edges (lines) may represent the sequence of data flow and control between them. Data may enter a node through its incoming edges and leave the node through its outgoing edges. Design constructs may be used to automatically create, from the DAG, executable scripts for running the ETL process.
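As a minimal sketch of the DAG representation described above, a flow may be encoded as a mapping from each node to the set of nodes whose output it consumes, and an execution order may then be derived by topological sorting. The node names below are hypothetical; the sketch uses Python's standard `graphlib` module (Python 3.9 or later) and does not reflect how any specific ETL tool generates its scripts.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set

# Hypothetical flow: each key is a node, each value is the set of
# predecessor nodes connected to it by an incoming edge.
FLOW: Dict[str, Set[str]] = {
    "extract_sales": set(),
    "extract_accounting": set(),
    "join_customers": {"extract_sales", "extract_accounting"},
    "load_warehouse": {"join_customers"},
}


def execution_order(flow: Dict[str, Set[str]]) -> List[str]:
    """Return the nodes so that every node follows all of its predecessors.

    Raises CycleError if the graph contains a cycle, i.e., is not a DAG.
    """
    return list(TopologicalSorter(flow).static_order())


def is_valid_order(flow: Dict[str, Set[str]], order: List[str]) -> bool:
    """Check that each predecessor appears before the node that consumes it."""
    position = {node: index for index, node in enumerate(order)}
    return all(
        position[pred] < position[node]
        for node, preds in flow.items()
        for pred in preds
    )
```

An executable script could then visit the nodes in the returned order, running each step only after the steps feeding it have completed.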
As information integration flows become more complex, and business managers impose stricter time-related requirements (e.g., smaller execution windows, longer uptime, fresher data, more fault-tolerant flows, etc.), optimizing information integration flows may become more important. Currently, information integration flow optimization is typically done manually. Some ETL tools provide primitive optimization mechanisms, but these conventional tools cannot optimize a large-scale, real-world information integration flow as a whole.
Research results suggest optimization techniques at different design levels. Information integration flows may be optimized in two phases: flow design and flow execution. Example optimization techniques may include flow rewriting and restructuring, partitioning, use of recovery points, redundancy, scheduling, choice among alternative implementations for the same task, changes in resource allocation, etc.
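One of the listed techniques, flow rewriting, can be illustrated with a simplified sketch. The rule assumed here is only one possible rewrite: a filter step may be moved ahead of a preceding row-wise transformation when the filter does not read any field that the transformation writes, so that fewer rows reach the transformation. The `Step` class, its fields, and the example steps are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Step:
    name: str
    kind: str                     # "filter" or "map"
    reads: FrozenSet[str]         # fields the step inspects
    writes: FrozenSet[str]        # fields the step creates or changes
    fn: Callable[[dict], object]  # predicate for a filter, row -> row for a map


def push_filters_early(steps: List[Step]) -> List[Step]:
    """Move each filter ahead of preceding maps whose output it does not read."""
    result = list(steps)
    for i in range(1, len(result)):
        j = i
        while (
            j > 0
            and result[j].kind == "filter"
            and result[j - 1].kind == "map"
            and not (result[j].reads & result[j - 1].writes)
        ):
            result[j - 1], result[j] = result[j], result[j - 1]
            j -= 1
    return result


def run(steps: List[Step], rows: List[dict]) -> Tuple[List[dict], int]:
    """Apply the steps in order; return the output rows and the number of
    rows that were processed by map steps (a stand-in for cost)."""
    map_calls = 0
    for step in steps:
        if step.kind == "filter":
            rows = [row for row in rows if step.fn(row)]
        else:
            map_calls += len(rows)
            rows = [step.fn(row) for row in rows]
    return rows, map_calls


# Hypothetical steps and data used to exercise the rewrite.
to_cents = Step("to_cents", "map", frozenset({"amount"}), frozenset({"amount_cents"}),
                lambda row: {**row, "amount_cents": row["amount"] * 100})
only_eu = Step("only_eu", "filter", frozenset({"region"}), frozenset(),
               lambda row: row["region"] == "EU")
ROWS = [
    {"region": "EU", "amount": 10},
    {"region": "US", "amount": 20},
    {"region": "EU", "amount": 30},
]
```

Running `run(push_filters_early([to_cents, only_eu]), ROWS)` produces the same rows as the original ordering while passing fewer rows through the transformation, which is the kind of saving a design-time rewrite aims for.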
Conventional information integration projects may be designed for correct functionality and adequate performance, i.e., to complete within a specified time window. However, optimization of the information integration flow design is a task left to the experience and intuition of the human designers of the information integration flow.