Data integration technologies facilitate providing and managing meaningful information to obtain a competitive business advantage, for example by harnessing historical data to aid future decisions. At their core, integration technologies are systems and methods to extract, transform, and load (ETL) data. Data can be provided from myriad sources including enterprise resource planning (ERP) and customer relationship management (CRM) applications as well as flat files and spreadsheets, among others. Extraction mechanisms can retrieve data from several different sources. After data is extracted, it can be transformed into a consistent format associated with a target repository. Some data may only need to be reformatted during the transformation process; other data may need to be cleansed, for instance to remove duplicates. Subsequently, data can be loaded into a data warehouse, data mart, or the like, where it can be mined and otherwise analyzed to yield beneficial information.
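The three ETL stages described above can be sketched as follows. This is a minimal illustration assuming hypothetical in-memory sources and a list as the target repository; the function names and sample records are not any product's actual API.

```python
# Minimal sketch of the three ETL stages: extract from several sources,
# transform into a consistent format (cleansing duplicates), then load
# into a target repository. All names here are hypothetical.

def extract(sources):
    """Pull raw records from several sources (in-memory lists here)."""
    for source in sources:
        yield from source

def transform(records):
    """Reformat rows to a consistent shape and cleanse duplicates."""
    seen = set()
    for rec in records:
        key = rec["id"]
        if key in seen:          # cleanse: drop duplicate rows
            continue
        seen.add(key)
        yield {"id": key, "name": rec["name"].strip().title()}

def load(records, warehouse):
    """Append transformed rows to the target repository (a list here)."""
    warehouse.extend(records)

erp = [{"id": 1, "name": " alice "}, {"id": 2, "name": "BOB"}]
crm = [{"id": 2, "name": "bob"}]     # duplicate of id 2 from another source
warehouse = []
load(transform(extract([erp, crm])), warehouse)
# warehouse now holds two cleansed rows, for Alice and Bob
```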
More than half of an extract, transform, and load process typically needs to be custom programmed for an organization. In one conventional implementation, packages are central to such a program; a package represents a unit of work that can be independently retrieved, executed, and/or saved. Furthermore, the package serves as a container for all other elements, broadly characterized as control flow or data flow elements.
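A package in the sense above, an independently executable and savable unit of work that contains its elements, might be sketched like this. The `Package` class and its methods are hypothetical illustrations of the structure, not a real product's object model.

```python
# Sketch of a package as a container for control flow and data flow
# elements, executable and savable as one unit. Hypothetical names.
import json

class Package:
    def __init__(self, name):
        self.name = name
        self.control_flow = []   # tasks, containers, constraints
        self.data_flow = []      # sources, transforms, destinations

    def add_task(self, name, work):
        self.control_flow.append((name, work))

    def execute(self):
        """Run the package's tasks as a single unit of work."""
        return [work() for _, work in self.control_flow]

    def save(self):
        """Serialize the package so it can be retrieved independently."""
        return json.dumps({"name": self.name,
                           "tasks": [n for n, _ in self.control_flow]})

pkg = Package("nightly_load")
pkg.add_task("extract", lambda: "extracted")
pkg.add_task("load", lambda: "loaded")
results = pkg.execute()      # -> ["extracted", "loaded"]
saved = pkg.save()           # JSON string describing the package
```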
Control flow elements dictate the processing sequence in a package. They can include one or more containers that define package structure, tasks that define package functionality or work, and precedence constraints that link executables, containers, and tasks and specify the order in which the linked objects execute. Control flow elements prepare or copy data, interact with other processes, or implement repeating workflows.
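Precedence constraints amount to an ordering over linked tasks, which can be honored by a topological sort. The tiny scheduler and task names below are hypothetical illustrations of the concept, not a specific product's API.

```python
# Sketch of control flow: tasks linked by precedence constraints that
# fix execution order. Uses the standard library's topological sorter.
from graphlib import TopologicalSorter

def run_package(tasks, constraints):
    """Execute tasks, honoring precedence constraints (succ <- preds)."""
    order = TopologicalSorter()
    for succ, preds in constraints.items():
        order.add(succ, *preds)
    for name in tasks:
        order.add(name)          # include unconstrained tasks too
    executed = []
    for name in order.static_order():
        tasks[name]()            # run the task's work in order
        executed.append(name)
    return executed

log = []
tasks = {
    "prepare": lambda: log.append("prepare"),
    "copy":    lambda: log.append("copy"),
    "cleanup": lambda: log.append("cleanup"),
}
# "copy" may run only after "prepare"; "cleanup" only after "copy".
constraints = {"copy": ["prepare"], "cleanup": ["copy"]}
run_package(tasks, constraints)
# log: ["prepare", "copy", "cleanup"]
```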
Data flow elements, namely source adapters, transformations, and destination adapters, define, as the name suggests, the flow of data in a package that extracts, transforms, and loads data. Source adapters make data available to a data flow. Transformations modify data, for example by aggregation (average, sum), merging (of multiple input data sets), distribution (to different outputs), and data type conversion. Destination adapters load the output of the data flow into target repositories such as flat files, databases, or memory.
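The three data flow roles can be chained in the obvious way: a source adapter feeds rows to an aggregation transformation whose output a destination adapter writes out. All names below are hypothetical illustrations of those roles.

```python
# Sketch of a data flow: source adapter -> aggregation transformation
# -> destination adapter, with an in-memory list as the target.
from statistics import mean

def source_adapter(rows):
    """Make source data available to the data flow."""
    yield from rows

def aggregate_transform(rows, key, value):
    """Group rows by `key` and emit the average of `value` per group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row[value])
    for k, values in groups.items():
        yield {key: k, "avg_" + value: mean(values)}

def destination_adapter(rows, target):
    """Load the data flow's output into a target repository (a list)."""
    target.extend(rows)

rows = [
    {"dept": "sales", "salary": 50},
    {"dept": "sales", "salary": 70},
    {"dept": "hr",    "salary": 60},
]
target = []
destination_adapter(
    aggregate_transform(source_adapter(rows), "dept", "salary"), target)
# target: [{"dept": "sales", "avg_salary": 60}, {"dept": "hr", "avg_salary": 60}]
```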
A data flow pipeline employs multiple elements or components tied together via collections of metadata. A data flow pipeline, or a diagram thereof, can include components and paths that define how data moves through or with respect to a task. For example, if a task corresponds to reading a text file that holds rows and columns of employee information, the file could be full of row data such as first name, last name, social security number, and the like. Each column has metadata associated with it, for example that name is a string and age is a number. This metadata is important to the data flow because it tells the engine moving the data, and the components acting on that data, what types of operations can be performed successfully on it; different operations can be executed on numbers than on strings. If the metadata changes, then actions downstream will break. For instance, assume one starts with a column age that is a number, and downstream a component uses the age to compute an average age. If the column is subsequently amended to be a string, the data flow will break, as the average operation cannot compute the average of a string. To remedy this situation, a user conventionally must fix the components manually to account for the metadata change.
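The age example can be sketched concretely: a downstream component validates the column metadata it depends on, and a metadata change from number to string makes that validation fail. The `AverageAge` class and the metadata dictionary are hypothetical illustrations, not a real pipeline API.

```python
# Sketch of column metadata propagating through a data flow, and of a
# downstream component breaking when the metadata changes. Hypothetical.
from statistics import mean

class AverageAge:
    """Downstream component that averages a numeric 'age' column."""
    def validate(self, metadata):
        # A component checks upstream column metadata before data moves.
        if metadata.get("age") not in (int, float):
            raise TypeError("AverageAge requires 'age' to be a number")
    def run(self, rows):
        return mean(row["age"] for row in rows)

metadata = {"name": str, "age": float}   # column name -> column type
component = AverageAge()
component.validate(metadata)             # passes: age is numeric
avg = component.run([{"name": "Ann", "age": 30.0},
                     {"name": "Bo",  "age": 40.0}])   # -> 35.0

# An upstream change amends age to a string; validation now fails, and
# conventionally the user must fix the component by hand.
metadata["age"] = str
try:
    component.validate(metadata)
    broke = False
except TypeError:
    broke = True
```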