Data integration refers to the combination of data from one or more sources into a homogeneous environment at a target destination. For example, a financial institution may combine data about financial transactions from multiple sources into a data warehouse. Extract, transform, and load (ETL) refers to a process that extracts data from one or more sources, transforms it to fit the operational needs of an organization, and loads it into an end target, such as a database or data warehouse. Data integration systems, such as ETL systems, may process multiple terabytes of data using multiple servers with multiple processing units. To handle large amounts of data efficiently, data integration systems may implement parallel processing techniques, such as dividing a large dataset into smaller datasets and processing each of the smaller datasets in parallel.
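The partition-and-process-in-parallel technique described above can be sketched as follows. This is a minimal illustration, not an actual ETL implementation: the record layout, the `transform` logic, and the partition count are all hypothetical, and the parallelism here uses a local process pool rather than multiple servers.

```python
from multiprocessing import Pool

def transform(record):
    # Hypothetical transform step: round a transaction amount to cents.
    return {"id": record["id"], "amount": round(record["amount"], 2)}

def process_partition(partition):
    # Each worker transforms one smaller dataset independently.
    return [transform(r) for r in partition]

def partition_data(records, n_parts):
    # Divide the large dataset into n_parts smaller datasets (round-robin).
    return [records[i::n_parts] for i in range(n_parts)]

if __name__ == "__main__":
    # Extract: a synthetic source dataset.
    data = [{"id": i, "amount": i * 1.005} for i in range(1000)]
    # Transform: process the smaller datasets in parallel.
    parts = partition_data(data, 4)
    with Pool(4) as pool:
        results = pool.map(process_partition, parts)
    # Load: merge the transformed partitions at the target.
    loaded = [row for part in results for row in part]
    print(len(loaded))  # 1000 records survive the round trip
```

In a real data integration system the partitions would be distributed across servers and the load step would write to a database or data warehouse rather than merging in memory.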
A particular parallel processing technique, referred to as grid computing, takes advantage of a distributed network of general-purpose computing nodes acting together to perform large tasks. A head node, also referred to as a conductor node, may control the scheduling and partitioning of jobs among the compute grid nodes. Each compute grid node may process one or more smaller datasets in parallel with the other nodes. In such a system, a complete dataset may be represented as a collection of smaller datasets stored at locations particular to the compute grid nodes processing each of the smaller datasets. The complete dataset may contain metadata indicating the locations of the smaller datasets that comprise it.
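The idea of a complete dataset whose metadata records where each smaller dataset lives can be sketched with a simple descriptor structure. This is an illustrative sketch only; the class and field names (`DistributedDataset`, `PartitionInfo`, node names, paths) are assumptions, not an actual grid system's metadata format.

```python
from dataclasses import dataclass, field

@dataclass
class PartitionInfo:
    node: str   # compute grid node holding this smaller dataset
    path: str   # node-local location of the partition's data

@dataclass
class DistributedDataset:
    """Metadata view of a complete dataset split across grid nodes."""
    name: str
    partitions: list = field(default_factory=list)

    def add_partition(self, node, path):
        # Conductor records where a smaller dataset was placed.
        self.partitions.append(PartitionInfo(node, path))

    def locations(self):
        # Resolve the metadata: which node holds which partition.
        return {p.node: p.path for p in self.partitions}

# A conductor node might record placements like this:
ds = DistributedDataset("transactions")
ds.add_partition("node1", "/grid/data/transactions.part0")
ds.add_partition("node2", "/grid/data/transactions.part1")
```

The key point the paragraph makes is that the "complete dataset" is only this metadata plus the node-local partitions; no single node necessarily holds all the data.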
A large ETL process may comprise several stages, with intermediate datasets created at each stage. Because datasets in a traditional data integration system may contain metadata specific to that system's resources, such as node names and local storage paths, datasets created in one data integration system may not be readable by a second data integration system.