Data lineage information describes the origins and history of data. More specifically, the data lineage information describes the stages of the data life cycle, including creation, transformation, and processing of data. Data may be represented in multiple ways, ranging from files to analytic datasets, key performance indicators (KPIs), and dashboards. Data management tasks such as data modeling, data administration, and data integration rely on the data lineage information. The data lineage information is also valuable for big data projects, as organizations increasingly adopt big data infrastructures such as Amazon S3® or Apache Hadoop® to store various types of datasets (logs, receipts, feeds, etc.). The organizations also utilize Apache Hadoop® as a development infrastructure for building software information, where raw datasets are transformed and combined into aggregated data. The data provided through Amazon S3® or Apache Hadoop® data pipelines may be loaded into business intelligence (BI) infrastructures. However, it is becoming more difficult to understand, manage, and govern the large amounts of data created for the big data projects. For example, conforming to government regulations and data policies is becoming increasingly important for various industries. Because many industries lack data control at the foundation level of their data infrastructures, auditing and conformance to data management regulations are further complicated.
Two main use cases of data lineage are impact analysis and lineage analysis. For example, an impact analysis across connected systems is required when developers perform maintenance operations. Changing the organization of a dataset to meet the requirements of an application, or changing the definition of a computation specification that describes data transformations, may require an understanding of the impact such changes may have on associated computation specifications and datasets, possibly located at the connected systems. Conversely, when accessing a dataset, a user may request the original datasets from which the dataset was produced and the successive chain of data transformations that were applied, possibly across the connected systems, to produce the dataset. In this case, a lineage analysis of the dataset across the connected systems is required. Thus, the growing amount of data that forms the common data landscape of an organization, including both enterprise data and big data lakes, and the continuing trend of empowering users such as analysts and data scientists to access and prepare the data, have increased the necessity for lineage and impact analysis across networks of connected heterogeneous systems.
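The two use cases above can be illustrated with a minimal sketch, assuming that datasets and computation specifications are modeled as nodes of a directed graph whose edges follow the direction of data flow. Under that assumption, impact analysis amounts to a downstream traversal and lineage analysis to an upstream traversal. The class `LineageGraph`, its methods, and the node names (prefixed with a hypothetical system identifier such as `s3:` or `hadoop:`) are illustrative only and are not taken from any particular product.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Hypothetical lineage graph: nodes are datasets or computation
    specifications; edges point in the direction of data flow."""

    def __init__(self):
        self._downstream = defaultdict(set)  # node -> nodes it feeds
        self._upstream = defaultdict(set)    # node -> nodes it is fed by

    def add_transformation(self, name, inputs, outputs):
        """Register a computation specification with its input and output datasets."""
        for src in inputs:
            self._downstream[src].add(name)
            self._upstream[name].add(src)
        for dst in outputs:
            self._downstream[name].add(dst)
            self._upstream[dst].add(name)

    @staticmethod
    def _reachable(start, edges):
        """Breadth-first traversal returning every node reachable from start."""
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        seen.discard(start)  # exclude the start node if a cycle leads back to it
        return seen

    def impact(self, node):
        """Impact analysis: everything that may be affected by a change to node."""
        return self._reachable(node, self._downstream)

    def lineage(self, node):
        """Lineage analysis: everything that node was produced from."""
        return self._reachable(node, self._upstream)


# Illustrative data landscape spanning three connected systems (names are assumptions).
graph = LineageGraph()
graph.add_transformation("hadoop:clean_logs", ["s3:raw_logs"], ["hadoop:clean_logs_ds"])
graph.add_transformation(
    "hadoop:aggregate", ["hadoop:clean_logs_ds", "s3:receipts"], ["bi:sales_kpi"]
)

print(sorted(graph.impact("s3:raw_logs")))
print(sorted(graph.lineage("bi:sales_kpi")))
```

In this sketch, a change to `s3:raw_logs` is reported as affecting both computation specifications and the BI-level KPI, while the lineage of `bi:sales_kpi` traces back to both source datasets. A real deployment would additionally need to extract these edges from each connected system, which the sketch does not attempt.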