Large computer systems can gather and analyze data generated from a large number of different sources. Extremely large data sets may be analyzed computationally to reveal patterns, trends, and associations. Such large data sets are often referred to as “big data.” Big data tools can analyze high-volume, high-velocity, and high-variety information assets far better than conventional tools and relational databases, which struggle to capture, manage, and process big data within a tolerable elapsed time and at an acceptable total cost of ownership.
In large computer systems, there are often many steps between the point where data is generated and the point where it is consumed. These steps are typically performed by various computing tools that move and transform the data so that all of it is consumable by the time it reaches a final big data analytics tool.
The source of data, and the reliability and trustworthiness of that source, can affect how the data is analyzed. Data from less reliable sources may still be useful, but must be handled carefully, especially in combination with data from more reliable sources. It is difficult, if not impossible, however, to determine the reliability and trustworthiness of data when the source of the data cannot be determined. It would be useful to be able to identify where data comes from and how that data has been moved and transformed. In other words, it would be useful to trace data lineage from end to end so that data quality problems can be identified and addressed.
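As a purely illustrative sketch of what tracing lineage end to end could mean, the following Python example attaches a source label and a list of processing steps to a piece of data, so that a downstream consumer can see where the data came from and how it was moved and transformed. The names `Record`, `LineageStep`, `apply`, and `trace`, as well as the source and tool labels, are hypothetical and chosen only for illustration; they do not describe any particular tool or the approach of any specific system.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LineageStep:
    """One hop in the data's history: which tool touched it and what it did."""
    tool: str
    operation: str


@dataclass
class Record:
    """A data value that carries its source and the steps applied to it."""
    value: str
    source: str
    lineage: List[LineageStep] = field(default_factory=list)

    def apply(self, tool: str, operation: str, fn: Callable[[str], str]) -> "Record":
        # Return a new record so earlier records keep their own, shorter lineage.
        return Record(
            value=fn(self.value),
            source=self.source,
            lineage=self.lineage + [LineageStep(tool, operation)],
        )


def trace(record: Record) -> str:
    """Render the record's history from its source to its current state."""
    hops = [f"source:{record.source}"]
    hops += [f"{step.tool}:{step.operation}" for step in record.lineage]
    return " -> ".join(hops)


# Hypothetical pipeline: a raw reading is cleaned by one tool, then padded by another.
raw = Record(value=" 42 ", source="sensor-a")
cleaned = raw.apply("cleaner", "strip", str.strip)
loaded = cleaned.apply("loader", "zero-pad", lambda v: v.zfill(5))

print(loaded.value)
print(trace(loaded))
```

In this sketch, a data quality problem observed at the final step can be followed back through each recorded hop to the originating source, which is the kind of visibility the preceding paragraph describes as desirable.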