Data warehousing systems typically collect data from multiple distributed data sources and store integrated and summarized information in local databases for efficient reporting, analysis or data mining. Often when using the data in a data warehouse, it is useful to identify the source data from which some particular data warehouse item was derived. For a given data warehouse item, the source or sources that were used to produce the data warehouse item and what and how other data were transformed to produce it is known as the data lineage of the data warehouse item.
In one approach to providing the data lineage of a data warehouse item, as each data item is loaded into the data warehouse the identity of the source of the data item is recorded. Unfortunately, documenting the source of the data item does not provide information regarding what and how other data were transformed to produce it. Furthermore, identifying the source is often a recursive problem, since the direct source of the data warehouse item may draw its information from another source, and so on, defining a path from source to source, often across numerous systems, until one or more root sources of data has been identified. And where the systems along the source path are developed at different times by different entities using different technologies, identifying the source path may be difficult.
In another approach the software jobs that are run to extract, transform, and load data into the data warehouse are analyzed to determine how data are read, filtered, translated, aggregated, and otherwise transformed and written. By examining these jobs the source data of a data warehouse item may be inferred. However, these jobs, which typically number in the dozens or hundreds, are often designed to be used with data sources whose identities only become known to the jobs at runtime, and thus the identity of the sources are not hard-coded into the job. Furthermore, these jobs often include logic that controls the flow of data based on conditions that exist as the job is run, and such conditions cannot be determined by examining a job design. Thus, attempting to determine data lineage from such an examination would require taking into account all possible data flow paths within a job, resulting in multiple possible lineages for any given data warehouse item, rather than a single true lineage.