The present invention relates generally to database systems, and more particularly to a system for maintaining lineage information for data stored in a database.
A relational database is a collection of related data that is organized in related two-dimensional tables of columns and rows wherein information can be derived by performing set operations on the tables, such as join, sort, merge, and so on. The data stored in a relational database is typically accessed by way of a user-defined query that is constructed in a query language such as Structured Query Language (xe2x80x9cSQLxe2x80x9d). A SQL query is non-procedural in that it specifies the objective or desired result of the query in a language meaningful to a user but does not define the steps to be performed, or the order of the steps in order to accomplish the query.
Moreover, very large conventional database systems provide a storehouse for data generated from a variety of locations and applications (often referred to as data warehouses or data marts). The quality and reliability of the storehouse is greatly effected by the quality and reliability of its underlying data. Because the data can originate from a variety of sources, the quality and reliability of data will often depend on the quality and reliability of the source. Moreover, the matter is further complicated because individual rows of data within a single table can originate from different sources.
Currently, if the data in a database is questionable, there is no easy way to track the history of the data to determine where it originate or how it may have been changed. As such, it would be advantageous to users of a database to have tools that allow the users to trace aspects of the history (i.e., where the data originated and how the data has been transformed) of the data in a database.
The task of tracing aspects of the history of data in a database is further complicated in enterprise-wide databases (such as data warehouses) where data may flow into the database from direct as well as indirect data sources, (i.e., the data may have been collected from another database that itself directly or indirectly derived the data). In other words, the data may have made multiple xe2x80x9chopsxe2x80x9d before reaching the destination database of interest.
As such, there is a need for providing method and apparatus for determining information about the history (i.e., lineage) of data contained within a database.
Briefly, the present invention is directed toward database technology that provides users with powerful tools necessary to manage and exploit data. The present invention provides a system and method for tracking the lineage of data within database tables. According to an aspect of the invention, data within the tables are tracked by attaching lineage information to the data, preferably, by adding a lineage identifier to each row in a table. Data that share a common lineage can be identified by virtue of sharing a common lineage identifier.
The lineage identifier can then be used to trace the source of the data, i.e., data having a common identifier share a common history. Additionally, the lineage identifier can provide details about transformations undergone by the data. For example, the lineage identifier can act as a pointer to a detailed history files of operations that were performed on the data to transform it into its current form. Preferably, the lineage identifier tracks program modules as well as specific versions of the program modules that transformed the particular data under consideration.
As a result of the data lineage mechanism, users can trace the history data in a table, even when that data has made several hops among databases, where the data has undergone one or more transformations, or where the transforming program modules have themselves under gone revision. This provides users with a powerful mechanism to have higher confidence in the quality and reliability of data in a database and to quickly trace and correct errors in the data.
Preferably, the lineage data type is an identifier that is universally unique and is optimized to provide little impact on the performance of the data base. For example, by providing a sufficient size identifier to ensure its uniqueness while minimizing storage size. More preferably, the data lineage data type is a sixteen-byte number.