Traditional data warehouse systems employ a “top down” or “schema on write” approach to collect and store data according to a predefined schema. A predefined schema can provide a logical structure to the data that can enable efficient reporting and analysis in some situations. However, a “schema on write” approach to data processing requires a substantial investment in initial planning and design to arrive at the schema that will be used to organize the data. Effective planning and design typically require comprehensive knowledge of the data to be collected, the users and organizations that will utilize the data, and the purposes and goals of using the data. As the scale of data being stored and processed continues to increase and the manner in which such data is used continues to evolve, data warehouse systems implementing a “schema on write” approach become increasingly difficult to design, cumbersome to manage, and difficult to change to adapt to user needs.
A “bottom up” or “schema on read” approach differs from the “schema on write” approach used in traditional data warehouses in that the schema used to organize and process the data is applied only at the time of reading the data. In other words, structure is applied to otherwise unstructured data when it is read, for example, to query the data or perform other processing jobs. Large scale data technologies, such as Apache Hadoop™, typically employ this “schema on read” approach to allow users to effectively utilize large amounts of unstructured data without having to invest the time and effort to create a predefined schema for structuring the data when writing the data to storage. However, as the amount of data grows exponentially, there is a need for automatic collection, visualization, and utilization of upstream and downstream data lineage in these distributed database systems (e.g., to verify a system's reliability or to further optimize or reconfigure the system).
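The “schema on read” idea described above can be illustrated with a minimal sketch. It is not taken from any particular system: the record contents, the `USAGE_SCHEMA` mapping, and the `read_with_schema` function are hypothetical names chosen for illustration, and plain Python with JSON lines stands in for a large scale storage and query engine. The point is only that the raw records are stored as they arrive, and a reader imposes field names, types, and defaults when it reads them.

```python
import json

# Hypothetical raw records, stored exactly as they arrived.
# Nothing validated or reshaped them at write time.
RAW_LINES = [
    '{"user": "alice", "bytes": "1024", "region": "us-east"}',
    '{"user": "bob", "bytes": "2048"}',
    '{"user": "carol", "bytes": "512", "extra": {"note": "unused here"}}',
]

# Hypothetical schema chosen by one reader: field -> (converter, default).
# A different reader could apply a different schema to the same raw lines.
USAGE_SCHEMA = {
    "user": (str, None),
    "bytes": (int, 0),
    "region": (str, "unknown"),
}


def read_with_schema(lines, schema):
    """Parse each raw line and project it onto the schema at read time."""
    for line in lines:
        record = json.loads(line)
        yield {
            field: convert(record[field]) if field in record else default
            for field, (convert, default) in schema.items()
        }


rows = list(read_with_schema(RAW_LINES, USAGE_SCHEMA))
total_bytes = sum(row["bytes"] for row in rows)
```

In this sketch, missing fields receive the reader's default and fields outside the schema are dropped, so the stored data never has to be rewritten when a reader's needs change. A “schema on write” design would instead run the equivalent conversion and validation before storing each record.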