Generally, enterprise computing environments involve a plurality of distributed computing devices that coordinate over a network to execute one or more applications in one or more domains for the enterprise. For example, an enterprise computing environment may include a web-facing application (or portal) that communicates with an authentication server to validate logins to the application, a business intelligence (BI) database that stores relevant information used by the application, a web services application server that provides connectors to other computing devices and application workflow functionality, one or more low-level databases or data stores that provide data to the application, and so forth. Each of these computing elements may comprise data objects that store data used by the enterprise system, and the same or similar data objects may be used by multiple data sources. That is, data may be ingested into the enterprise system by a first data source, which then relays all (or a portion) of the data object to other data sources in the system, such that a data object may be used by many different data sources as part of the application. The data flow connections between different data sources that involve the data object are collectively known as a data lineage.
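By way of illustration only, a data lineage of this kind can be pictured as a directed graph in which each data source points to the data sources it relays a data object to. The following minimal sketch assumes a hypothetical four-source lineage (the names `ingest_source`, `bi_database`, `app_server`, and `web_portal` are invented for this example) and uses a breadth-first search to compute how many hops each data source lies from the source that ingested the object. It is not drawn from any particular system or product.

```python
from collections import deque

# Hypothetical lineage for one data object: each key is a data source,
# and each value lists the data sources it relays the object to.
lineage = {
    "ingest_source": ["bi_database", "app_server"],
    "bi_database": ["web_portal"],
    "app_server": ["web_portal"],
    "web_portal": [],
}


def depth_from(source, edges):
    """Return the hop count from `source` to every reachable data source."""
    depths = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for downstream in edges.get(current, []):
            # Breadth-first order means the first visit is the shortest path.
            if downstream not in depths:
                depths[downstream] = depths[current] + 1
                queue.append(downstream)
    return depths


depths = depth_from("ingest_source", lineage)
```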
However, it is typically difficult to understand end-to-end data lineage information in large production computing systems for several reasons. First, such systems usually have incomplete (or missing) end-to-end data object information—as more complex applications are built, it becomes harder to keep track of how data is disseminated in the system. Second, there may be multiple formats for representing metadata about data sources or data objects in the production system—and such formats cannot easily be reconciled. Third, there is typically a lack of connection between terminology used at the application level and technical data used at the data source level—so changes to a high-level application cannot be assessed to determine potential impact on data objects or sources.
Existing solutions (such as Becubic™ from ASG Technologies, or Collibra™) have limited capability to perform data lineage analysis, and typically do not leverage heterogeneous and/or unstructured data and metadata relating to the specific data sources and data objects in a system to assess and rank data object change impact in view of relationships between data sources, or in view of the depth (or distance) of an impacted data object from its input data source. In addition, such solutions do not use advanced machine learning algorithms and techniques to advantageously self-learn using existing data lineage information in conjunction with incident tickets (arising from data object errors), both to discover indirect relationships between data sources and to assess the likelihood of failure if a data object is changed. As a result of the above deficiencies, there is no meaningful way to perform data lineage identification and data object change impact analysis in a production computing system.