Data curation involves creating knowledge from structured, semi-structured, and/or unstructured data sources. Large-scale data curation flows involve multiple steps in which data from different sources are extracted, transformed, and linked with data from other sources to create meaningful knowledge.
Experience shows that curation flows for unstructured and semi-structured sources are complex. As a non-limiting example from the financial domain, a curation flow may include 15 to 20 stages for handling 3 data sources and curating 15 concepts and their relationships.
Each stage in a curation flow can take, for example, from about 10 seconds to about 100 seconds to run. Additionally, each stage includes multiple rules and/or algorithms that extract, transform, and/or link data in a meaningful way to create knowledge that can be consumed by various applications.
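As a non-limiting illustration, a curation flow of this kind might be sketched as an ordered list of stages, each applying a rule to the output of the previous stage. The sketch below is an assumption made for illustration only: the names `Stage`, `run_flow`, `extract`, `transform`, and `link`, the record layout, and the reference table are all hypothetical and are not drawn from any particular curation system.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]


@dataclass
class Stage:
    """One step of a curation flow (hypothetical structure)."""
    name: str
    rule: Callable[[List[Record]], List[Record]]


def extract(records: List[Record]) -> List[Record]:
    # Keep only records that carry a non-empty company name.
    return [r for r in records if r.get("company", "").strip()]


def transform(records: List[Record]) -> List[Record]:
    # Normalize the company name: trim whitespace and upper-case it.
    return [{**r, "company": r["company"].strip().upper()} for r in records]


def link(records: List[Record]) -> List[Record]:
    # Link each record to an identifier from a made-up reference table.
    reference_ids = {"ACME CORP": "C001"}
    return [
        {**r, "company_id": reference_ids.get(r["company"], "UNLINKED")}
        for r in records
    ]


def run_flow(
    stages: List[Stage], records: List[Record]
) -> Tuple[List[Record], Dict[str, float]]:
    """Run the stages in order and record how long each one took."""
    timings: Dict[str, float] = {}
    for stage in stages:
        start = time.perf_counter()
        records = stage.rule(records)
        timings[stage.name] = time.perf_counter() - start
    return records, timings


FLOW = [
    Stage("extract", extract),
    Stage("transform", transform),
    Stage("link", link),
]

if __name__ == "__main__":
    raw = [{"company": " Acme Corp "}, {"company": ""}, {"company": "Globex"}]
    curated, timings = run_flow(FLOW, raw)
    print(curated)
    print(timings)
```

A real flow would have many more stages and far more elaborate rules per stage; the point of the sketch is only that each stage consumes the previous stage's output, so a defect introduced early propagates to every later stage.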
Given the complexity of large-scale data curation flows, identifying problems in the flows and maintaining data quality are challenging. Accordingly, improved systems and techniques for identifying issues in a large-scale data curation flow are needed.