Data provenance involves the management of metadata about the history, generation and transformation of data. Data provenance is of special importance in large data processing systems in which data is operated on and routed between networked processing components (PCs). In many situations it is important to verify the origins and causal factors of data produced by such a cascaded application of distributed PCs.
A given data element that has a value of interest might lead to a query about the provenance of that datum, perhaps to determine why the data element has a particular value, or why the element was generated in the first place. Such provenance queries can be difficult to compute for several reasons. First, it is often the case that a graph of networked processing components is dynamic. Links between the PCs may be added and removed over time and the PCs may be replaced according to changing processing needs. Such mutability implies that the processing path involved in the generation of a given data element, including the PCs and the associated streams or data elements, varies over time, and hence a mechanism is required for keeping track of the changes to the system.
A second difficulty with provenance queries is that processing networks often consist of a large set of PCs with a large set of stakeholders involved in the design, implementation and selection of the PCs. Given the many players involved in the creation and execution of a data processing network, maintaining a consistent design and implementation approach to the PCs becomes challenging, and responses to provenance queries may consequently be inconsistent across the network of components.
Finally, many of the processing systems operate on large volumes of data, generated by variable numbers of data streams. Given the high volume and data rates, it is essential that the provenance technologies impose low additional overhead on both the data storage and the processing complexity. For these three reasons, special attention is required to design a storage-efficient provenance management system that responds to provenance queries in a timely manner and yet is expressive enough to capture the many common kinds of dependencies typical in stream processing systems.
In a stream processing system, applications are deployed as a network of PCs, which perform various operations on input data elements in order to generate output data elements. These output data elements are referred to as the results of the stream processing system. Examples of input data elements include packets of audio data, email data, computer generated events, network data packets, or readings from sensors, such as environmental, medical or process sensors. Examples of transformations conducted by individual PCs deployed on a stream processing graph include parsing the header of a network packet, aggregating audio samples into an audio segment, performing speech detection on an audio segment, subsampling sensor readings, averaging the readings over a time window of samples, applying spatial, temporal or frequency filters to extract specific signatures from audio or video segments, etc. These PCs may produce results as a stream of output data elements, or may produce individual output data elements consumed by some external monitoring applications.
A stream-processing “application” in such stream-oriented systems consists of a network of PCs, where the stream of output data elements from one PC serves as the stream of input data elements to another PC. An application may thus be modeled as a directed graph, with each vertex of the graph representing a PC and the edges between vertices establishing the bindings between sources and sinks of streams of data. An example provenance query might then be to determine the sequence of PCs that generated a given result, such as, for example, a set of output data elements. Alternatively, another provenance query might be to additionally determine the specific set of upstream data elements (often a hierarchy of such elements), generated by an appropriate set of PCs lying upstream in the application processing graph, that gave rise to a given result, such as, for example, a set of output data elements.
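The directed-graph model described above can be illustrated with a minimal Python sketch. It is an assumption-laden illustration rather than a description of any particular system: the class name `ProcessingGraph`, its methods, and the PC names are all hypothetical. It answers only the first kind of query, namely which PCs lie upstream of a given PC, and does not attempt to identify individual upstream data elements.

```python
from collections import defaultdict, deque


class ProcessingGraph:
    """Directed graph: vertices are PCs, edges bind a source PC to a sink PC.

    Hypothetical illustration only; names are not taken from any real system.
    """

    def __init__(self):
        # Maps each sink PC to the set of PCs whose output streams feed it.
        self._upstream = defaultdict(set)

    def bind(self, source, sink):
        """Record that the output stream of `source` is an input of `sink`."""
        self._upstream[sink].add(source)

    def upstream_pcs(self, pc):
        """Return every PC that can contribute to the output of `pc`."""
        seen = set()
        queue = deque([pc])
        while queue:
            current = queue.popleft()
            for parent in self._upstream.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        return seen


# Hypothetical application: two independent processing chains.
graph = ProcessingGraph()
graph.bind("sensor_reader", "subsampler")
graph.bind("subsampler", "window_average")
graph.bind("audio_reader", "speech_detector")

ancestors = graph.upstream_pcs("window_average")
```

Here `ancestors` contains only the PCs on the sensor chain. Because the sketch stores a single static topology, it cannot by itself answer queries about results produced before a link was added or removed.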
The majority of the previous work on data provenance has fallen into two broad categories. Scientific and web-service workflows, including systems such as Karma, see, Y. L. Simmhan, B. Plale and D. Gannon, Performance Evaluation of the Karma Provenance Framework for Scientific Workflows, International Provenance and Annotation Workshop (IPAW), May 2006, and PreServ, see, P. Groth, M. Luck, L. Moreau, A protocol for recording provenance in service-oriented grids, Proc. of the 8th International Conference on Principles of Distributed Systems (OPODIS'04), December 2004, are designed to capture interactions among various components for data-driven scientific workflows, such as atmospheric sensing and genomic computing. Similarly, systems such as PASOA are designed for web services environments and focus purely on process provenance; specifically, they store the history of inter-component interactions, such as, for example, SOAP invocations, rather than the actual transformation of the datasets or the actual datasets consumed by a specific web service. A survey of various techniques for provenance in scientific environments is provided in Survey of Data Provenance in e-Science (SigMod). In general, all of the mechanisms for capturing provenance use logging and auditing mechanisms to track dependencies of entire streams rather than windows of data.
Some of the data provenance systems presented in SigMod use the annotation approach, whereby the system tracks all the provenance information for each data item separately and stores this as part of the metadata associated with each individual data item. Such an annotation approach is reasonable for scientific data sets, as many of the data items, such as, for example, astronomy observations or genetic sequences, are very large in size and the additional provenance-related information constitutes a very small overhead. In contrast, each individual element in a stream-based system is very small, and the volume of such elements is very large—this makes annotation-based systems impractical due to their prohibitive storage and per-element processing overhead.
Another approach to process provenance is described in the work of R. Bose, “A conceptual framework for composing and managing scientific data lineage”, 14th International Conference on Scientific and Statistical Database Management (SSDBM'02) pp. 15-19, which tries to find the creators of source data to verify copyrights. This is achieved by a conceptual framework that helps identify and assess basic lineage among system components. In summary, the existing techniques determine the provenance at the level of the streams, a coarse granularity.
Provenance techniques in File Systems and Databases, including approaches such as PASS, see, K. Muniswamy-Reddy, D. Holland, U. Braun and M. Seltzer, Provenance-Aware Storage Systems, Proc. of the 2006 USENIX Annual Technical Conference, June 2006, and LinFS, are typically annotation-based in that they associate provenance metadata with individual data items, such as files or DB records. As an example, PASS automatically stores the modification history of files, including information on the calling application, the file descriptor table, etc.
Another example of provenance in databases lies in the work in Y. Cui et al., “Practical Lineage Tracing in Data Warehouses,” in ICDE, 2000, on tracing the data lineage obtained by view-based transformations in relational databases. This work describes how the source data can effectively be reconstructed by ‘inverting the query’ that defines a derived view, when the operations fall in the ASPJ (Aggregate-Select-Project-Join) operator category.
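The idea of tracing lineage through a view definition can be sketched for the simplest aggregate case. The following Python fragment is only an illustration under stated assumptions, not the algorithm of the cited work: the table `readings`, the view `total_by_sensor`, and the function `lineage` are hypothetical, and the sketch assumes that the lineage of an aggregated row is the set of source rows falling in the same group.

```python
from collections import defaultdict

# Hypothetical source table of sensor readings.
readings = [
    {"sensor": "s1", "value": 4.0},
    {"sensor": "s2", "value": 10.0},
    {"sensor": "s1", "value": 2.0},
]


def total_by_sensor(rows):
    """Derived view: sum of `value` grouped by `sensor`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["sensor"]] += row["value"]
    return dict(totals)


def lineage(rows, sensor):
    """Source rows that contributed to the view row for `sensor`.

    Obtained by re-applying the view's grouping condition to the source,
    which is the sense in which the view definition is 'inverted' here.
    """
    return [row for row in rows if row["sensor"] == sensor]


view = total_by_sensor(readings)
s1_sources = lineage(readings, "s1")
```

Re-aggregating `s1_sources` reproduces the view row for `s1`. This works because the full source table is retained, which is a reasonable assumption for a data warehouse but not necessarily for a high-rate stream.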
There is some limited work on the topic of supporting provenance tracking in stream-based systems. One approach towards such provenance tracking was described in N. Vijayakumar et al., “Towards Low Overhead Provenance Tracking in Near Real-time Stream Filtering,” International Provenance and Annotation Workshop, 2006, which dynamically constructs a dependency tree from base streams to derived streams, where each derived stream is expressed as an adaptive filter over multiple base or derived streams. For each stream, dynamic provenance information is collected as a series of time-stamped events. That is, as and when a filter detects an “event”, it pushes a time-stamped record about the change to its stack. Later, when the provenance has to be retrieved, the provenance tree can be traversed and the stack at each node examined to determine the events that led to a derived event. This approach tries to associate provenance information at the stream level, rather than trying to establish specific dependencies between individual elements of derived streams and corresponding subsets of data from base streams. In particular, Vijayakumar et al. do not provide the notion of having a dependency function be explicitly specified for each output port of a PC, and do not describe how specific external state that affects the functional dependency can be tracked and used in the provenance derivation process.
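The stream-level bookkeeping described above can be sketched as follows. This is a rough Python illustration based only on the description in this section, not on the cited paper's implementation; the names `StreamNode` and `events_leading_to`, the stream names, and the event descriptions are all hypothetical.

```python
class StreamNode:
    """A base or derived stream with a stack of time-stamped event records."""

    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)  # streams this stream is derived from
        self.events = []              # stack of (timestamp, description)

    def record(self, timestamp, description):
        """Push a time-stamped record when the filter detects an event."""
        self.events.append((timestamp, description))


def events_leading_to(node, timestamp):
    """Traverse the dependency tree from `node` toward the base streams and
    collect every recorded event with a time stamp at or before `timestamp`."""
    collected = []
    pending = [node]
    visited = set()
    while pending:
        current = pending.pop()
        if current.name in visited:
            continue
        visited.add(current.name)
        for ts, description in current.events:
            if ts <= timestamp:
                collected.append((ts, current.name, description))
        pending.extend(current.parents)
    return sorted(collected)


# Hypothetical streams and events.
base_a = StreamNode("base_a")
base_b = StreamNode("base_b")
derived = StreamNode("derived", parents=[base_a, base_b])

base_a.record(1, "threshold raised")
derived.record(3, "alert")
base_b.record(5, "filter reset")

history = events_leading_to(derived, 3)
```

The result lists events per stream, ordered by time. Nothing in it identifies which individual input elements produced a given output element, which is the stream-level limitation noted above.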
The notion of a ‘dependency function’ has been defined in some other contexts related to provenance, notably for optimistic recovery from faults in a distributed multi-processor system. For example, U.S. Pat. No. 4,665,520 defines a method where each process (Pi) in a distributed system stores a set of messages (since the last commit) that other processes (Pj) might depend on. Only after Pj has committed and migrated to state Pj(t+1) will Pi remove the set of messages (defined in the set interval(Pi(t))). In case Pj fails, the system allows Pj to recreate its state by “replaying” the set of dependent messages (in the dependency vector) since the last commit. While this patent does define the notion of a “dependency function,” such a dependency function is used only to enable message replay between specific components.
Stream processing systems are characterized by high data rates, with each stream consisting of a set of events that are logically related and sequentially ordered. Unfortunately, simple application of an annotation-based approach or a process-based approach is not sufficient for streaming data systems. Due to the high data rates associated with streaming systems, the annotation approach is not sufficient because the large volume of data will require equally large volumes of provenance metadata. Due to the time-varying nature of the streaming systems, a static process-oriented approach will be unsatisfactory because a given processing component's linkages to other PCs may vary over time as changes in the network topology occur. Moreover, a process-oriented approach is insufficient to answer questions about the dependencies among the data elements themselves, which may be needed in many scenarios involving the automated processing of sensor data streams. Therefore, a novel hybrid provenance management system is needed that efficiently addresses the challenges of stream-oriented data processing systems.