1. Technical Field
The present invention relates to data management and query support in data analysis and, more particularly, to techniques for optimizing response time of queries about provenance of data elements that result from the analysis and transformation of input data streams.
2. Description of the Related Art
Data provenance involves the management of metadata about the history, generation and transformation of data. Data provenance is of special importance in large data processing systems in which data is operated on and routed between networked processing elements (PEs). The PEs in a stream processing system perform various operations on input data elements to generate output data elements. These output data elements are referred to as the results of the stream processing system. Examples of input data elements include packets of audio data, email data, computer generated events, network data packets, or readings from sensors, such as environmental, medical or process sensors. Examples of transformations conducted by individual PEs deployed on a stream processing graph include parsing a header of a network packet, aggregating audio samples into an audio segment, performing speech detection on an audio segment, subsampling sensor readings, averaging the readings over a time window of samples, and applying spatial, temporal, or frequency filters to extract specific signatures from audio or video segments. The PEs may produce results as a stream of output data elements or as individual output elements consumed by external monitoring applications.
Data provenance applied to stream processing systems involves verification of the origins and causal factors of data produced by the system's PEs. A given data element that has a value of interest might lead to a query about the provenance of that datum, perhaps to determine why the data element has a particular value, or why the element was generated in the first place. The provenance query response requires an analysis of all upstream PEs and data consumed and generated by the upstream PEs, on which the datum of interest is dependent. Given the high data throughput of stream processing systems, a key challenge with managing provenance is the minimization of provenance query response times.
The standard approach for responding to provenance queries is to perform provenance function backtracing. In provenance function backtracing, each PE in a graph of processing elements maintains a provenance function that maps a given output event to a set of input events. When a query about a given output event occurs, the provenance function associated with the PE that generated the event is used to determine the precipitating input events. Once these input events have been identified, the provenance functions of the upstream analysis components that generated the input events are used to determine further upstream events that are indirectly related to the given output event. This process is repeated recursively until all relevant events have been identified.
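The backtracing procedure described above can be sketched as follows. This is a minimal illustration rather than an implementation taken from any particular system: the `ProcessingElement` class, the `(producer name, sequence number)` event encoding, and the two example provenance functions are all assumptions made for the example, and the recursion is expressed with an explicit stack.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

# Assumed encoding: an event is (name of the PE that produced it, sequence number).
Event = Tuple[str, int]


@dataclass
class ProcessingElement:
    name: str
    # Hypothetical provenance function: maps an output event to its input events.
    provenance: Callable[[Event], List[Event]]


def backtrace(event: Event, pes: Dict[str, ProcessingElement]) -> Set[Event]:
    """Collect every upstream event reachable from `event`."""
    found: Set[Event] = set()
    pending = [event]
    while pending:
        current = pending.pop()
        producer = pes.get(current[0])
        if producer is None:
            # Event came from a raw source: nothing further upstream.
            continue
        for parent in producer.provenance(current):
            if parent not in found:
                found.add(parent)
                pending.append(parent)
    return found


# Illustrative graph: source "src" -> PE "avg" -> PE "alert".
pes = {
    # Each "avg" output n depends on three consecutive source events.
    "avg": ProcessingElement(
        "avg", lambda e: [("src", i) for i in range(e[1] * 3, e[1] * 3 + 3)]
    ),
    # Each "alert" output n depends on "avg" outputs n and n + 1.
    "alert": ProcessingElement(
        "alert", lambda e: [("avg", e[1]), ("avg", e[1] + 1)]
    ),
}

result = backtrace(("alert", 0), pes)
```

In this sketch, a query about `("alert", 0)` visits two `avg` events and, through them, six `src` events, so the result holds eight upstream events.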
Several points about provenance functions are worth noting. Most notably, provenance functions are distinct from the operations that a processing element performs on its input data streams: a provenance function maps an output data element to a set of input data elements. Like PE operations, provenance functions can be mathematical functions and not simply relations. The fact that PE operations may not be functions and, more specifically, may not be invertible functions is a key motivation for provenance functions. Note further that, while it is implicitly understood that PE operations are specified by the author of a PE, this may or may not be the case for the provenance function associated with that PE. A provenance function may be specified by the corresponding PE author, it may be specified by an author not responsible for the corresponding PE, or it may be automatically generated using various techniques. These characteristics of provenance functions imply that a given output data element may be deterministically mapped to a specific set of input data elements during a provenance query event, even though the corresponding PE operation may be non-invertible or even stochastic.
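The distinction between a PE operation and its provenance function can be illustrated with a small sketch. The fixed window size, the non-overlapping (tumbling) windows, and the function names are assumptions chosen for the example. The PE operation is a mean, which is not invertible, while the provenance function maps an output sequence number deterministically to the input sequence numbers it consumed.

```python
from typing import List

WINDOW = 4  # assumed fixed, non-overlapping window size


def pe_operation(window: List[float]) -> float:
    """PE operation: mean of a window. Many different windows share one mean."""
    return sum(window) / len(window)


def provenance_function(output_seq: int) -> List[int]:
    """Map an output sequence number to the input sequence numbers it consumed."""
    start = output_seq * WINDOW
    return list(range(start, start + WINDOW))


# Two different input windows produce the same output value, so the
# operation cannot be inverted to recover its inputs.
same_output = pe_operation([1.0, 2.0, 3.0, 4.0]) == pe_operation([2.5, 2.5, 2.5, 2.5])

# The provenance function still identifies which inputs fed output number 2.
inputs_of_output_2 = provenance_function(2)
```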
Provenance function backtracing can result in very inefficient provenance query responses. As described above, provenance functions map output events of a given PE to a set of input events for that PE. Given the time ordered nature of streaming data systems, the set of input events mapped to by a provenance function is referred to as a provenance input window. Due to the characteristics of provenance functions, as outlined above, the provenance input window may be conservatively specified such that only a small portion of the data contained within the window is directly relevant to the corresponding output event. The relevancy ratio is defined as the ratio of the relevant provenance window data count to the provenance window size, where the window size is the cardinality of the set of data events contained in the window. When the relevancy ratio of a provenance window is very small, a provenance query must search an unnecessarily large space of data events, and in the worst case this search space grows exponentially as the query traces upstream.
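The relevancy ratio can be written compactly. The symbols $W$, $R$ and $\rho$ are introduced here only for illustration and are not part of the original terminology:

```latex
\rho = \frac{|R|}{|W|}, \qquad R \subseteq W, \qquad 0 < |W|, \qquad 0 \le \rho \le 1
```

Here $W$ is the provenance input window for a given output event, $R$ is the subset of events in $W$ that are directly relevant to that output event, and $|\cdot|$ denotes set cardinality. A value of $\rho$ near 1 means that nearly every event in the window is relevant, while a value near 0 means that most of the window is searched needlessly.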
The degree of inefficiency of a provenance query depends both on the specification of the provenance function and on the statistics of the input data with respect to that specification. Consider an example scenario in which a processing element consumes a single input stream of real-valued data and produces an output event whose value is the average of the last ten input events with values greater than or equal to 50. If most input events in the stream have values of 50 or more, then on average the relevancy ratio will be high for each input window. If most input events are below 50, then on average the relevancy ratio will be low for each input window.
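The scenario above can be made concrete with a short sketch. It assumes that the conservative provenance input window is the smallest contiguous run of inputs, ending at the most recent input, that contains ten qualifying events. The constant names, helper functions and the two synthetic streams are hypothetical and serve only to show how the input statistics drive the relevancy ratio.

```python
from typing import List, Tuple

THRESHOLD = 50.0  # qualifying inputs have values >= THRESHOLD
NEEDED = 10       # number of qualifying inputs averaged per output


def provenance_window(stream: List[float]) -> Tuple[int, int]:
    """Return (start, end) indices of the smallest suffix of `stream`
    that contains NEEDED values >= THRESHOLD."""
    count = 0
    for i in range(len(stream) - 1, -1, -1):
        if stream[i] >= THRESHOLD:
            count += 1
            if count == NEEDED:
                return i, len(stream) - 1
    raise ValueError("fewer than NEEDED qualifying events in stream")


def relevancy_ratio(stream: List[float]) -> float:
    """Relevant event count divided by provenance window size."""
    start, end = provenance_window(stream)
    return NEEDED / (end - start + 1)


# Every input qualifies: the window is exactly the last ten events.
high_ratio = relevancy_ratio([75.0] * 30)

# One qualifying input followed by nine non-qualifying ones, repeated ten
# times: the window must span all 100 events to collect ten relevant ones.
low_ratio = relevancy_ratio(([80.0] + [10.0] * 9) * 10)
```

With these synthetic streams, `high_ratio` is 1.0 and `low_ratio` is 0.1.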
To further refine the example, assume a relevancy ratio of 1%. In this case, backtracing through a single processing element would produce, on average, input windows containing 1000 data events, of which only 10 are directly relevant to a given output event. In a worst case scenario, as backtracing continues recursively upstream, this inefficiency will expand exponentially. Such inefficiencies result in slow provenance query response times, since the space of data elements that must be searched to determine the provenance of a given output data event is unnecessarily large. Solutions that avoid this inefficiency are needed.
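The worst-case growth can be illustrated numerically. The sketch below assumes a chain of identical PEs in which every event in a window, relevant or not, must itself be backtraced to its own disjoint window of the same size. This is a simplifying assumption for the arithmetic, not a property stated for any particular system, and the function name is hypothetical.

```python
from typing import Tuple


def search_space(window_size: int, relevant: int, depth: int) -> Tuple[int, int]:
    """Return (events examined, events actually relevant) after
    backtracing `depth` PEs upstream under the worst-case assumption."""
    examined = sum(window_size ** d for d in range(1, depth + 1))
    needed = sum(relevant ** d for d in range(1, depth + 1))
    return examined, needed


# Relevancy ratio of 1%: windows of 1000 events, 10 of them relevant.
one_hop = search_space(1000, 10, 1)
three_hops = search_space(1000, 10, 3)
```

Under these assumptions, one hop examines 1000 events to find 10 relevant ones, while three hops examine 1,001,001,000 events to find 1,110.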
A significant amount of related work exists on providing solutions for infrastructures that manage provenance data. Such related work considers the best way to store provenance information independently of optimizing the response time of data provenance queries. Rather, much of the previous work on data provenance focuses on whether provenance information should be stored as annotations attached to the appropriate data elements (see, e.g., K. Muniswamy-Reddy, D. Holland, U. Braun and M. Seltzer, Provenance-Aware Storage Systems, Proc. of the 2006 USENIX Annual Technical Conference, June 2006) or alternatively whether provenance information should be encoded in the processing elements of the data processing system (see, e.g., R. Bose, “A conceptual framework for composing and managing scientific data lineage”, 14th International Conference on Scientific and Statistical Database Management, SSDBM'02, pp. 15-19).
Prior systems do not teach how to store and manage the input data elements that were responsible for producing certain final output elements/events so that data provenance queries can be answered efficiently in a stream processing system. The problem of efficiently querying for provenance information is not addressed. Also, no technique is disclosed or suggested for efficient storage and retrieval of data provenance information for analytic methods whose output elements/events depend on a subset of the input data elements that satisfy certain characteristics.