1. Technical Field
The present invention relates to data usage in a stream-processing system and more particularly to systems and methods which determine data usage based on provenance dependency information, which is employed to manage data retention.
2. Description of the Related Art
A stream-processing application can be described in the form of a dataflow graph, which includes application components called PEs (processing elements), interconnected by streams. A stream includes output data elements from one PE that serve as the stream of input data elements to another PE. An application may thus be abstractly modeled as a directed graph, with each vertex of the graph representing a PE and the edges between vertices establishing the bindings between sources and sinks of streams of data.
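By way of illustration only, the directed-graph model described above may be sketched as follows. The class name, method names and PE names are hypothetical and are chosen solely for this example; the sketch assumes that each PE is identified by a string and that a stream is represented simply as an edge from a source PE to a sink PE.

```python
from collections import defaultdict


class DataflowGraph:
    """Directed graph in which vertices are PEs and edges are streams.

    All names here are hypothetical and serve only to illustrate the model.
    """

    def __init__(self):
        self.downstream = defaultdict(set)  # PE -> PEs consuming its output stream
        self.upstream = defaultdict(set)    # PE -> PEs whose output stream it consumes

    def connect(self, source_pe, sink_pe):
        """Bind the output stream of source_pe to the input of sink_pe."""
        self.downstream[source_pe].add(sink_pe)
        self.upstream[sink_pe].add(source_pe)

    def ancestors(self, pe):
        """Return every PE lying upstream of pe in the graph."""
        seen, stack = set(), [pe]
        while stack:
            for parent in self.upstream[stack.pop()]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# A small linear application assembled from hypothetical PEs.
graph = DataflowGraph()
graph.connect("parse_header", "filter_samples")
graph.connect("filter_samples", "aggregate_audio")
graph.connect("aggregate_audio", "speech_detect")
```

In this sketch, calling graph.ancestors("speech_detect") returns the three PEs that lie upstream of the speech detection PE.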
PEs perform various operations on input data elements to generate output data elements. These output data elements are referred to as the results of the stream processing system. Examples of input data elements include packets of audio data, email data, computer generated events, network data packets, or readings from sensors, such as environmental, medical or process sensors. Examples of transformations conducted by individual PEs deployed on a stream processing graph include parsing the header of a network packet, filtering samples that are not relevant to the results being computed, aggregating audio samples into an audio segment or performing speech detection on an audio segment, sub-sampling sensor readings, averaging the readings over a time window of samples, applying spatial, temporal or frequency filters to extract specific signatures over the audio or video segments, etc. These PEs produce results as a stream of output data elements or may produce individual output elements consumed by some external monitoring applications.
Note that in such applications, it is typical that a large volume of input data is discarded as being irrelevant to the results being computed. For example, many sensor readings may report redundant readings or readings that indicate nothing abnormal and may be irrelevant to applications looking for abnormal events.
Stream-processing applications are run on stream-processing middleware that offers streaming services such as the interconnection of PEs and the shipping of data elements. In such systems, there is a causal or provenance dependency relationship between the input and output data of a PE. Usually this information is used to answer queries that determine the origins and transformations of data. In a streaming system context, an example provenance query might be to determine the sequence of data elements and the PEs that generated a given result, such as, for example, a set of output data elements. Another provenance query might additionally determine the specific set of data elements (often a hierarchy of upstream data elements), generated by an appropriate set of PEs lying upstream in the application processing graph, that generated a given result. Data provenance is of special importance in large data processing systems in which data is operated on and routed between networked processing elements (PEs). In many situations, it is important to verify the origins and causal factors of data produced by such a cascaded application of distributed PEs.
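By way of illustration only, a provenance query of the kind described above may be sketched as follows. The sketch assumes that, for each derived data element, a record of its direct input data elements and of the PE that produced it has been retained; the function name, the two dictionaries and the element and PE identifiers are all hypothetical.

```python
def provenance(element_id, derived_from, produced_by):
    """Return (upstream element ids, PEs involved) for element_id.

    derived_from maps an element id to the ids of its direct inputs.
    produced_by maps an element id to the PE that generated it.
    Both mappings are assumptions made for this illustration.
    """
    elements, pes = set(), set()
    stack = [element_id]
    while stack:
        current = stack.pop()
        if current in produced_by:
            pes.add(produced_by[current])
        for parent in derived_from.get(current, ()):
            if parent not in elements:
                elements.add(parent)
                stack.append(parent)
    return elements, pes


# Hypothetical records: three sensor readings are averaged, and the
# average then triggers an alert.
derived_from = {
    "avg1": ["r1", "r2", "r3"],
    "alert1": ["avg1"],
}
produced_by = {"avg1": "average_window", "alert1": "threshold_check"}

elements, pes = provenance("alert1", derived_from, produced_by)
```

Here the query on the alert yields the average and the three readings as upstream data elements, together with the two PEs that generated the derived elements. The sketch presumes that the dependency records for every intermediate element are available, which is the storage burden discussed in the paragraphs that follow.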
An additional characteristic of stream processing systems is that data processing occurs in successive steps as PEs perform incremental information extraction, throw away data that is irrelevant to the final application result, and progressively refine the data to finally compute the results. A given output data element, therefore, might have been derived from a small sample of the large volume of original data. A provenance query on a given output data element that has a value of interest might be to determine why the data element has a particular value, or why and how the element was generated in the first place.
Such provenance queries can be difficult to compute for several reasons. First, it is often the case that a graph of networked processing elements is dynamic. Links between the PEs may be added and removed over time and the PEs may be replaced according to changing processing needs. Such mutability implies that the processing path, including the PEs and the associated streams or data elements, involved in the generation of a given data element is subject to variation in time. Hence, a system is required that keeps track of the system changes and, based on those changes, determines which data is relevant to results.
Second, the PEs involved in the processing of data in an application are not aware of their downstream data consumers, which may evolve constantly. Hence, as PEs produce output data elements, they cannot predict which of their output data elements may be relevant to downstream processing elements. Traditional data processing systems conservatively store all the data produced by intermediate steps and apply the provenance dependency functions while answering provenance queries, to determine the relevant input data elements. This approach may be too expensive or infeasible in stream processing systems where streams are potentially endless.
Finally, many of the processing systems operate on large volumes of data, generated by variable numbers of data streams. Given the high volume and data rates, it is essential that the provenance technologies impose low additional overhead on both the data storage and the processing complexity.
For at least these three reasons, it would be advantageous to provide a method that can determine, during runtime, the relevance of any piece of data to the results produced, and a system that can manage data in a storage-efficient manner, to answer provenance and other data usage-based queries in such high-speed stream-processing systems.
The majority of the previous work on data provenance has fallen into two broad categories. Scientific and web-service workflows, including systems such as Karma, see, Y. L. Simmhan, B. Plale and D. Gannon, Performance Evaluation of the Karma Provenance Framework for Scientific Workflows, International Provenance and Annotation Workshop (IPAW), May 2006, and PreServ, see, P. Groth, M. Luck, L. Moreau, A protocol for recording provenance in service-oriented grids, Proc. of the 8th International Conference on Principles of Distributed Systems (OPODIS'04), December 2004, are designed to capture interactions among various components for data-driven scientific workflows, such as atmospheric sensing and genomic computing. Similarly, systems such as PASOA are designed for web services environments and focus purely on process provenance; specifically, they store the history of inter-component interactions, such as, for example, SOAP invocations, rather than the actual transformation of the datasets or the actual datasets consumed by a specific web service.
A survey of various techniques for provenance in scientific environments is provided in Survey of Data Provenance in e-Science (SigMod). In general, all of the mechanisms for capturing provenance use logging and auditing mechanisms to track dependencies of entire streams and also rely on the fact that the entire dataset can be stored. Some of the data provenance systems presented in SigMod use the annotation approach, whereby the system tracks all the provenance information for each data item separately and stores this as part of the metadata associated with each individual data item. Such an annotation approach is reasonable for scientific data sets, as many of the data items, such as, for example, astronomy observations or genetic sequences, are very large in size, and the additional provenance-related information constitutes a very small overhead.
In contrast, each individual element in a stream-based system is very small, the volume of such elements is very large and the streams are potentially endless. This makes annotation-based systems impractical due to their prohibitive storage and per-element processing overhead.
Another approach to process provenance is described in the work of R. Bose, “A conceptual framework for composing and managing scientific data lineage”, 14th International Conference on Scientific and Statistical Database Management (SSDBM'02), pp. 15-19, which tries to find the creators of source data to verify copyrights. This is achieved by a conceptual framework that helps identify and assess basic lineage among system components. In summary, the existing techniques determine the provenance at the coarse granularity of streams, rather than at the level of data.
Provenance techniques in File Systems and Databases, including approaches such as PASS, see, K. Muniswamy-Reddy, D. Holland, U. Braun and M. Seltzer, Provenance-Aware Storage Systems, Proc. of the 2006 USENIX Annual Technical Conference, June 2006, and LinFS, are typically annotation-based in that they associate provenance metadata with individual data items, such as files or database (DB) records and also rely on the fact that all the data can be stored. As an example, PASS automatically stores the modification history of files, including information on the calling application, the file descriptor table, etc.
There is some limited work on the topic of supporting provenance tracking in stream-based systems. One approach towards such provenance tracking was described in N. Vijayakumar et al., “Towards Low Overhead Provenance Tracking in Near Real-time Stream Filtering,” International Provenance and Annotation Workshop, 2006, which dynamically constructs a dependency tree from base streams to derived streams, where each derived stream is expressed as an adaptive filter over multiple base or derived streams. For each stream, dynamic provenance information is collected as a series of time-stamped events. That is, as and when a filter detects an “event”, it pushes a time-stamped record about the change to its stack. Later, when the provenance has to be retrieved, the provenance tree can be traversed followed by the stack to determine the events that led to a derived event. This approach tries to associate provenance information at the stream-level, rather than trying to establish specific dependencies between individual elements of derived streams and corresponding subsets of data from base streams.
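By way of illustration only, the stream-level approach described above may be sketched as follows. The sketch is not the implementation of the cited work; it merely assumes a dependency tree in which each stream keeps a stack of time-stamped event records, and all class, function and stream names are hypothetical.

```python
class StreamNode:
    """A base or derived stream in a hypothetical dependency tree."""

    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)  # streams this stream is derived from
        self.events = []              # stack of (timestamp, description) records

    def record(self, timestamp, description):
        """Push a time-stamped record about a detected event."""
        self.events.append((timestamp, description))


def stream_provenance(node, until):
    """Walk the dependency tree from node toward the base streams and
    collect every recorded event whose timestamp is not later than until."""
    result, seen, stack = [], set(), [node]
    while stack:
        current = stack.pop()
        if current.name in seen:
            continue
        seen.add(current.name)
        result.extend(
            (ts, current.name, text) for ts, text in current.events if ts <= until
        )
        stack.extend(current.parents)
    return sorted(result)


base = StreamNode("base")
derived = StreamNode("derived", parents=[base])
base.record(1, "rate change")
derived.record(2, "filter fired")
base.record(5, "later change")

history = stream_provenance(derived, until=3)
```

The result lists the events recorded on the derived stream and on its base stream up to the requested time. It identifies which streams and events contributed, but not which individual data elements of the base stream a particular derived element depends on.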
The notion of a ‘dependency function’ has been defined in some other contexts not related to provenance, notably for optimistic recovery from faults in a distributed multi-processor system. For example, U.S. Pat. No. 4,665,520 defines a method where each process (Pi) in a distributed system stores a set of messages (since the last commit) that other processes (Pj) might depend on. Only after Pj has committed and migrated to state Pj (t+1) will Pi remove the set of messages (defined in the set interval (Pi(t))). In case Pj fails, the system allows Pj to recreate its state by “replaying” the set of dependent messages (in the dependency vector) since the last commit. In U.S. Pat. No. 4,665,520, a dependency function is used only to enable message replay between specific components. A similar mechanism for application recovery from failures is also presented in “High-Availability Algorithms for Distributed Stream Processing”, by Jeong-Hyon Hwang, M. Balazinska, A. Rasin, U. Cetintemel, M. Stonebraker and S. Zdonik, at ICDE 2005, Tokyo, Japan. In Hwang, upstream PEs hold data elements that they forward to downstream PEs. As and when the downstream PE acknowledges that it has processed the data elements, the upstream PE drops them. If the downstream PE fails and recovers, the upstream PE plays back the unacknowledged data so that the downstream PE can recover its state.
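By way of illustration only, the hold, acknowledge and replay behavior described for Hwang may be sketched as follows. The sketch is not taken from the cited work; the class and method names are hypothetical, and it assumes that elements carry increasing sequence numbers and that an acknowledgment covers every element up to and including the acknowledged sequence number.

```python
from collections import OrderedDict


class DownstreamPE:
    """Hypothetical downstream PE whose state is the list of elements received."""

    def __init__(self):
        self.state = []

    def receive(self, seq, element):
        self.state.append((seq, element))


class UpstreamPE:
    """Hypothetical upstream PE that holds elements until they are acknowledged."""

    def __init__(self):
        self.unacked = OrderedDict()  # seq -> element, kept in send order
        self.next_seq = 0

    def send(self, element, downstream):
        """Forward an element downstream and hold a copy of it."""
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = element
        downstream.receive(seq, element)
        return seq

    def acknowledge(self, seq):
        """Drop every held element up to and including seq (assumed cumulative)."""
        for held in [s for s in self.unacked if s <= seq]:
            del self.unacked[held]

    def replay(self, downstream):
        """Play back all unacknowledged elements to a recovered downstream PE."""
        for seq, element in self.unacked.items():
            downstream.receive(seq, element)


up, down = UpstreamPE(), DownstreamPE()
for item in ("a", "b", "c"):
    up.send(item, down)
up.acknowledge(0)           # downstream has processed element 0 only

recovered = DownstreamPE()  # downstream fails and restarts with empty state
up.replay(recovered)
```

After the replay, the recovered downstream PE holds the two unacknowledged elements. The data is retained only to support recovery between these two components, not to answer queries about which inputs a given result depends on.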
Stream processing systems are characterized by high data rates, by streams that each consist of a set of events that are logically related and sequentially ordered, and by the fact that a large portion of the input data is irrelevant to the final output produced. Most techniques presented to date assume either that all the data can be stored, in which case an annotation-based approach is used, or, in cases where the data cannot all be stored, they resort to a process-oriented approach, where only the stream-level relationships are stored. In high-speed stream processing systems, it is not practical to store all the data, and a process-oriented approach is insufficient to answer questions about the dependencies among the data elements themselves.