Extracting useful information and value efficiently and cost-effectively from already available data (stored or live feed) is fast becoming a key growth strategy for organizations of various sizes and with various end goals. Especially for large business enterprises with access to huge amounts of data (“Big Data”), data analytics is emerging as one of the most important actionable considerations for marketing and for current and future business development.
Data can be processed in real time, as the data stream becomes available, or in batches (i.e., as an event). Traditionally, real-time processing and batch processing have had very different processing goals, resulting in very different processing systems and outcomes. Traditional stream-processing systems (e.g., Twitter Storm, Berkeley River) focus on a per-event processing framework. The assumption that events are not batched simplifies decisions about how and when to perform processing. However, this assumption does not work well with larger batch processing systems (e.g., Hadoop MapReduce or data warehousing systems). Therefore, a key disadvantage of existing methods is that users have to maintain two different systems, one for processing real-time and/or near-real-time data and one for processing batch data, and must devise ways of integrating them manually or semi-manually.
Existing event/batch processing systems (e.g., data warehouses, Hadoop) offer minimal or zero support for managing the data being processed. Examples of missing management features include, but are not limited to, data retention and expiration, inherited security, and access auditing. The difficulty these systems often face is that they separate the concepts of data storage and management from the data processing layer. In other words, data passing through the storage and management layers loses its inherent provenance and associated management policy.
A handful of systems exist for tracking data provenance (e.g., Harvard PASS); however, these systems tend to be storage-centric and therefore may not be the most suitable for real-time processing.
Prior methods typically either try to manipulate the data through the processing layer using some kind of opaque cookie, or try to determine data origins using post-hoc analysis of processing behavior. The disadvantage of both approaches is a loss of provenance accuracy, accepted as a trade-off for dealing with arbitrary computations. Therefore, what is needed is a completely accurate picture of data provenance, enabling a wide array of data management features. A system providing this may focus on a more limited set of data processing computations, but techniques are required for increasing the overall efficiency of data processing workflow management.