1. Technical Field
The present teaching relates to methods, systems and programming for data management. Particularly, the present teaching is directed to methods, systems, and programming for heterogeneous data management.
2. Discussion of Technical Background
Modern systems often comprise multiple semi-independent sub-systems. For example, three types of systems are commonly used. The first type is the so-called stacked system, in which higher-level abstractions are stacked upon lower-level systems. The second type is the pipelined system, in which data flows through a sequence of systems, e.g., a system for ingesting Really Simple Syndication (RSS) feeds, a system for processing the feeds, and then a system for indexing and serving the feeds via, e.g., a search interface. The third type is the side-by-side system. For example, fault-tolerant systems are usually side-by-side systems, i.e., two or more systems providing the same function may operate side-by-side at the same time. Side-by-side systems are often deployed during a migration period, in which responsibility is transferred gradually from one system to another to allow the new system to be vetted and fine-tuned. In another scenario, redundant systems are usually deployed in a permanent side-by-side configuration, with each one targeting a different point in some performance tradeoff space such as latency versus throughput.
Modularity in these forms facilitates the creation of complex systems, but can complicate operational issues, including monitoring and debugging of end-to-end data processing flows. Following a single RSS feed from beginning to end may require interacting with half a dozen sub-systems, each of which likely has different metadata and different ways of querying it. Solutions that rely on standardization efforts or deep code modifications are often cost-prohibitive and usually unrealistic, especially when third-party components are used.
Arguably the most complex type of metadata to manage is data provenance. A system that aims to integrate provenance metadata from multiple sub-systems frequently has to deal with the inherent nonuniformity and incompleteness of that metadata. To begin with, different sub-systems often represent data and processing elements at different granularities. For example, data granularities may range from tables (coarse-grained) to individual cells of tables (fine-grained), with multiple possible mid-granularity options such as rows versus columns versus temporal versions. Process descriptions also run the gamut from coarse-grained (e.g., an SQL query or Pig script) to fine-grained (e.g., one Pig operator in one retry attempt of one map task), with multiple ways to sub-divide mid-granularity elements, e.g., map and reduce phases versus Pig operations (which may span phases) versus parallel partitions.
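The data granularities described above can be pictured with a minimal Python sketch. The `DataElement` class, its field names, and the `contains` method are hypothetical and introduced only for illustration; they are not part of the present teaching. The sketch assumes a data element is identified by a table plus optional row, column, and version dimensions, where an unset dimension means "all values along that dimension".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataElement:
    """Hypothetical data element; a None field means 'all' along that dimension."""
    table: str
    row: Optional[str] = None
    column: Optional[str] = None
    version: Optional[int] = None

    def contains(self, other: "DataElement") -> bool:
        """Return True if `other` is this element or a finer-grained part of it."""
        if self.table != other.table:
            return False
        for dim in ("row", "column", "version"):
            mine = getattr(self, dim)
            # A dimension fixed here must be fixed to the same value in `other`.
            if mine is not None and mine != getattr(other, dim):
                return False
        return True


# Coarse-grained, mid-granularity, and fine-grained elements (illustrative names).
table = DataElement("movies")
row = DataElement("movies", row="inception")
column = DataElement("movies", column="release_date")
cell = DataElement("movies", row="inception", column="release_date")
```

Under these assumptions, a table contains its rows, columns, and cells, while a row and a column are incomparable mid-granularity elements: neither contains the other, although both contain the cell at their intersection.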
Moreover, links among processing and data elements sometimes also span granularities. For example, one system may record a link from each (row, column group, version) combination to an external source feed such as Rotten Tomatoes. One example is a recorded link for the latest release date and opening theater of the movie “Inception”. Furthermore, a sub-system frequently does not provide a complete view of its metadata, for example, because metadata recording may be enhanced over time as new monitoring and debugging needs emerge. Recording all metadata at the finest possible granularity sometimes imposes an unacceptable burden and performance overhead on both the system that produces the metadata and the system that captures and stores the metadata.
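A link recorded at the (row, column group, version) granularity can be sketched as follows. This is an illustrative assumption rather than a described implementation: the `COLUMN_GROUPS` layout, the `record_link` and `source_of_cell` helpers, and all identifiers and values are hypothetical. The sketch shows how a query about a single cell must first be mapped to the coarser granularity at which the link was recorded.

```python
from typing import Dict, NamedTuple, Optional


class LinkKey(NamedTuple):
    """Granularity at which the hypothetical sub-system records links."""
    row: str
    column_group: str
    version: int


# Assumed mapping from individual columns to column groups (illustrative only).
COLUMN_GROUPS: Dict[str, str] = {
    "release_date": "release_info",
    "opening_theater": "release_info",
    "title": "basic_info",
}

# Recorded links from (row, column group, version) to an external source feed.
links: Dict[LinkKey, str] = {}


def record_link(row: str, column_group: str, version: int, source: str) -> None:
    """Record that the given combination was derived from `source`."""
    links[LinkKey(row, column_group, version)] = source


def source_of_cell(row: str, column: str, version: int) -> Optional[str]:
    """Look up the source feed for one cell via its column group, if recorded."""
    group = COLUMN_GROUPS.get(column)
    if group is None:
        return None
    return links.get(LinkKey(row, group, version))


record_link("inception", "release_info", 3, "rotten_tomatoes")
```

In this sketch, a lookup for the `release_date` cell of row `inception` at version 3 resolves to the feed recorded for the `release_info` column group, whereas a lookup for a column group or version with no recorded link returns `None`, reflecting that a sub-system's metadata may be incomplete.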
Provenance metadata management has been studied in the database and scientific workflow literature, including the notion of offering provenance management as a first-class service, distinct from data and process management. However, most prior work on provenance has focused on tracking a single system's provenance metadata, and consequently has generally assumed that provenance metadata is rather uniform and/or can be tightly coupled to the data in one system. In practice, this is rarely the case. Therefore, there is a need to provide a framework for integrated management of provenance metadata that spans a rich, multi-dimensional granularity hierarchy.