In many common configurations, big data analytics does not operate on the entire dataset, but on a subset or derivation of it. Most often, the analysis relies on views of the data (subsets of the entire dataset) that are created in advance. These views are typically created as part of a pre-processing data ingestion step referred to as Extract, Transform, Load (ETL).
ETL is the process in which the original dataset is read; the data relevant to the analytic process is extracted, cleaned, and transformed to best fit that process; and the result is loaded into the analytics processing system. While such a process helps the analysis run efficiently, it introduces a space overhead for keeping the derived data. Moreover, the process is repeated for each type of analytical process. An example of such a process entails reading the complete dataset, filtering only the necessary data, and copying it to a new file or storage location. As such, creating a view consumes time and memory, and storing it requires extra storage space, since the data is essentially duplicated.
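The read, filter, and copy pattern described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function `build_view`, the sample records, and the field names are hypothetical and are not taken from any particular ETL system.

```python
# Hypothetical sketch of ETL-style view creation: every name here is illustrative.

def build_view(source_rows, keep, transform):
    """Scan the whole source, keep matching rows, and copy them into a new list."""
    view = []
    for row in source_rows:               # the complete dataset is read
        if keep(row):                     # extract: filter only the necessary data
            view.append(transform(row))   # transform, then load (a new copy)
    return view


dataset = [
    {"user": "a", "country": "US", "amount": "10.5"},
    {"user": "b", "country": "DE", "amount": "3.0"},
    {"user": "c", "country": "US", "amount": "7.25"},
]

# One view per analytical process: each call rescans the source and stores a copy.
us_view = build_view(
    dataset,
    keep=lambda r: r["country"] == "US",
    transform=lambda r: {"user": r["user"], "amount": float(r["amount"])},
)
de_view = build_view(
    dataset,
    keep=lambda r: r["country"] == "DE",
    transform=lambda r: {"user": r["user"], "amount": float(r["amount"])},
)
```

Each view is a separate object holding its own copy of the selected values, so the storage cost grows with the number of views even though no new information has been added.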
Storage class memory (SCM) is a family of persistent, byte-addressable memory technologies that offer low latency and high capacity. A growing number of technologies belong to this family, including flash-backed DRAM, 3D XPoint, MRAM, ReRAM, and others.
Data reduction technologies are used to reduce storage needs. Compression uses in-text back pointers or dictionary table references to refer to repeating strings. De-duplication uses pointers to large repeated strings (typically blocks). Both compression and de-duplication have drawbacks that result from the granularity of the repeating patterns they can detect. Furthermore, compression techniques are limited in how far back a reference can reach, whereas de-duplication needs a large enough repeated text to be efficient.
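The granularity drawback of de-duplication can be illustrated with a minimal sketch of fixed-size block de-duplication. The 8-byte block size, the function names, and the sample data are assumptions chosen for illustration; production systems use far larger blocks and often content-defined chunking.

```python
import hashlib

BLOCK_SIZE = 8  # assumption: unrealistically small, chosen so the example is readable


def deduplicate(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into fixed-size blocks and store each distinct block once.

    Returns (store, recipe): store maps a block digest to the block bytes,
    and recipe is the ordered list of digests needed to rebuild the data.
    """
    store = {}
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)   # keep only the first copy of a block
        recipe.append(digest)             # later copies become pointers
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original bytes by following the pointers in the recipe."""
    return b"".join(store[digest] for digest in recipe)


data = b"ABCDEFGH" * 4 + b"12345678"
store, recipe = deduplicate(data)

# Inserting a single byte at the front shifts every block boundary.
shifted = b"x" + data
shifted_store, shifted_recipe = deduplicate(shifted)
```

In this sketch `data` splits into five blocks of which only two are distinct, while the shifted copy splits into six blocks of which four are distinct. The content is almost identical, but repetition is found only where it lines up with block boundaries.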
In some systems, such as Apache Spark, a data lineage relationship is maintained. In a data lineage relationship, a dataset is described by its parent dataset together with transformation metadata. When the dataset is used (e.g., for output), it is computed by reading the parent data and executing the transformations described in the transformation metadata. The central drawback is that such a lineage is not persistent and cannot be reused for additional computations. If it is persisted, it is kept as an entirely new object, or as a snapshot of the memory or cache mechanism, which requires additional storage space. Additionally, the lineage process typically requires reading the entire parent dataset or, at the very least, a large and sequential portion of it.
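The lineage relationship described above can be sketched in plain Python as follows. This is not Spark's API: the `Dataset` class, its methods, and the `records_read` counter are hypothetical and exist only to show that a derived dataset stores a parent reference plus a transformation, and that each use rereads the parent.

```python
class Dataset:
    """Hypothetical lineage-style dataset: a root holds data, a child holds
    only a reference to its parent and a transformation to apply."""

    def __init__(self, source=None, parent=None, transform=None):
        self.source = source        # set only on the root dataset
        self.parent = parent        # set only on derived datasets
        self.transform = transform  # transformation metadata (a callable here)
        self.records_read = 0       # counts reads; only updated on the root

    def filter(self, predicate):
        return Dataset(parent=self,
                       transform=lambda rows: (r for r in rows if predicate(r)))

    def map(self, fn):
        return Dataset(parent=self,
                       transform=lambda rows: (fn(r) for r in rows))

    def _rows(self):
        if self.parent is None:
            for record in self.source:
                self.records_read += 1   # every record of the parent is read
                yield record
        else:
            yield from self.transform(self.parent._rows())

    def collect(self):
        """Materialize the dataset by replaying the lineage from the root."""
        return list(self._rows())


root = Dataset(source=list(range(10)))
evens_squared = root.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = evens_squared.collect()
second = evens_squared.collect()   # nothing was kept, so the root is read again
```

After the two `collect()` calls, `root.records_read` is 20: the full ten-record parent was scanned twice to produce the same five-record result. Avoiding the second scan in this sketch would require storing the result as a separate object.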
Therefore, a system is needed that, when SCM is the underlying storage, reduces the amount of storage and memory used to maintain derived datasets and reduces the time needed to create views.