Driven by the inevitable trend towards the cloud, more and more real-time in-memory computing applications are being served by large-scale parallel processing platforms (e.g., Hadoop). As a result, these platforms must employ distributed in-memory data storage systems to enable data sharing and exchange among different in-memory computation frameworks and jobs. Such storage systems form a large-scale distributed cache layer that sits between the in-memory computation frameworks/jobs and the persistent storage systems (e.g., Amazon S3 and HDFS).
However, in-memory storage is fundamentally subject to two cost issues, and how well they are addressed will largely determine the overall performance of future large-scale parallel processing platforms.

(1) Memory resource cost: In-memory data storage inherently occupies a large amount of memory capacity. This becomes increasingly significant as more memory-centric data processing tasks are migrated onto a single large-scale parallel processing platform, and it directly results in memory resource contention between the application layer and the underlying in-memory data storage layer. Despite the continued scaling of DRAM beyond the 20 nm node and the maturing of new low-cost memory technologies (e.g., 3D XPoint), the ever-increasing demand for memory capacity will keep memory one of the most expensive resources. Hence, it is highly desirable to minimize the memory capacity (and hence cost) overhead induced by in-memory data storage systems.

(2) Computation cost: Unlike a traditional buffer pool mechanism, in-memory data storage systems hold data in a storage-oriented format (e.g., JSON, Parquet, and ORC) rather than as in-memory objects. Therefore, moving data between the application layer and the in-memory storage layer requires data format conversion, which can incur significant computation cost. In addition, data compression, the most obvious option for reducing the memory capacity overhead of in-memory data storage, inevitably adds further computation cost. This directly results in computation resource contention between the application layer and the underlying in-memory data storage layer.
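The tension between the two costs can be illustrated with a minimal sketch. It is not the design of any particular storage system: JSON stands in for a storage-oriented format, zlib stands in for the compression step, and the helper names (`to_storage`, `from_storage`, `measure`) and the sample records are hypothetical. The sketch converts in-memory objects to bytes and back, with and without compression, and reports the stored size and the elapsed CPU-side time.

```python
import json
import time
import zlib


def to_storage(records, compress, level=6):
    """Convert in-memory objects into a storage-oriented byte string.

    JSON is an assumed stand-in for formats such as Parquet or ORC.
    """
    payload = json.dumps(records).encode("utf-8")
    return zlib.compress(payload, level) if compress else payload


def from_storage(blob, compressed):
    """Convert stored bytes back into in-memory objects."""
    payload = zlib.decompress(blob) if compressed else blob
    return json.loads(payload.decode("utf-8"))


def measure(records, compress):
    """Round-trip the records and return (stored size, seconds, restored objects)."""
    start = time.perf_counter()
    blob = to_storage(records, compress)
    restored = from_storage(blob, compress)
    elapsed = time.perf_counter() - start
    return len(blob), elapsed, restored


# Hypothetical sample data: 10,000 small records.
records = [{"id": i, "name": f"user{i}", "score": i % 100} for i in range(10000)]

raw_size, raw_time, raw_back = measure(records, compress=False)
zip_size, zip_time, zip_back = measure(records, compress=True)

print(f"uncompressed: {raw_size} bytes, {raw_time * 1e3:.2f} ms round trip")
print(f"compressed:   {zip_size} bytes, {zip_time * 1e3:.2f} ms round trip")
```

Even without compression, the round trip pays for format conversion in both directions. Enabling compression shrinks the stored bytes but adds compression and decompression work on top of that conversion; the actual timings depend on the data and the machine.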