The present invention relates to in-memory data analytics, and more specifically, this invention relates to improving computational efficiency of in-memory data analytics by providing a unified cache and checkpoint functionality.
In-memory data analytics is an increasingly popular framework for processing data and providing various Internet-based services such as streaming applications, machine learning systems, graph-based analytics, maintenance of structured query language (SQL) data sets and services, etc. As with any computational platform, the performance of these in-memory analytical frameworks is limited by the capabilities of the hardware components forming the framework, including the computational power of the processing nodes, storage capacity of memory, and input/output (I/O) bandwidth of various components.
For instance, any operation that requires transfer of data from local disk, remote memory, or remote disk to the local memory of a node, or even simply from local memory to in-process memory of the node processing the data, adds to the overall amount of time required to process a job.
Conventional approaches to in-memory processing have attempted to address this I/O detriment by using one or more caching techniques to manage the storage of data throughout processing. For example, compression/decompression techniques can reduce the memory footprint of the dataset, but introduce additional I/O and processing time to compress/decompress the dataset. In addition, when decompression results in a dataset too large to fit in the memory of a processing node, spill-over to disk cache introduces additional processing delay associated with the corresponding I/O.
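By way of illustration only, the compression trade-off described above may be sketched as follows. The sketch assumes a hypothetical partition of repetitive records and uses Python's standard `zlib` and `pickle` modules; the function names and the sample data are assumptions introduced for this example and do not correspond to any particular in-memory framework.

```python
import pickle
import time
import zlib


def compress_partition(records):
    """Serialize and compress a partition (hypothetical helper)."""
    return zlib.compress(pickle.dumps(records), 6)


def decompress_partition(blob):
    """Restore a partition; this step costs CPU time on every access."""
    return pickle.loads(zlib.decompress(blob))


# Hypothetical, highly repetitive sample data.
records = [f"user-{i % 10},click" for i in range(10_000)]
raw_size = len(pickle.dumps(records))

start = time.perf_counter()
blob = compress_partition(records)
restored = decompress_partition(blob)
elapsed = time.perf_counter() - start

# The compressed form occupies less memory, but the round trip adds time.
print(f"raw bytes: {raw_size}, compressed bytes: {len(blob)}")
print(f"compress + decompress time: {elapsed:.6f} s")
```

The compressed partition occupies less memory than the serialized original, but each access pays for decompression, and the decompressed partition must still fit in node memory to avoid spill-over to disk.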
Caching between local and/or remote memory and disk-based storage can also manage the amount of data stored in a particular location, but the associated transfer operations needed to accomplish processing in local node memory are severely detrimental to the overall processing time for the job.
In addition, data being processed by a particular node using caching are generally not persistent, and data losses may delay or defeat the completion of a processing job, e.g., in the event of a failure of the processing node or critical components thereof (such as the memory storing the data).
To provide persistence, a distributed filesystem including a plurality of disk-based storage modules may be employed, and a snapshot of the state of data at a particular point in time may be captured at predetermined intervals to allow recovery of data in the event of a failure. However, these conventional techniques rely on storing the snapshot to lower-performance (high-reliability) hardware in a distributed filesystem, and retrieving snapshot data from a storage system detrimentally adds significant I/O and thus processing delay to the overall task.
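By way of illustration only, the conventional snapshot approach described above may be sketched as follows. The sketch assumes a single node, a local file standing in for the distributed filesystem, and a trivial running sum standing in for the iterative computation; the interval, file name, and function names are all hypothetical.

```python
import os
import pickle
import tempfile

CHECKPOINT_INTERVAL = 3  # hypothetical: snapshot every 3 iterations


def save_snapshot(path, iteration, state):
    """Write the snapshot to a temporary file, then rename it into place."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"iteration": iteration, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())  # force the write to disk: the costly I/O step
    os.replace(tmp, path)


def load_snapshot(path):
    """Return (next_iteration, state), or (0, None) if no snapshot exists."""
    if not os.path.exists(path):
        return 0, None
    with open(path, "rb") as f:
        snap = pickle.load(f)
    return snap["iteration"], snap["state"]


def run(path, total_iterations, fail_at=None):
    """Iterative job that resumes from the most recent snapshot, if any."""
    start, state = load_snapshot(path)
    if state is None:
        state = 0
    for i in range(start, total_iterations):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated node failure")
        state += i  # stand-in for one iteration of real computation
        if (i + 1) % CHECKPOINT_INTERVAL == 0:
            save_snapshot(path, i + 1, state)
    return state


with tempfile.TemporaryDirectory() as workdir:
    snapshot_path = os.path.join(workdir, "snapshot.pkl")
    try:
        run(snapshot_path, 10, fail_at=7)
    except RuntimeError:
        pass  # work done after the last snapshot is lost
    result = run(snapshot_path, 10)  # resumes from the snapshot
    print(result)
```

In this sketch, a failure during iteration 7 loses the work done since the snapshot taken after iteration 6, and recovery must first read the snapshot back from storage before computation resumes. Both the periodic write and the recovery read are I/O performed in addition to the computation itself.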
For in-memory applications, especially those that function in whole or in part based on iteratively performing computations on a dataset, I/O and persistency are thus significant contributors to the functional performance of the in-memory processing system. Accordingly, it would be of great utility to provide systems, techniques, and computer program products capable of overcoming the traditional limitations imposed by conventional in-memory processing applications and improving the function of in-memory processing systems by enabling more computationally efficient I/O and data processing.