The present application relates generally to an improved data processing apparatus and method and more specifically to mechanisms for caching a block of data in a multi-tenant cache storage device based on space usage boundary estimates.
The volume of data to be processed in the field of “Big Data” is growing at an unprecedented rate at the same time that analysis is becoming more computationally intensive. In order to support emerging distributed processing applications, extreme-scale memory and increased computational power are required. The complexity and computation needs of such applications lead to performance bottlenecks in conventional architectures.
To address these limitations, technologies have been developed to improve the speed with which distributed data processing may be performed. One such technology is Apache Spark. Apache Spark is a fast, in-memory data processing engine which provides Application Programming Interfaces (APIs) that allow data workers to efficiently execute streaming, machine learning, or Structured Query Language (SQL) workloads that require fast iterative access to datasets. Apache Spark may be run on Apache Hadoop YARN and is designed to make data science and machine learning easier to implement. Apache Spark consists of a Spark Core and a set of libraries, where the Spark Core is a distributed execution engine offering a platform for distributed extract, transform, load (ETL) application development.
A fundamental data structure of Apache Spark is the Resilient Distributed Dataset (RDD). Spark makes use of the concept of the RDD to achieve faster and more efficient MapReduce operations. An RDD is a resilient and distributed collection of records spread over one or more partitions. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster. Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through deterministic operations on either data in stable storage or other RDDs. An RDD is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs, i.e., parallelizing an existing collection or referencing a dataset in an external storage system, such as a shared file system, Hadoop Distributed File System (HDFS), Apache HBase, or any data source offering a Hadoop Input Format.
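The first creation path, parallelizing an existing collection into logical partitions, can be illustrated with a minimal sketch in plain Python. This is a toy model and not Apache Spark's API: the function name `parallelize`, its parameters, and the contiguous, near-equal slicing rule are assumptions chosen only for illustration.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def parallelize(records: Sequence[T], num_partitions: int) -> List[List[T]]:
    """Split an existing collection into contiguous logical partitions.

    Hypothetical helper for illustration; partition sizes differ by at most one.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    n = len(records)
    # Partition i holds the slice [i*n/p, (i+1)*n/p) of the input records.
    return [
        list(records[(i * n) // num_partitions:((i + 1) * n) // num_partitions])
        for i in range(num_partitions)
    ]


# Ten records spread over three logical partitions.
partitions = parallelize(list(range(10)), 3)
print(partitions)
```

In a cluster, each of these logical partitions could be assigned to a different node, so that the records in one partition are processed independently of the others.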
The RDD allows programmers to perform in-memory computations on large clusters in a fault-tolerant manner. An RDD is resilient in that it is fault-tolerant with the assistance of RDD lineage graphs and so is able to re-compute partitions that are missing or damaged due to node failures. An RDD is distributed in that its data resides on multiple nodes in a cluster, and it is a dataset because it comprises a collection of partitioned data with primitive values or values of values, e.g., tuples or other objects that represent records of the data. In addition to the above characteristics, an RDD also has the following characteristics, among others:
(1) the data inside an RDD is stored in memory as much and as long as possible;
(2) the data inside an RDD is immutable in that it does not change once created and can only be transformed using RDD transformations;
(3) all of the data inside an RDD is cacheable in a persistent memory or disk storage;
(4) the RDD allows for parallel processing of the data based on RDD partitioning; and
(5) the data, or records, inside an RDD are partitioned into logical partitions and distributed across nodes in a cluster, where the location of partitions may be used to define Apache Spark's task placement preferences.
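The immutability, caching, and lineage-based recomputation described above can be sketched with a small toy model in plain Python. This is not Apache Spark's implementation or API: the class `ToyRDD` and its methods `from_partitions`, `map`, `compute`, `evict`, and `collect` are hypothetical names introduced only for illustration, and the lineage here is a simple parent chain rather than a full lineage graph.

```python
class ToyRDD:
    """Hypothetical, simplified stand-in for an RDD (not Spark's API)."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source  # immutable tuples of records for a base dataset
        self._parent = parent  # lineage: the dataset this one was derived from
        self._fn = fn          # per-record function applied to the parent
        self._cache = {}       # partition index -> computed records

    @classmethod
    def from_partitions(cls, partitions):
        # Copy the input into tuples so the stored records cannot change.
        return cls(source=tuple(tuple(p) for p in partitions))

    @property
    def num_partitions(self):
        if self._parent is None:
            return len(self._source)
        return self._parent.num_partitions

    def map(self, fn):
        # A transformation returns a new dataset; this one is left unchanged.
        return ToyRDD(parent=self, fn=fn)

    def compute(self, index):
        # Serve from the cache when possible, else rebuild from lineage.
        if index in self._cache:
            return self._cache[index]
        if self._parent is None:
            result = self._source[index]
        else:
            result = tuple(self._fn(x) for x in self._parent.compute(index))
        self._cache[index] = result
        return result

    def evict(self, index):
        # Simulate losing a cached partition, e.g., due to a node failure.
        self._cache.pop(index, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.compute(i)]


base = ToyRDD.from_partitions([[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)
first = doubled.collect()        # computes and caches both partitions
doubled.evict(1)                 # partition 1 is lost
recomputed = doubled.compute(1)  # rebuilt from the parent via lineage
```

In this sketch only the lost partition is recomputed, and the base dataset is never modified by the transformation, which mirrors characteristics (2) and (3) in a highly simplified form.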