Data-parallel application frameworks for large-scale applications, such as Hadoop, Storm, and Spark, process large volumes of data by partitioning the data among the nodes in a compute cluster. These frameworks expose a functional model to the application developer and manage the state information of the partitions internally. Because the application sees only this functional model, the system can account for node failures while executing an application by moving the affected partitions to a live node.
Programming models generally account for two broad but interrelated categories: data structures and algorithms. Data structures represent the model used to store and retrieve data, while algorithms represent the procedures that operate on that data. In each of the previously mentioned frameworks, the programming model exposes a rich interface for developing algorithms but only a very limited interface for data structures.
Hadoop, for example, allows any general algorithm that operates on a key-value pair, called a “map,” or on a key and a list of values, called a “reduce.” The implicit data structure in this model is commonly referred to as a “multimap.” Spark limits its capabilities to transformations that take a list of key-value pairs and produce another list of key-value pairs. Storm has no data storage capabilities at all.
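The map, reduce, and multimap roles described above can be illustrated with a minimal single-process sketch. It does not use the Hadoop API; the names `map_phase`, `reduce_phase`, `tokenize`, and `count` are hypothetical and chosen only for illustration, and the word-count task is an assumed example.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply mapper to each (key, value) pair and group its output by key."""
    grouped = defaultdict(list)  # the implicit multimap: key -> list of values
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            grouped[out_key].append(out_value)
    return grouped


def reduce_phase(grouped, reducer):
    """Apply reducer to each key together with its list of values."""
    return {key: reducer(key, values) for key, values in grouped.items()}


def tokenize(_line_no, line):
    # A "map": operates on a single key-value pair.
    for word in line.split():
        yield word, 1


def count(_word, ones):
    # A "reduce": operates on a key and a list of values.
    return sum(ones)


lines = [(0, "a b a"), (1, "b c")]
word_counts = reduce_phase(map_phase(lines, tokenize), count)
```

The only data structure the algorithm ever touches is the grouped key-to-values mapping, which is the limitation the paragraph above points out.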
What is lacking in each of these models is a general data structure, or set of data structures, that can be used for operations such as random access, array lookup, list iteration, and search, while exposing an interface that hides partition state information so that the system can manage failures.
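As a rough illustration of what such an interface might look like, the following single-process sketch offers random access and iteration while keeping partition contents and placement private. The class `PartitionedArray`, its `fail_node` method, and the node names are hypothetical, and the in-memory "reassignment" merely stands in for moving a partition to a live node; this is a sketch under those assumptions, not a description of any particular system.

```python
class PartitionedArray:
    """Hypothetical array whose elements are split into partitions on nodes."""

    def __init__(self, values, nodes, partition_size):
        self._partition_size = partition_size
        self._length = len(values)
        # Partition state (contents and placement) stays private to the class.
        self._partitions = [
            list(values[i:i + partition_size])
            for i in range(0, len(values), partition_size)
        ]
        self._owner = [nodes[i % len(nodes)] for i in range(len(self._partitions))]
        self._live = set(nodes)

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        # Random access: the caller never learns which partition is consulted.
        if not 0 <= index < self._length:
            raise IndexError(index)
        part, offset = divmod(index, self._partition_size)
        return self._partitions[part][offset]

    def fail_node(self, node):
        """Simulate a node failure by reassigning its partitions to live nodes."""
        self._live.discard(node)
        if not self._live:
            raise RuntimeError("no live nodes remain")
        survivors = sorted(self._live)
        for part, owner in enumerate(self._owner):
            if owner == node:
                self._owner[part] = survivors[part % len(survivors)]


arr = PartitionedArray(list(range(10)), ["n1", "n2", "n3"], partition_size=4)
before = arr[5]
arr.fail_node("n2")  # placement changes internally; the interface does not
after = arr[5]
```

Callers index and iterate over `arr` the same way before and after the simulated failure, which is the property the paragraph above asks for.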