1. Field of the Invention
The present invention generally relates to intermediate data handling in large-scale, data-intensive computation. More specifically, when a memory sensor detects that a disk cache is stressed, a hybrid mode permits intermediate data to be stored directly into memory, thereby temporarily bypassing the disk cache.
2. Description of the Related Art
The rapid growth of the Internet and World Wide Web has led to vast amounts of information becoming available online. Additionally, businesses and government organizations create large amounts of both structured and unstructured information, all of which potentially needs to be processed, analyzed, and linked.
Data-intensive computing is a class of parallel computing applications that use a data-parallel approach to process large volumes of data. The data is typically terabytes or petabytes in size, often referred to as Big Data, and data-intensive computing applications require large volumes of data and devote most of their processing time to I/O (input/output) and manipulation of data. In contrast, computing applications which devote most of their execution time to computational requirements are deemed compute-intensive.
Parallel processing approaches are generally classified as either compute-intensive or data-intensive. Compute-intensive application programs are compute bound: such applications devote much of their execution time to computational requirements, as opposed to I/O, and typically require relatively small volumes of data. Data-intensive applications are I/O bound or need to process large volumes of data, and such applications devote much of their processing time to I/O and to the movement and manipulation of data. Parallel processing of data-intensive applications typically involves partitioning or subdividing data into multiple segments which can be processed independently, using the same executable application program in parallel on an appropriate computing platform, and then reassembling the results to produce the completed output data.
Current data-intensive computing platforms typically use a parallel computing approach that combines multiple processors and disks in large commodity computing clusters connected by high-speed communications switches and networks. This arrangement allows the data to be partitioned among the available computing resources and processed independently to achieve performance and scalability based on the amount of data. A cluster can be defined as a type of parallel and distributed system which consists of a collection of inter-connected stand-alone computers working together as a single integrated computing resource. This approach to parallel processing is sometimes referred to as a “shared nothing” approach, since each node, consisting of a processor, local memory, and disk resources, shares nothing with other nodes in the cluster.
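By way of illustration only, the partition, process, and reassemble pattern described above can be sketched in Python as follows. The function names (partition, process_segment, run_data_parallel) and the squaring operation are hypothetical and chosen solely for this example. A thread pool stands in for the processing nodes of a cluster, so the sketch shows the structure of the approach rather than its performance.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, num_segments):
    """Split data into num_segments contiguous, near-equal segments."""
    size, extra = divmod(len(data), num_segments)
    segments, start = [], 0
    for i in range(num_segments):
        # The first `extra` segments each take one additional element.
        end = start + size + (1 if i < extra else 0)
        segments.append(data[start:end])
        start = end
    return segments


def process_segment(segment):
    """The same routine is applied to every segment (here: square each value)."""
    return [x * x for x in segment]


def run_data_parallel(data, num_segments=4):
    """Partition the data, process segments independently, reassemble results."""
    segments = partition(data, num_segments)
    with ThreadPoolExecutor(max_workers=num_segments) as pool:
        # pool.map preserves segment order, so reassembly is a concatenation.
        results = list(pool.map(process_segment, segments))
    return [value for part in results for value in part]
```

Because each segment is processed without reference to any other segment, the reassembled output is identical to the output of processing the whole data set sequentially.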
A variety of system architectures have been developed for data-intensive computing, including the MapReduce architecture pioneered by Google, now available in an open-source implementation called Hadoop used by Yahoo, Facebook, and others. The MapReduce architecture and programming model allows programmers to use a functional programming style to create a map function that processes a key-value pair associated with the input data to generate a set of intermediate key-value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. The system automatically handles details like partitioning the input data, scheduling and executing tasks across a processing cluster, and managing the communications between nodes, so programmers can easily use a large distributed processing environment even without having experience in parallel programming.
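As a minimal sketch of this programming model, the following Python example counts words using a map function and a reduce function. It is a single-process illustration under stated assumptions and does not use the Hadoop or Google APIs: the names map_fn, reduce_fn, and run_job are hypothetical, and the grouping of intermediate values by key, which a real system performs across the cluster, is done here with an in-memory dictionary.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: key is a document name (unused), value is the document text.

    Emits one intermediate (word, 1) pair per word occurrence.
    """
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share the same key."""
    return key, sum(values)


def run_job(inputs):
    """Apply map_fn to every input pair, group by intermediate key, reduce."""
    intermediate = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            intermediate[ikey].append(ivalue)
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())
```

In this sketch the programmer supplies only map_fn and reduce_fn; the run_job driver stands in for the partitioning, scheduling, and communication that the system handles automatically.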
FIG. 1 exemplarily shows a programming model 100 for the MapReduce architecture where a set of input key-value pairs associated with the input data is received 101 and a set of output key-value pairs 108 is ultimately produced. In the Map phase 102-105, the input data 101 is partitioned into input splits and assigned 103 to Map tasks associated with processing nodes in the cluster. The Map task 104 typically executes on the same node containing its assigned partition of data in the cluster. These Map tasks perform user-specified computations on each input key-value pair from the partition of input data assigned to the task, and generate a set of intermediate results 105 for each key.
The shuffle and sort phase 106, 107 then takes the intermediate data generated by each Map task, sorts this data with intermediate data from other nodes, divides this data into regions to be processed by the Reduce tasks, and distributes this data 106 as needed to nodes where the Reduce tasks will execute. The Reduce tasks 107 perform additional user-specified operations on the intermediate data, possibly merging values associated with a key to a smaller set of values, to produce the output data 108. For more complex data processing procedures, multiple MapReduce calls may be linked together in sequence.
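The shuffle and sort phase can be illustrated with the following Python sketch, offered under stated assumptions rather than as the implementation used by any particular system. The names region_for, shuffle_and_sort, and reduce_region are hypothetical, the choice of a CRC32 hash to assign keys to regions is an assumption made so that the assignment is deterministic, and all data is held in memory on one machine.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def region_for(key, num_reducers):
    """Assign a key to a region with a stable hash (an assumption of this sketch)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(map_outputs, num_reducers):
    """Divide intermediate pairs from all Map tasks into regions, then sort each.

    map_outputs holds one list of (key, value) pairs per Map task. The result
    holds one list per Reduce task, each entry being (key, [values...]).
    """
    regions = [[] for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            regions[region_for(key, num_reducers)].append((key, value))
    grouped = []
    for region in regions:
        # Sorting by key makes equal keys adjacent, as groupby requires.
        region.sort(key=itemgetter(0))
        grouped.append(
            [(key, [value for _, value in pairs])
             for key, pairs in groupby(region, key=itemgetter(0))]
        )
    return grouped


def reduce_region(region):
    """Reduce task: merge the values for each key in one region (here: sum)."""
    return {key: sum(values) for key, values in region}
```

Because every occurrence of a given key hashes to the same region, each Reduce task sees all intermediate values for the keys it owns, and the per-region outputs can be combined without conflict.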
Hadoop is an open source software project sponsored by The Apache Software Foundation, which implements the MapReduce architecture, and is fundamentally similar to the Google implementation except that the base programming language for Hadoop is Java instead of C++. Hadoop includes a distributed file system called HDFS, analogous to the GFS used in Google's MapReduce. The Hadoop implementation is intended to execute on clusters of commodity processors.
The present inventors have been investigating architectures commonly used for data-intensive applications that involve large amounts of data and are I/O bound, whether or not executed on a parallel platform, and have discovered a problem that is addressed by the concepts of the present invention.
More particularly, the present invention has resulted from testing and measurements of intermediate data handling involving a disk cache, which is a transparent buffer of disk-backed file pages kept in main memory (RAM) by the operating system for quicker access.
Following experiments that are further discussed below, the present inventors discovered that management of intermediate data, including the size of the disk cache, can play an important role in determining performance. These experiments have uncovered a need for improving the efficiency of intermediate data handling in large-scale, data-intensive computations, and the present invention provides one solution that addresses this need.
As noted above, although data-intensive applications are often associated in the art with parallel processing using a cluster of computers, the concepts of the present invention are not intended to be limited to parallel processing. Rather, these concepts are useful for any computing applications that tend to be I/O bound and/or involve large amounts of data.