A MapReduce framework commonly manipulates records by, for example, sorting map task outputs and merging sorted records from multiple map tasks. Challenges arise, however, in guaranteeing in-memory operation, which can be important for performance optimization. For example, some existing MapReduce implementations attempt to control the amount of memory used for storing the output of map tasks. If this amount of memory is sufficiently large, all output of a map task can be sorted in memory at the end of the task execution without involving input/output (I/O) operations on a physical disk. Otherwise, the map task incurs multiple “spills,” and external sorting is required. As used herein, a spill refers to the process of writing in-memory content to persistent storage (such as disks) to free up memory for new content.
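By way of illustration only, the difference between in-memory sorting and spilling followed by an external merge can be sketched as follows. This is a simplified model under stated assumptions, not the behavior of any particular MapReduce implementation: the function name `run_map_task` is hypothetical, the buffer capacity is counted in records rather than bytes, and in-memory lists stand in for spill files on persistent storage.

```python
import heapq


def run_map_task(records, buffer_capacity):
    """Sort map output, spilling a sorted run whenever the buffer fills.

    Returns (sorted_records, spill_count). Assumption: capacity is measured
    in records, and Python lists stand in for spill files on disk.
    """
    buffer, spills = [], []
    for record in records:
        if len(buffer) >= buffer_capacity:
            # Buffer is full: write a sorted run out to free memory.
            spills.append(sorted(buffer))
            buffer = []
        buffer.append(record)

    if not spills:
        # Everything fit in the buffer: one in-memory sort, no disk I/O.
        return sorted(buffer), 0

    if buffer:
        # Flush the remaining records as a final sorted run.
        spills.append(sorted(buffer))
    # External sort: k-way merge of the sorted runs.
    return list(heapq.merge(*spills)), len(spills)


map_output = [5, 3, 8, 1, 9, 2]
in_memory_result = run_map_task(map_output, buffer_capacity=10)
spilled_result = run_map_task(map_output, buffer_capacity=2)
```

In this sketch, both calls yield the same sorted output, but the second call reaches it only after writing several sorted runs and merging them, which is the extra I/O that a sufficiently large buffer avoids.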
Additionally, many existing approaches within the context of MapReduce systems size the memory used for storing map output using either a static value or a percentage of total available memory. However, with such approaches, the optimal static value or percentage value heavily depends on the specific application and/or input data. For example, the amount of data generated by a map task depends on the corresponding application logic and input data.
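As a non-limiting example, the following sketch shows why a single static value or percentage value may suit one application but not another. All sizes, the 25 percent setting, and the names `percentage_buffer_bytes` and `fits_in_memory` are hypothetical values and helpers chosen for illustration; they are not drawn from any existing system.

```python
MIB = 1024 * 1024


def percentage_buffer_bytes(total_memory_bytes, percent):
    """Buffer size as an integer percentage of total available memory."""
    return total_memory_bytes * percent // 100


def fits_in_memory(map_output_bytes, buffer_bytes):
    """True if the whole map output can be sorted without spilling."""
    return map_output_bytes <= buffer_bytes


# Hypothetical configuration values.
total_memory = 1024 * MIB
static_buffer = 100 * MIB
percent_buffer = percentage_buffer_bytes(total_memory, 25)  # 256 MiB

# Hypothetical map output sizes for two applications on the same input:
# one filters records (small output), one expands them (large output).
filtering_output = 64 * MIB
expanding_output = 512 * MIB

for name, output in [("filtering", filtering_output),
                     ("expanding", expanding_output)]:
    print(name,
          "static fits:", fits_in_memory(output, static_buffer),
          "percentage fits:", fits_in_memory(output, percent_buffer))
```

Under these assumed numbers, both settings are larger than necessary for the filtering application and too small for the expanding application, so neither setting is appropriate for both.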
Further, other existing approaches include offline modeling and/or offline tuning of memory parameters. Such approaches include using batched profiling to collect performance data over multiple operation iterations, with each iteration having a different configuration. By building models with the collected data, a guided configuration value can be determined. However, such an offline technique requires multiple operation iterations before a useful value can be determined, and the results cannot be reused across different applications and/or different input data.
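The cost of such offline tuning can be illustrated with the following simplified sketch. It is an assumption-laden stand-in rather than any existing tool: `profile_run` is a hypothetical cost model that counts spills for a given buffer capacity in records, and the sketch selects the best profiled candidate directly instead of building a model from the collected data.

```python
import math


def profile_run(map_output_records, buffer_capacity):
    """Hypothetical profiling result: number of spills for one full run."""
    if map_output_records <= buffer_capacity:
        return 0
    return math.ceil(map_output_records / buffer_capacity)


def offline_tune(map_output_records, candidates):
    """Profile one full run per candidate configuration and pick the best.

    Returns (best_capacity, runs_required). The best capacity is the
    smallest candidate that achieves the fewest spills.
    """
    results = {c: profile_run(map_output_records, c) for c in candidates}
    best = min(candidates, key=lambda c: (results[c], c))
    return best, len(results)


candidates = [256, 512, 1024, 2048]
small_job = offline_tune(1000, candidates)
large_job = offline_tune(5000, candidates)
```

In this sketch, every candidate configuration costs one complete run before any value is chosen, and the value chosen for the smaller job differs from the value chosen for the larger job, so the tuning effort must be repeated when the application or input data changes.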
Accordingly, a need exists for dynamic online tuning of memory in MapReduce systems.