In most modern enterprises, it is important to analyze large amounts of data quickly and efficiently. One analysis tool is the map-reduce style of program, which includes a map-phase, a shuffle-phase and a reduce-phase. In one example, in the map-phase, a primary node divides the input data (the problem) into subsets and distributes the subsets to processing nodes, wherein each processing node computes an intermediate output. Between the map-phase and the reduce-phase, in the shuffle-phase, the intermediate data is shuffled (sorted and exchanged between nodes) so that each piece of data moves to the node that reduces it. In the reduce-phase, the processing nodes combine the results for all the subsets to form an output that is the result (the answer) for the input data.
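The three phases can be illustrated with a minimal single-process sketch. It is not drawn from any particular map-reduce framework: the word-count task, the function names (map_phase, shuffle_phase, reduce_phase, run_job) and the byte-sum partitioning rule are all assumptions chosen for illustration, and plain Python lists stand in for the processing nodes.

```python
from collections import defaultdict


def map_phase(subsets):
    """Each simulated processing node turns its subset into (key, value) pairs."""
    return [[(word, 1) for word in subset.split()] for subset in subsets]


def shuffle_phase(intermediate, num_reducers):
    """Move every pair to the reducer that owns its key, grouping values by key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for node_output in intermediate:
        for key, value in node_output:
            # Hypothetical partitioning rule: a deterministic byte sum, so the
            # same key always lands on the same reducer.
            owner = sum(key.encode("utf-8")) % num_reducers
            partitions[owner][key].append(value)
    return partitions


def reduce_phase(partitions):
    """Each reducer combines the values for its keys; the results are merged."""
    result = {}
    for partition in partitions:
        for key in sorted(partition):  # keys are processed in sorted order
            result[key] = sum(partition[key])
    return result


def run_job(subsets, num_reducers=2):
    """Run the map, shuffle and reduce phases in sequence."""
    return reduce_phase(shuffle_phase(map_phase(subsets), num_reducers))


if __name__ == "__main__":
    print(run_job(["a b a", "b c"]))
```

In a real deployment the subsets, the intermediate pairs and the partitions would reside on different machines, and the shuffle would involve network transfer; the sketch only shows how data flows between the phases.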
A common pattern for systems that analyze large amounts of data is to have a central repository in which data gathered from different sources is deposited regularly. This repository may contain a cleaned-up version of the original raw data, often in a binary form that is efficient to read and process. There are many ways to organize this repository on the file system. For example, the repository may be partitioned on an attribute, with the data for each value of the attribute placed in a separate directory.
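A partitioned layout of this kind might be sketched as follows. The "attribute=value" directory naming, the file name part-00000, the helper names and the JSON-lines encoding are all assumptions made for readability; the repository described above may use a different naming scheme and would more likely store a binary format.

```python
import json
import os
from collections import defaultdict


def partition_records(records, attribute):
    """Group records by the value of the partitioning attribute."""
    groups = defaultdict(list)
    for record in records:
        groups[record[attribute]].append(record)
    return dict(groups)


def partition_dir(root, attribute, value):
    """Hypothetical naming scheme: one directory per attribute value."""
    return os.path.join(root, f"{attribute}={value}")


def write_partitioned(root, records, attribute, filename="part-00000"):
    """Append each group of records to a file in its partition directory."""
    written = []
    for value, group in partition_records(records, attribute).items():
        directory = partition_dir(root, attribute, value)
        os.makedirs(directory, exist_ok=True)
        path = os.path.join(directory, filename)
        # Opened in append mode, so repeated deposits extend the same file.
        with open(path, "a", encoding="utf-8") as handle:
            for record in group:
                handle.write(json.dumps(record, sort_keys=True) + "\n")
        written.append(path)
    return written
```

Calling write_partitioned with records that carry, for example, a "date" attribute would produce one directory per distinct date under the root, and a later call with further records for an existing date would append to that date's file rather than create a new one.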
For such organized repositories, it is important to be able to safely append to a set of files during a map-reduce job so that the files continue to be readable under various failure scenarios. Map-reduce systems such as Hadoop typically use a non-POSIX file system (e.g., the Hadoop Distributed File System (HDFS)) in which files can only be appended to. Consequently, when tasks of a job fail, it is not possible to delete the partially appended data that may be left in different files. Simply creating new (potentially small) files each time under the organized structure can incur substantial overhead in a file system designed for large files.