This invention relates to the field of map-reduce jobs. In particular, the invention relates to outputting the results of map-reduce jobs to an archive file.
Map-reduce frameworks such as Apache Hadoop (Apache and Hadoop are trademarks of The Apache Software Foundation) are well suited to reading and writing large quantities of data. They use a cluster of machines to run map-reduce jobs that process the data, and provide a distributed file system to store data files. Map-reduce frameworks are designed to scale to process more data without degrading performance. This is achieved by adding machines, so that more instances of map or reduce tasks can run in parallel to process the data.
Although distributed file systems allow map-reduce tasks to efficiently perform concurrent reads on a single file opened on the distributed file system, it is not possible for multiple map-reduce tasks running within a map-reduce job to concurrently update a single file stored on the distributed file system. For example, it is not possible for a map-reduce task to lock a region of a distributed file system file to update it.
A consequence of this is that it is difficult for a map-reduce job to scale well while storing its results in a single output archive file (for example, a zip formatted file) that is portable and can easily be read by a wide variety of applications.
In data mining, there is just such a use case: building a “split” model on a big dataset, where the split model is an archive containing hundreds of thousands or even millions of individual model files.
Typical approaches used in known map-reduce frameworks for scalable output are to: (i) store results in a distributed database system such as a NoSQL (Not Only Structured Query Language) database, which allows concurrent update; or (ii) spread the output across multiple distributed file system files (where each map-reduce task writes a separate file). However, neither of these techniques outputs a single archive file which is easy for other applications to consume.
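By way of illustration only, approach (ii) can be sketched as follows. This is a minimal local simulation, not code for any particular map-reduce framework: threads stand in for map-reduce tasks, a temporary directory stands in for the distributed file system, and the function names (run_task, run_job, read_all) and the "part-NNNNN" file naming are assumptions introduced for this example.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def run_task(task_id, records, output_dir):
    """Simulate one task: write its results to its own, separate part file."""
    # Hypothetical naming scheme; each task owns exactly one file,
    # so no two tasks ever update the same file.
    part_path = os.path.join(output_dir, f"part-{task_id:05d}")
    with open(part_path, "w", encoding="utf-8") as out:
        for key, value in records:
            out.write(f"{key}\t{value}\n")
    return part_path


def run_job(partitions, output_dir):
    """Run one simulated task per partition, in parallel."""
    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        futures = [
            pool.submit(run_task, task_id, records, output_dir)
            for task_id, records in enumerate(partitions)
        ]
        return sorted(f.result() for f in futures)


def read_all(output_dir):
    """A consumer has to discover and read every part file to see the full result."""
    rows = []
    for name in sorted(os.listdir(output_dir)):
        with open(os.path.join(output_dir, name), encoding="utf-8") as part:
            rows.extend(line.rstrip("\n").split("\t") for line in part)
    return rows


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()
    run_job([[("model-a", "0.91")], [("model-b", "0.87"), ("model-c", "0.78")]], out_dir)
    print(os.listdir(out_dir))
    print(read_all(out_dir))
```

The writes scale because no file is shared between tasks, but the result is a directory of fragments rather than one portable archive, so every consuming application needs to know the layout and reassemble the output itself.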