Technical Field
The invention relates to computer file systems. More particularly, the invention relates to a map-reduce ready distributed file system.
Description of the Background Art
Distributed cluster computing using the map-reduce style of programming was described by Jeffrey Dean and Sanjay Ghemawat. See, J. Dean, S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI'04: Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation, USENIX Association (2004). In this style, computation is broken down into a map phase, a shuffle phase, and a reduce phase. FIG. 1 shows a simplified schematic of this form of computation. An input 101 is divided into pieces referred to as input splits. Each input split is a contiguous range of the input. Each record in each input split is independently passed to instances of a map function 102, represented herein as f1. This map function is defined to accept a single record as an input and to produce zero or more output records, each of which contains a key and a value. The output records from the map functions are passed to the shuffle 103, which rearranges records so that all values with the same key are grouped together. Instances of the reduce function 104 are represented herein as f2. The reduce function is defined to take two arguments, the first being a key and the second being a list of the values grouped under that key. The output of f2 consists of zero or more records, which are stored in output files 105.
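The three phases described above can be illustrated with a minimal single-process sketch. It is not taken from the cited paper or from any particular implementation. The word-count bodies of f1 and f2, the helper names shuffle and map_reduce, and the sample input are all assumptions chosen only for illustration, and the sketch ignores distribution, failure handling, and storage entirely.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def f1(record: str) -> Iterator[Tuple[str, int]]:
    """Map function: accepts one record, emits zero or more (key, value) records.

    The word-count behavior is a hypothetical example, not part of the source.
    """
    for word in record.split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values that share the same key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def f2(key: str, values: List[int]) -> Iterator[Tuple[str, int]]:
    """Reduce function: takes a key and its list of values, emits zero or more records."""
    yield key, sum(values)


def map_reduce(input_splits: List[List[str]]) -> List[Tuple[str, int]]:
    """Run map, shuffle, and reduce over a list of input splits."""
    # Map phase: every record of every split is passed independently to f1.
    mapped = [pair for split in input_splits for record in split for pair in f1(record)]
    # Shuffle phase: values with the same key are brought together.
    grouped = shuffle(mapped)
    # Reduce phase: f2 is applied once per key; keys are sorted for stable output.
    return [out for key in sorted(grouped) for out in f2(key, grouped[key])]


if __name__ == "__main__":
    # Two input splits, each holding one record.
    print(map_reduce([["a b a"], ["b c"]]))
```

In a real cluster each split would be handled by a separate map task and each key group by a reduce task, so the intermediate data in `mapped` is exactly what the shuffle must move between machines.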
This style of computation provides sufficient generality to be widely useful for processing large scale data, while simultaneously having simple enough semantics to allow high degrees of failure tolerance. However, map-reduce programs impose severe loads on file systems that are difficult to support with conventional file systems.
The original map-reduce implementation at Google (see U.S. Pat. No. 7,650,331) was accompanied by a write-once file system referred to as GFS. Subsequently, the Apache Hadoop project has built a rough clone of Google's map-reduce known as Hadoop. Associated with Hadoop is a file system known as the Hadoop Distributed File System (HDFS) that fills the same role as GFS.
Both GFS and HDFS are write-once file systems that adopt replication across several machines as a reliability mechanism in preference to more traditional error correcting methods, such as RAID. The write-once semantics of both systems makes replication a relatively simple strategy to implement. The replication also allows map-phase tasks to be placed near a copy of the data being read, giving a substantial performance boost because local disk access is generally considerably faster than network access.
Both Google's map-reduce and Hadoop use local file systems during the shuffle phase largely because it is difficult to support the file-create loads imposed by the shuffle. For instance, a large computation with 10,000 map splits and 1000 reducers produces 10 million output partitions. The simplest implementation of the shuffle would use the distributed file system to store each of these partitions in a separate file. Such an approach makes the shuffle operation almost trivial, but it requires that the cluster be able to create these millions of files within a few seconds. Unfortunately, HDFS is limited to a file creation rate of at most a thousand files per second and GFS is also limited in this respect. These limits occur because a central meta-data server handles meta-data and block location lookup in both HDFS and GFS. The implementation choice to use a central meta-data and location server is forced by the write-once nature of the file system because file meta-data is highly mutable.
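The scale of the problem can be checked with a short calculation based only on the figures given above (10,000 map splits, 1000 reducers, and at most a thousand file creations per second). The variable names are illustrative, and the calculation assumes the simplest shuffle implementation, with one file per partition and creation at the full stated rate.

```python
# Figures taken from the text above.
map_splits = 10_000
reducers = 1_000
max_create_rate = 1_000  # files per second, the stated upper bound for HDFS

# Each map split writes one output partition per reducer.
partitions = map_splits * reducers

# Time needed to create one file per partition at the maximum rate.
seconds_needed = partitions / max_create_rate
hours_needed = seconds_needed / 3600

print(f"partitions: {partitions:,}")
print(f"creation time: {seconds_needed:,.0f} s (about {hours_needed:.1f} h)")
```

Under these assumptions the file creations alone would take on the order of hours, whereas the simple shuffle described above needs them to complete within a few seconds.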
Storing shuffle partitions as local files is also not feasible in either Hadoop or GFS because the local file systems cannot support simultaneous access to tens of thousands of files by multiple processes. The constraints imposed by the local file system have led to complex shuffle implementations that are very difficult to get to a bug-free state and that are difficult for users to tune for performance.
Systems such as Hadoop also suffer severe performance penalties when large numbers of small to medium sized files are stored in the system. The write-once nature of the files, combined with the desire for large files and the need for data to be integrated within minutes of receipt, often leads to applications that record data for short periods of time and then repeatedly concatenate files to form large files. Managing the concatenation and safe deletion of small files is time-consuming and wastes large amounts of resources. There are estimates that as much as half of the cluster capacity at companies such as Twitter and Facebook is devoted to the concatenation of files in this fashion.
The history of distributed file systems is long and varied, but for the key design points of a map-reduce ready distributed file system, a small number of systems can be used to illustrate the state of the art. None of these systems meets the need for full support of a map-reduce cluster in terms of transactional integrity, read/write access, large aggregate bandwidth, and file-create rate. More importantly, the methods used in these systems to meet one or more of these requirements separately make it impossible to meet the other requirements. This means that it is not possible to meet all of the requirements by simply combining methods from these systems.
As discussed above, GFS and HDFS provide write-once, replication-based file systems. The use of replicas provides high bandwidth, but makes transactional integrity in a read/write environment difficult. This motivates the write-once design of these systems, and that write-once nature forces the use of a central meta-data server. Central meta-data servers, in turn, make it nearly impossible to meet the file-create rate requirements. Thus, the mechanism used in GFS and HDFS to meet the bandwidth requirement inherently precludes meeting the read/write and file-create requirements without new technology. In addition, both HDFS and GFS are severely limited in terms of the total number of files that they can manage.
GPFS is a distributed file system from IBM that has been used in a limited way with Hadoop. See http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=%2Fcom.ibm.cluster.gpfs31.advanceadm.doc%2Fbl1adv_gpfsrep.html. GPFS provides coherent read/write capabilities by using a distributed lock manager that allows a single node to be specified as the master for each file or file region. GPFS is able to support relatively large file stores without a centralized metadata store, but the locking scheme is unable to support high file-create rates because the throughput on the lock manager is very limited. Based on published documentation (see F. Schmuck, R. Haskin, GPFS: A Shared-Disk File System for Large Computing Clusters, Usenix FAST Conference 2002, http://www.usenix.org/publications/library/proceedings/fast02/schmuck.html), the creation of 10 million files in one second in a cluster of 1000 machines would require over 2000 lock manager servers. Realistic clusters are limited to considerably less than one hundred thousand file-create operations per second.
In GPFS, replication is supported only as part of a disaster recovery scheme through mirroring. The lack of first-class replication limits aggregate read bandwidth. In addition, the mirroring scheme requires quorum semantics to avoid loss of data, which makes the cluster much more failure sensitive.
pNFS (see http://www.pnfs.com/) is a parallel NFS implementation that uses many NFS servers and a central meta-data server. pNFS lacks transactional update support and, thus, does not provide coherent read/write semantics with replicas. The use of a central meta-data server severely limits the maximum file-create rate. The use of a farm of independent NFS servers for object storage makes file chunk replication difficult as well because there is no easy way to support transactionally safe replication with NFS servers. Node failure tolerance also appears to be a difficult problem with pNFS.
Ceph is an experimental distributed file system that uses an object store with an associated meta-data server. See S. Weil, S. Brandt, E. Miller, D. Long, C. Maltzahn, Ceph: A Scalable, High-Performance Distributed File System, Proceedings of the 7th Conference on Operating Systems Design and Implementation, OSDI '06 (Nov. 2006). Ceph is unable to provide coherent file chunk replicas and thus is bandwidth limited. Replication was added to Ceph as an afterthought, and thus it is not suitable for use in failure-tolerant map-reduce systems. The meta-data server also imposes a limit on file-create rates. While Ceph avoids the problem of having a single meta-data server, it is still limited in terms of the number of file-creates that can be performed per second.
AFS is a distributed file store that has no support for read-write replication. See http://www.cmu.edu/corporate/news/2007/features/andrew/what_is_andrew.shtml. Under read loads, AFS allows caching of file contents close to the file client. These caches are revoked when updates are done. There is also no support for running the application on the same machine as the fileserver, and thus data-locality is absent. Because there is only one master copy of any file, failures in large clusters mean that data becomes unavailable.
None of the foregoing systems is able to provide a fully distributed, replicated file system that allows transactional updates and cluster-wide snapshots while still supporting the requisite file-create rate imposed by map-reduce systems. Map-reduce programs can be executed using such file systems, but only by moving some of the load associated with map-reduce computation off of the file system and into a secondary storage system. Where file update is supported, the failure tolerance of these systems is also insufficient to allow large-scale operation with commodity grade equipment.