This invention relates to parallel access to data in a distributed file system.
An example of a distributed file system is the Hadoop Distributed File System (HDFS). HDFS is a distributed, scalable, and portable file system written in Java. The HDFS has a set of nodes (“datanodes”) that hold data for the files in the file system, and can serve up blocks of the files over a data network. Each file is generally distributed over multiple nodes. A directory for the file system is maintained by a set of nodes (“namenodes”). This directory can be used to identify the locations of the distributed blocks for each named file in the filesystem.
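As a purely illustrative sketch of the directory role described above, the following Python fragment models a directory that maps a named file to the nodes holding its blocks. The names `BlockLocation`, `directory`, `locate_blocks`, and `nodes_holding`, the file path, and the node names are all hypothetical and are not part of any HDFS interface; block replication is ignored for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockLocation:
    """One block of a file and the (hypothetical) datanode that holds it."""
    block_index: int
    datanode: str


# Hypothetical directory contents, standing in for what a namenode maintains:
# each named file maps to the locations of its distributed blocks.
directory = {
    "/data/students.txt": [
        BlockLocation(0, "datanode-a"),
        BlockLocation(1, "datanode-c"),
        BlockLocation(2, "datanode-b"),
    ],
}


def locate_blocks(filename):
    """Return the block locations recorded for a named file."""
    return directory[filename]


def nodes_holding(filename):
    """Return the sorted, de-duplicated set of nodes holding blocks of the file."""
    return sorted({location.datanode for location in locate_blocks(filename)})


holding_nodes = nodes_holding("/data/students.txt")
print(holding_nodes)
```

A client that knows which nodes hold which blocks could, in principle, request different blocks from different nodes rather than reading the whole file from one place.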
Referring to FIGS. 1A-B, one way of processing data in a distributed file system, for example in the HDFS, is using a MapReduce programming model. Generally, a MapReduce program includes a Map procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). A user of the system specifies the Map and Reduce procedures, but does not necessarily determine the number of instances (or invocations) of each procedure (i.e., “processes”) or the nodes on which they execute. Rather, a “MapReduce System” (also called “infrastructure” or “framework”) orchestrates the processing by marshaling a set of distributed nodes, running the various tasks (e.g., the Map and Reduce procedures and associated communication) in parallel, managing all communications and data transfers between the various parts of the system, providing for redundancy and handling failures, and managing the process as a whole. A MapReduce system can schedule execution of instances of Map or Reduce procedures with an awareness of the data location. In FIG. 1A, the HDFS is an example of a distributed filesystem 110 and the Hadoop framework is an example of a distributed processing system, such that a set of map procedures 24 execute on processors 122 in parallel to process parts 14 of a distributed file and store output files 16 in the filesystem 110. In FIG. 1B, a set of reduce procedures 26 process the outputs of the map procedures to produce output files 18, completing the MapReduce execution.
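The student example above can be sketched in a few lines of Python. This is a minimal single-process illustration of the programming model only, not of the Hadoop framework: the function names (`map_students`, `group_into_queues`, `reduce_queue`, `run_mapreduce`) and the sample data are assumptions made for illustration, and the grouping step that a real MapReduce system performs across nodes is simulated here with an in-memory dictionary.

```python
from collections import defaultdict


def map_students(students):
    """Map procedure: emit a (first_name, student) pair for each student."""
    for student in students:
        first_name = student.split()[0]
        yield first_name, student


def group_into_queues(pairs):
    """Stand-in for the framework's grouping step: one queue per first name."""
    queues = defaultdict(list)
    for first_name, student in pairs:
        queues[first_name].append(student)
    return dict(sorted(queues.items()))


def reduce_queue(first_name, queue):
    """Reduce procedure: summarize a queue as the count of students in it."""
    return first_name, len(queue)


def run_mapreduce(parts):
    """Apply the Map procedure to each part, group, then reduce each queue."""
    pairs = [pair for part in parts for pair in map_students(part)]
    queues = group_into_queues(pairs)
    return dict(reduce_queue(name, queue) for name, queue in queues.items())


# Two hypothetical parts of a distributed file, each a list of student names.
parts = [
    ["Ada Lovelace", "Alan Turing"],
    ["Ada Yonath", "Grace Hopper"],
]
name_frequencies = run_mapreduce(parts)
print(name_frequencies)  # {'Ada': 2, 'Alan': 1, 'Grace': 1}
```

Note that the user-supplied pieces are only `map_students` and `reduce_queue`; how many instances run, and where, is left to the surrounding system, which in this sketch is the single function `run_mapreduce`.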
Although there are computational advantages that can be gained using a distributed file system that is integrated with a MapReduce system, access to data for processing with other programming paradigms can be inefficient. For example, extraction of data may result in a bottleneck, such as passing all the data for a file through a single data stream, even if the processing of the data is itself parallel. The HDFS, for instance, provides a filesystem command (hadoop fs -cat file) that passes the content of the file to the standard output, from which it can potentially be streamed to a process or server where the contents may be processed in parallel.
One approach to avoiding this type of bottleneck is to explicitly divide large files into separate named parts (i.e., with individual entries in the directory), and then cause each part to be extracted in a separate stream. However, such approaches may have disadvantages, for instance, requiring a prior decision as to the number of parts, and a potentially suboptimal selection of nodes on which to extract the parts, for example, because the named parts are themselves distributed.
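The named-parts approach can be illustrated with the following minimal Python sketch, in which files in a local temporary directory stand in for separately named parts in a distributed file system. The helper names (`write_named_parts`, `extract_part`, `extract_in_parallel`), the `part-00000` naming pattern, and the record contents are hypothetical. The sketch makes the first disadvantage visible: `num_parts` has to be chosen before any extraction takes place.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_named_parts(directory, records, num_parts):
    """Divide the records into num_parts separately named files, fixed up front."""
    paths = []
    for i in range(num_parts):
        path = Path(directory) / f"part-{i:05d}"
        # Round-robin assignment: part i receives records i, i + num_parts, ...
        path.write_text("".join(f"{record}\n" for record in records[i::num_parts]))
        paths.append(path)
    return paths


def extract_part(path):
    """Read one named part through its own stream and return its records."""
    with open(path) as stream:
        return [line.rstrip("\n") for line in stream]


def extract_in_parallel(paths):
    """Extract every named part concurrently, one stream per part."""
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        return list(pool.map(extract_part, paths))


with tempfile.TemporaryDirectory() as tmp:
    records = [f"record-{n}" for n in range(10)]
    part_paths = write_named_parts(tmp, records, num_parts=3)
    extracted = extract_in_parallel(part_paths)

total_records = sum(len(part) for part in extracted)
print(total_records)  # 10
```

The sketch says nothing about where each extracting worker runs relative to the data; in a distributed setting each named part may itself be spread over several nodes, which is the second disadvantage noted above.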
Another approach to providing data from a distributed filesystem to an application, for instance to a database engine, is to implement the application within or tightly coupled to the filesystem. However, such an approach can be limited to processing using a specific application, and does not necessarily benefit from advances in implementations of the application that are not ported into the filesystem.