A distributed computing system comprises multiple computers, computing nodes, or processing nodes which operate independently, or somewhat independently, to achieve or provide results toward a common goal. Distributed computing may be chosen over a centralized computing approach for many different reasons. For example, in some cases, the system or data for which the computing is being performed may be inherently geographically distributed, such that a distributed approach is the most logical solution. In other cases, using multiple processing nodes to perform pieces of the processing job may be a more cost effective solution than using a single computing system, with much more processing power, that is capable of performing the job in its entirety. In other cases, a distributed approach may be preferred in order to avoid a system with a single point of failure or to provide redundant instances of processing capabilities. While processing nodes in parallel computing arrangements often have shared memory or memory resources, processing nodes in distributed computing arrangements typically use some type of local or private memory.
A variety of jobs can be performed using distributed computing. In some cases, commonly referred to as distributed data analytics, the data sets are very large and the analysis performed may span hundreds of thousands of processing nodes. In these cases, management of the data sets that are being analyzed becomes a significant and important part of the processing job. Software frameworks have been specifically developed for performing distributed data analytics on large data sets. MapReduce is one example of such a software framework for performing distributed data analytics processes on large data sets using multiple processing nodes. In a Map step of a MapReduce process, a job is divided up into smaller sub-jobs which are commonly referred to as tasks and are distributed to the processing nodes. The processing nodes perform these tasks independently and each produces intermediate results. The Reduce step, also utilizing a set of processing nodes, combines all of the individual intermediate results into an overall result. Apache Hadoop is a specific example of a software framework designed for performing distributed data analytics on very large data sets.
In distributed computing systems, the different portions or chunks that make up a full data set are often stored at or near the processing nodes that are performing analysis operations on them. The typical behavior is to schedule the tasks such that the data to be processed is available locally. In some cases, the data chunks may be moved between the processing nodes as different processing needs arise. In addition, multiple copies of the data set typically are created to satisfy data reliability and availability requirements. It is important to manage the data chunks such that storage space is not wasted by having too many instances of each data chunk, as well as managing changes to the data chunks. Dedicated file systems are used to manage the unique file management issues in distributed computing and analytics environments.
For example, Hadoop based systems use the Hadoop Distributed File System (HDFS). HDFS runs on top of the file systems of the underlying operating systems of the processing nodes and manages the storage of data sets and data chunks across the storage spaces associated with the nodes. This includes making the proper data chunks available to the appropriate processing nodes as well as replicating the data across nodes to satisfy data reliability and availability requirements. Nodes are often spread out over different geographic locations and at least one copy of the data is typically stored on multiple nodes in different geographic locations for backup purposes.
In addition to managing backup copies, data management and reliability activities may also need to access or manipulate the data chunks for other related purposes. The data management activities may include backup, archiving, mirroring, cleanup, compression, deduplication, and optimization. While these activities are being performed, access to the data chunks is typically still needed for analysis and processing purposes. Analysis performance can be degraded if data access is slowed or prohibited while the data is being accessed for these other purposes. Consequently, performance and reliability requirements often result in competing objectives.
In addition, moving data into or out of a specialized file system, such as HDFS, typically requires additional software activities, interfacing with other data management systems, tools, or processes and is often inefficient or cumbersome. Since distributed analytics processes typically utilize specialized file systems purpose-built for analytics, they lack many of the standard interfaces. In particular, legacy applications such as databases, source control, and productivity applications are unable to store data in the specialized file systems. Therefore, analysis of the data generated by these types of applications typically requires that the data be migrated or exported from the traditional file system and imported into the specialized file system before running an analytics job. This migration phase can be time consuming and tedious as it involves moving large amounts of data and must be performed every time data from the traditional file system is transferred.