Developments in computer and networking technology have given rise to applications that require massive amounts of data storage. For example, tens of millions of users can create web pages and upload images and text to a social media website. Consequently, a social media website can accumulate massive amounts of data each day and therefore need a highly scalable system for storing and processing data. Various tools exist to facilitate such mass data storage.
Hadoop is a popular open source framework that supports large-scale data-intensive distributed applications, by enabling applications to interact with a cluster of thousands of computers (also referred to as nodes) and petabytes of data. The Hadoop framework utilizes a distributed, scalable, portable file system, called Hadoop Distributed File System (HDFS), to distribute a massive amount of data among data nodes (also referred to as slave nodes) in a Hadoop cluster. In order to reduce the adverse impact of a data node power outage or network failure (including switch failure), data in an HDFS is typically replicated on different data nodes. The HDFS can include at least one metadata node to host a file system index.
However, in order to store the huge amount of data in the Hadoop cluster, the cluster needs a large number of nodes. The number of nodes requires a significant space to maintain the cluster. In addition, all the hardware running within the cluster consumes a significant amount of power. These factors limit the usability of a conventional Hadoop cluster.
For example, certain applications employing real-time data collection and analysis, such as signal intelligence gathering, may require both mobility and very large data storage capacity. Yet a conventional Hadoop cluster, which has very large storage capacity, is not mobile due to its size and power consumption limitations. Therefore, unless the data collection equipment for such an application has a communication link (wired or wireless) to a conventional Hadoop cluster while collecting data “in the field,” which may not be practical or desirable, the application cannot take advantage of the benefits of a conventional Hadoop cluster.