The present application relates generally to an improved data processing apparatus and method and more specifically to mechanisms for optimized creation of distributed storage and distributed processing clusters on demand.
Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework. The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part called MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers packaged code to the nodes, which process in parallel the data that needs to be processed. This approach takes advantage of data locality, in which nodes manipulate the data to which they have access, to allow the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.
The base Apache Hadoop framework is composed of the following modules:
Hadoop Common: contains libraries and utilities needed by other Hadoop modules;
Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
Hadoop YARN: a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users' applications; and
Hadoop MapReduce: an implementation of the MapReduce programming model for large-scale data processing.
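By way of illustration only, the block splitting, block placement, and locality-aware processing described above may be sketched as follows. This is a minimal single-process sketch and does not use or represent the Hadoop API. The function names (split_into_blocks, place_blocks, run_word_count), the round-robin placement policy, the line-based block size, and the word-count job are all assumptions introduced for the example.

```python
from collections import Counter


def split_into_blocks(lines, lines_per_block):
    """Split a file (modeled as a list of lines) into fixed-size blocks."""
    return [lines[i:i + lines_per_block]
            for i in range(0, len(lines), lines_per_block)]


def place_blocks(num_blocks, nodes, replication):
    """Hypothetical round-robin placement: block i is stored on
    `replication` consecutive nodes, so each block has several replicas."""
    replication = min(replication, len(nodes))
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def map_word_count(block):
    """Map step: count the words in one block."""
    counts = Counter()
    for line in block:
        counts.update(line.split())
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-block counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_word_count(lines, nodes, lines_per_block=2, replication=2):
    """Schedule each map task on a node that already holds the block
    (data locality), then reduce the partial results."""
    blocks = split_into_blocks(lines, lines_per_block)
    placement = place_blocks(len(blocks), nodes, replication)
    schedule = {}
    partials = []
    for block_id, block in enumerate(blocks):
        # Pick the first replica holder; the code moves to the data.
        schedule[block_id] = placement[block_id][0]
        partials.append(map_word_count(block))
    return reduce_counts(partials), schedule


if __name__ == "__main__":
    totals, schedule = run_word_count(["a b a", "b c", "a"], ["n1", "n2", "n3"])
    print(dict(totals), schedule)
```

In this sketch every map task is assigned to a node that holds a replica of its block, so no block has to be moved over the network before it is processed; only the small per-block counts are combined in the reduce step.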
A clustered file system is a file system that is shared by being simultaneously mounted on multiple servers. There are several approaches to clustering, most of which do not employ a clustered file system and instead use only direct-attached storage for each node. Clustered file systems can provide features like location-independent addressing and redundancy, which improve reliability or reduce the complexity of the other parts of the cluster. A parallel file system is a type of clustered file system that spreads data across multiple storage nodes, usually for redundancy or performance.
Distributed file systems do not share block-level access to the same storage but instead use a network protocol. These are commonly known as network file systems, even though they are not the only file systems that use the network to send data. Depending on how the protocol is designed, distributed file systems can restrict access to the file system based on access lists or capabilities on both the servers and the clients.
A distributed computing system is a model in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal. Three significant characteristics of distributed systems are: concurrency of components, lack of a global clock, and independent failure of components. Examples of distributed systems vary from SOA-based systems to massively multiplayer online games to peer-to-peer applications.
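By way of illustration only, the message-passing model described above may be sketched as follows. This is a minimal sketch in which threads within one process stand in for networked computers and in-memory queues stand in for the network. The names worker and distributed_sum, the message format, and the summation task are assumptions introduced for the example.

```python
import queue
import threading


def worker(inbox, outbox):
    """A component that shares no state with its peers: it only reads
    messages from its inbox and writes replies to the outbox."""
    while True:
        message = inbox.get()
        if message is None:  # hypothetical shutdown message
            break
        task_id, numbers = message
        outbox.put((task_id, sum(numbers)))


def distributed_sum(chunks):
    """Coordinator: send one chunk to each worker, then combine replies."""
    outbox = queue.Queue()
    inboxes = [queue.Queue() for _ in chunks]
    threads = [threading.Thread(target=worker, args=(inbox, outbox))
               for inbox in inboxes]
    for thread in threads:
        thread.start()

    for task_id, (inbox, chunk) in enumerate(zip(inboxes, chunks)):
        inbox.put((task_id, chunk))
        inbox.put(None)

    # Replies may arrive in any order because the workers run concurrently
    # and there is no global clock, so each reply is keyed by its task id.
    results = {}
    for _ in chunks:
        task_id, partial = outbox.get()
        results[task_id] = partial

    for thread in threads:
        thread.join()
    return sum(results.values())


if __name__ == "__main__":
    print(distributed_sum([[1, 2], [3, 4], [5]]))
```

The components coordinate solely through the messages they exchange, and the coordinator makes no assumption about the order in which replies arrive. Independent failure of components is not modeled in this sketch.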