Field of the Disclosure
The present disclosure relates to a framework for processing Big-Data. Specifically, the disclosure relates to an enhanced Hadoop framework, implemented in a cluster of data processing nodes, that improves the processing of Big-Data.
Description of the Related Art
The background description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent the work is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.
Hadoop is a new technology that provides processing services for Big-Data problems in cloud computing. Many studies have proposed ways to improve Hadoop MapReduce performance from different perspectives. Several address the optimization of Hadoop and MapReduce jobs, for example job scheduling and execution time, while others address data locality in cloud computing.
Job scheduling and job execution time are important aspects of Hadoop. Different studies have reported improvements that show positive results under a certain set of assumptions. Others focus on the duration of the initialization and termination phases of MapReduce jobs. System memory also presents several issues that could be addressed to improve system performance. Apache Hadoop implements a centralized memory approach to control caching and resources, and supports centralized data caching. However, some studies use a distributed caching approach to improve Hadoop performance. Other approaches address memory issues in different ways. One such technique, referred to as Shm-Streaming, introduces a shared-memory streaming schema that provides a lockless FIFO queue connecting Hadoop and external programs.
In current Hadoop, input data is located on different nodes in the cluster. Hadoop distributes replicated data across different nodes in different network racks. This strategy helps for various reasons, one of which is fault tolerance, which provides greater reliability and scalability. However, the default data placement strategy can degrade the performance of map and reduce tasks.
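By way of illustration only, the rack-based distribution of replicated blocks can be sketched in Python as follows. The sketch assumes the commonly documented HDFS default for a replication factor of three: the first replica on the writer's node, the second on a node in a different rack, and the third on another node in that same remote rack. The function name `place_replicas`, the `topology` mapping, and the node and rack names are hypothetical, and candidates are picked deterministically for reproducibility; this is not Hadoop's actual code.

```python
from collections import defaultdict


def place_replicas(topology, writer_node, replication=3):
    """Pick nodes for the replicas of one block.

    topology: dict mapping node name -> rack name (hypothetical layout).
    writer_node: node that writes the block; it holds the first replica.
    """
    writer_rack = topology[writer_node]

    # Group nodes by rack, in sorted order so the result is reproducible.
    by_rack = defaultdict(list)
    for node in sorted(topology):
        by_rack[topology[node]].append(node)

    chosen = [writer_node]

    # Second replica: a node on a rack other than the writer's rack.
    remote_racks = sorted(r for r in by_rack if r != writer_rack)
    if remote_racks and replication > 1:
        remote_rack = remote_racks[0]
        chosen.append(by_rack[remote_rack][0])
        # Third replica: a different node on that same remote rack.
        if replication > 2:
            for node in by_rack[remote_rack]:
                if node not in chosen:
                    chosen.append(node)
                    break

    # Fallback: fill any remaining slots from unused nodes.
    for node in sorted(topology):
        if len(chosen) >= replication:
            break
        if node not in chosen:
            chosen.append(node)

    return chosen[:replication]


topology = {"n1": "rackA", "n2": "rackA", "n3": "rackB", "n4": "rackB"}
replicas = place_replicas(topology, "n1")
print(replicas)          # ['n1', 'n3', 'n4']
print("n2" in replicas)  # False: a task scheduled on n2 has no local copy
```

In this hypothetical layout, a map task assigned to node n2 would have to read the block from another node, which is the kind of non-local access that can slow map and reduce tasks.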
Accordingly, there is a requirement for an improved Hadoop framework which enables identification of blocks in the cluster where certain information is stored. Specifically, there is a requirement for a framework which manages Big-Data applications and improves the overall performance of the system.