The present disclosure relates to a system, a method, and an apparatus for replicating data in a Hadoop, and more particularly, to a system, a method, and an apparatus for dynamically replicating data for a heterogeneous Hadoop, which provide a dynamic data replication method for dynamically replicating data based on a probability that a map task will be allocated to a map task slot of an optimal data node by a data replication method for a Hadoop having heterogeneous clusters, and a dynamic data eviction method for evicting data based on a data access frequency.
In recent years, cloud computing has been the biggest topic in the information technology (IT) industry.
The amount of data is increasing exponentially due to the spread of mobile devices and tablets. Accordingly, the term “big data” has come into use, and the importance of cloud computing is increasing day by day.
Cloud computing is a computing environment in which IT-related services such as the storage of data, the use of network content, and the like can be simultaneously used through a server on the Internet.
As cloud computing has come into the spotlight, interest in Hadoop and MapReduce has naturally increased as well.
A Hadoop is a Java-based software framework and is also an open-source-based distributed computing platform, which supports a distributed application program running on a big computer cluster that can process a massive amount of data.
The Hadoop is an open source framework that is composed of MapReduce, which is a distributed processing programming model, and a Hadoop Distributed File System (HDFS), which is used throughout Hadoop.
MapReduce is a framework that distributes a massive amount of data to several nodes to process the data. An HDFS is a distributed file system that stores the massive amount of data that Hadoop processes.
Conventionally, a Hadoop delivers a task to a data node that stores data in order to minimize network congestion and increase throughput of the entire system. The most ideal performance may be obtained when a task is performed by a data node that stores a data block.
However, when all of the data nodes that store the input data of a task are already busy performing other tasks, the task is allocated to a different data node and is performed only after the data is copied to that node, which causes a delay. This reduces the performance of MapReduce.
Conventionally, a Hadoop stores three copies of each piece of data stored in an HDFS in a corresponding data node, and does not have a data replication method for dynamically adjusting the number of copies according to a real-time data access request.
An access count for each piece of the data stored in the HDFS is different, and it is inefficient to keep the number of copies of data frequently requested to be accessed equal to the number of copies of data infrequently requested to be accessed.
This is because a Hadoop delivers a task to a data node at which input data is placed in order to minimize the usage of network bandwidth and maximize throughput of MapReduce jobs.
The best performance may be obtained when a task is performed by a data node that stores its input data. However, when a piece of data has an access request count greater than the number of its copies, there is an increasing probability that a task will be allocated to a data node that does not store the data.
Accordingly, the task is performed only after the input data block is additionally copied to a currently available data node. This may cause a delay, thereby reducing the performance of MapReduce jobs.
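The effect described above can be illustrated with a minimal sketch. It is not part of the disclosure and does not model the actual Hadoop scheduler. It assumes that each data node holding a copy has exactly one free map task slot, that local slots are always filled first, and that every additional copy costs a fixed time. The names `place_tasks`, `total_copy_delay`, and `copy_seconds` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass

# Three copies per block, as stated in the text above.
DEFAULT_REPLICATION = 3


@dataclass
class Placement:
    local_tasks: int   # tasks run on a data node that already stores the block
    remote_tasks: int  # tasks that must first copy the block to another node


def place_tasks(concurrent_requests: int,
                replicas: int = DEFAULT_REPLICATION) -> Placement:
    """Split concurrent map tasks on one block into local and remote tasks.

    Assumption (not from the source): each node holding a copy has exactly
    one free map task slot, and local slots are always filled first.
    """
    if concurrent_requests < 0 or replicas < 0:
        raise ValueError("counts must be non-negative")
    local = min(concurrent_requests, replicas)
    return Placement(local_tasks=local,
                     remote_tasks=concurrent_requests - local)


def total_copy_delay(placement: Placement, copy_seconds: float) -> float:
    """Extra time spent copying blocks, assuming a fixed cost per copy."""
    return placement.remote_tasks * copy_seconds
```

Under these assumptions, five concurrent requests for a block with three copies leave two tasks that must wait for a block copy, whereas three or fewer requests are all served locally.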