Cloud computing means that a plurality of computers are linked as one cluster to constitute a cloud serving as a virtual computing platform and data storage and computation are delegated to the cloud which is a cluster of computers rather than an individual computer. Cloud computing is widely used in various fields, particularly in recent years, a ‘Big Data’ field.
Big Data refers to a huge amount of data of petabytes or more beyond terabytes. Since Big Data cannot be processed by a single computer, cloud computing is regarded as a basic platform for processing Big Data.
Meanwhile, as a representative example of the cloud computing environment for Big Data, Apache's Hadoop system is popular.
The Hadoop includes a Hadoop Distributed File System (HDFS) that allows input data to be divided and processed, and data that has been distributed and stored is processed by a MapReduce process developed for high-speed parallel processing of large amounts of data in a cluster environment.
However, the Hadoop distributed file system is configured as a central file system such that a manager for managing directories is provided centrally and performs all management processes to figure out what data is stored in each server and, thus, has a drawback in that it manages an excessively large amount of data, thereby degrading the performance.
Further, in the Hadoop distributed file system, the file is divided appropriately without taking the contents of the file into consideration and then distributed and stored in multiple servers. Accordingly, necessary data may be stored in only a specific server, and no data may be stored in a certain server.
Therefore, when the Hadoop receives a MapReduce task request, since several map functions are executed only in the server storing a large amount of specific data required for processing the MapReduce task request, there is a problem that a load balance cannot be achieved, thereby degrading the performance.