1. Technical Field
The present invention relates generally to an apparatus and method for managing resources in a cluster computing environment and, more particularly, to an efficient, policy-based resource management method and apparatus that are capable of distributing and managing resources while taking into consideration the characteristics of the resources in a cluster computing environment including high-performance heterogeneous resources. That is, the present invention relates to improving the efficiency of resource management by allocating optimal heterogeneous resources in accordance with the various characteristics of application software in an environment in which the nodes constituting a cluster system include heterogeneous resources.
2. Description of the Related Art
The cluster system is the mainstream distributed/parallel computing environment in the field of High Performance Computing (HPC). Furthermore, with the development of hardware technology, the resources of the nodes constituting a cluster system are becoming increasingly diverse and heterogeneous, and the capacity supported by each resource is increasing.
FIG. 1 is a diagram showing the configuration of a cluster resource management system, and FIG. 2 is a detailed diagram showing the resource agent node of FIG. 1.
The cluster resource management system may be implemented as a heterogeneous many-core-based HPC cluster resource management system.
Most HPC cluster systems provide a dedicated resource management system. Referring to FIG. 1, from the viewpoint of a resource management system, the hardware of a cluster system 100 may include resource agent nodes 140, each including computation performance acceleration nodes 150 based on heterogeneous computing devices, and a resource manager node 130 that provides effective system management and service of the heterogeneous resources. The resource agent nodes 140 and the resource manager node 130 may be connected over a system network based on a high-speed network. Accordingly, a client node 110 connected to the heterogeneous HPC cluster system over a public network 120 may access the heterogeneous HPC cluster system, request the allocation of resources to perform a task, and, once nodes including the requested resources have been allocated, execute application software on the allocated nodes.
As shown in FIG. 2, the resource agent nodes 141, 142 and 143, which constitute part of a cluster, may have different hardware resource configurations depending on their roles.
That is, in a heterogeneous many-core cluster, the nodes do not all have the same resource configuration and computing capability; instead, each node has a configuration and a computing capability specific to the resources it contains.
Accordingly, overall operational performance can be improved only when applications capable of efficiently using resources in accordance with the characteristics of each node are executed on that node. That is, as shown in FIG. 2, the resource agent node 140 may include nodes 141 including performance computation acceleration apparatuses such as Graphics Processing Units (GPUs), nodes 142 including a different type of performance computation acceleration apparatus such as a Many Integrated Core (MIC), and nodes 143 each equipped with high-capacity memory (BIGMEM).
Therefore, depending on the types of resources constituting each node, the resource agent node 140 may include a node configuration that guarantees better performance for application programs that chiefly use a Central Processing Unit (CPU), a node configuration suited to application programs requiring high-performance input/output or high-capacity memory, and a node configuration that guarantees better performance for data-intensive application programs.
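As an illustration only, the matching of application characteristics to node configurations described above might be expressed as a simple policy table. The following Python sketch is not the claimed method; the node names, the `NodeType` labels, the application profile strings, and the `POLICY` mapping are all hypothetical assumptions chosen to mirror the GPU, MIC and BIGMEM nodes 141, 142 and 143.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class NodeType(Enum):
    GPU = "gpu"        # hypothetical label for nodes such as 141
    MIC = "mic"        # hypothetical label for nodes such as 142
    BIGMEM = "bigmem"  # hypothetical label for nodes such as 143


@dataclass(frozen=True)
class Node:
    name: str
    node_type: NodeType
    free: bool = True  # whether the node is currently unallocated


# Hypothetical policy table: application characteristic -> preferred node type.
POLICY: Dict[str, NodeType] = {
    "gpu_accelerated": NodeType.GPU,
    "mic_accelerated": NodeType.MIC,
    "memory_intensive": NodeType.BIGMEM,
}


def candidate_nodes(app_profile: str, nodes: List[Node]) -> List[str]:
    """Return names of free nodes whose type matches the policy for app_profile."""
    wanted = POLICY[app_profile]  # raises KeyError for an unknown profile
    return [n.name for n in nodes if n.free and n.node_type is wanted]
```

For example, given a free BIGMEM node `n143-1`, a busy BIGMEM node `n143-2` and a GPU node `n141-1`, a request with the hypothetical profile `"memory_intensive"` would yield only `n143-1`. A real policy would likely weigh several resource attributes rather than a single node type.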
A conventional resource management system for an HPC cluster environment is problematic in that it does not sufficiently take into consideration efficient resource management for a heterogeneous many-core-based HPC system which utilizes various performance acceleration apparatuses, such as a GPGPU, an MIC, and an FPGA, together with a general-purpose processor (CPU).
Meanwhile, with the development and improvement of hardware technology, the resources managed in each constituent node have become increasingly heterogeneous and powerful. That is, each constituent node may include a general-purpose processor and heterogeneous performance acceleration apparatuses, such as a GPGPU and an FPGA, having hundreds of cores, as well as high-capacity node memory of hundreds of gigabytes or more.
Furthermore, each socket, that is, each set of cores, has its own memory, and the sum of the capacities of these memories constitutes the node memory. Although all of this memory belongs to the same system, the cost for a core to access memory attached to another socket is relatively high because that memory is located at a greater distance. Accordingly, to achieve efficient execution and improved performance, it is effective to allocate memory attached to the socket of the core that executes the task. If the positions of, or the distances between, associated resources such as a processor and memory are not taken into consideration when an application is executed, performance deteriorates. That is, if resources are not efficiently allocated and managed in accordance with the characteristics of the application being executed, overall resource utilization deteriorates significantly and the execution performance of the application is not sufficiently guaranteed.
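The socket-local allocation principle can be illustrated with a small sketch. The two-socket layout, the core numbering, the free-memory figures, and the function names below are hypothetical; the distance matrix merely follows the convention of Linux NUMA distance tables, in which 10 denotes local access and larger values denote more distant memory.

```python
from typing import Optional, Sequence

# Hypothetical two-socket node: cores 0-3 belong to socket 0, cores 4-7 to socket 1.
CORES_PER_SOCKET = 4

# Hypothetical relative memory-access cost between sockets (10 = local access).
DISTANCE = ((10, 21),
            (21, 10))


def socket_of(core: int) -> int:
    """Map a core index to the socket that owns it."""
    return core // CORES_PER_SOCKET


def pick_memory_socket(core: int, needed_gb: int, free_gb: Sequence[int]) -> Optional[int]:
    """Choose the nearest socket (by DISTANCE) that still has needed_gb of free memory."""
    home = socket_of(core)
    # Visit sockets from nearest to farthest relative to the core's home socket.
    for sock in sorted(range(len(free_gb)), key=lambda t: DISTANCE[home][t]):
        if free_gb[sock] >= needed_gb:
            return sock
    return None  # no socket can satisfy the request
```

With 64 GB free on each socket, a 32 GB request from core 5 is served from socket 1, its local memory; if socket 0 has only 16 GB free, the same request from core 1 falls back to the more distant socket 1 rather than failing.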
Furthermore, the performance of a parallel program executed in a cluster environment, such as a Message Passing Interface (MPI) program, depends on the network data transfer rate. Accordingly, it is desirable to allocate adjacent nodes to the same application so as to minimize the communication costs between the allocated nodes. To achieve such allocation, it is necessary to check the network topology of all the nodes constituting the cluster and the communication costs between nodes, and to allocate nodes in consideration of that topology and those costs. In the conventional resource management system, however, the node topology and communication costs are not taken into consideration when resources for parallel programs are allocated.
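A minimal sketch of topology-aware node allocation follows, assuming a hypothetical two-level switch tree in which nodes on the same leaf switch communicate more cheaply than nodes on different leaf switches. The `SWITCH_OF` table, the node names, and the cost values are illustrative assumptions, and the exhaustive search shown is practical only for small pools of free nodes.

```python
from itertools import combinations
from typing import Dict, List, Sequence

# Hypothetical topology: node name -> leaf switch identifier.
SWITCH_OF: Dict[str, int] = {"a1": 0, "a2": 0, "a3": 0, "b1": 1, "b2": 1, "b3": 1}

# Hypothetical relative costs: same leaf switch vs. traversing the spine switch.
SAME_SWITCH_COST = 1
CROSS_SWITCH_COST = 3


def comm_cost(u: str, v: str) -> int:
    """Communication cost between two nodes under the assumed switch tree."""
    return SAME_SWITCH_COST if SWITCH_OF[u] == SWITCH_OF[v] else CROSS_SWITCH_COST


def group_cost(group: Sequence[str]) -> int:
    """Total pairwise cost, assuming all-to-all traffic among the job's nodes."""
    return sum(comm_cost(u, v) for u, v in combinations(group, 2))


def allocate(free_nodes: Sequence[str], k: int) -> List[str]:
    """Pick the k free nodes with minimal total pairwise cost (exhaustive search)."""
    if not 0 < k <= len(free_nodes):
        raise ValueError("cannot allocate k nodes from the free pool")
    best = min(combinations(sorted(free_nodes), k), key=group_cost)
    return list(best)
```

Allocating three nodes from the free pool a1, a2, b1, b2 and b3 under this model yields b1, b2 and b3, which share a leaf switch, rather than a mixed group that crosses the spine. For larger clusters, a greedy or switch-by-switch heuristic would replace the exhaustive search.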
Accordingly, with the conventional technology, resource utilization varies significantly from application to application, resulting in low computational performance efficiency. Furthermore, resources are operated inefficiently in multi-task scheduling because resource allocation in an environment in which computing resources with heterogeneous characteristics are mixed is not sufficiently taken into consideration.
Furthermore, an invention relating to the monitoring of resource utilization status (Korean Patent Application Publication No. 10-2010-0073120) has been disclosed to address the above problems, but its applicability to cluster systems in which various types of resources are mixed is limited.