The present invention relates to distributed computing technology, and more particularly to distributed parallel computation using a plurality of accelerator devices.
Recently, multi-processor computation using GPUs (Graphics Processing Units) has been widely used to enhance computation efficiency and/or computation speed. The GPUs are typically used as accelerators of a main CPU for enhancing the computation performance. Such a multi-processor computation architecture often uses a GPU network connected over internal buses such as PCI or PCI-Express. GPUs connected by such internal buses are herein referred to as tightly-coupled accelerator devices.
The GPUs are operated in parallel under the control of a host CPU, by means of a suitable programming language, to enhance the computation performance. One example of such a programming language is OpenCL. OpenCL may be applied to manage data transfer between the host CPU and the GPUs, and may be utilized by this invention to minimize the performance cost of that transfer.
Multi-processor computation architectures of another scheme are also known, for example, distributed computation or grid computation. These multi-processor computation architectures may include a plurality of servers or computers which share computations under control by a host computer or a master computer. In this type of multi-processor architecture, the computers are connected by an external bus network such as Ethernet (Trade Mark) through network interface cards using various physical connection protocols. The computers may support the entire computation executed within the network, and hence the computers responsible for the distributed computing may also be regarded as accelerators. However, the computers in such a distributed computation architecture are connected by the external network through TCP/IP, and hence the computers in the distributed computing system may be regarded as loosely-coupled accelerators.
In the loosely-coupled multi-processor system, the computers or nodes are connected by the external network, and hence data transfer between the host computer and the accelerator devices may be affected by transport conditions including data sizes, runtime implementation and network conditions.
Techniques for enhancing computation performance over a TCP network have also been developed; for example, US Patent Application Publication 2008/029098A1 discloses a computer system which dynamically segments a large TCP segment into smaller TCP segments so as to reduce interrupt frequency. JP2011-170732 discloses a parallel computation method which divides a functional block into strands and modifies the functional block depending on computation time.
In the tightly-coupled acceleration architecture, it has been proposed that batching many small transfers into one larger transfer improves the data transfer performance (see NVIDIA OpenCL Best Practice Guide, Section 3.1 “Data Transfer between Host and Device”). In addition, Kim et al. disclose, in “A New Communication and Computation Overlapping Model with Loop Sub-Partitioning and Dynamic Scheduling”, a communication and computation overlapping model to hide the communication latency in data parallel programs.
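The benefit of batching noted above may be illustrated by a simple cost model. The following is a minimal sketch only, assuming that each transfer costs a fixed per-transfer overhead plus a term proportional to its size; this linear model, the function names, and the numeric values are illustrative assumptions and are not taken from the cited references.

```python
# Illustrative cost model (an assumption, not from the cited references):
# one transfer costs a fixed overhead plus size / bandwidth.

def transfer_time(num_bytes, overhead_s, bandwidth_bytes_per_s):
    """Estimated time in seconds for a single transfer of num_bytes."""
    return overhead_s + num_bytes / bandwidth_bytes_per_s


def unbatched_time(sizes, overhead_s, bandwidth_bytes_per_s):
    """Each buffer is sent separately, so the overhead is paid once per buffer."""
    return sum(transfer_time(s, overhead_s, bandwidth_bytes_per_s) for s in sizes)


def batched_time(sizes, overhead_s, bandwidth_bytes_per_s):
    """All buffers are packed into one transfer, so the overhead is paid once."""
    return transfer_time(sum(sizes), overhead_s, bandwidth_bytes_per_s)


# Hypothetical example values: 1000 buffers of 4 KiB each,
# 10 microseconds of overhead per transfer, 4 GB/s of bandwidth.
sizes = [4096] * 1000
overhead = 10e-6
bandwidth = 4e9

separate = unbatched_time(sizes, overhead, bandwidth)
combined = batched_time(sizes, overhead, bandwidth)

if __name__ == "__main__":
    print(f"separate transfers: {separate:.6f} s")
    print(f"one batched transfer: {combined:.6f} s")
```

Under this model, batching n transfers saves (n - 1) times the per-transfer overhead, while the bandwidth-dependent term is unchanged. In a loosely-coupled system the overhead and the bandwidth may themselves vary with network conditions, which this sketch does not attempt to capture.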