To improve the processing performance of a computer system, one known approach is to provide, in addition to the processor governing the main processing, a coprocessor implemented as hardware that executes operations specialized for a particular field at high speed. A known example of such a coprocessor is a GPGPU (General Purpose Graphic Processing Unit). A GPGPU is a GPU, originally intended for graphics, that is adapted for general-purpose numerical calculation. Typical products include Tesla (registered trademark, NVIDIA Corp.) and Radeon (registered trademark, AMD Inc.). In general, a GPGPU cannot be used alone and is always used in combination with a CPU (Central Processing Unit). More specifically, data is first loaded from an external device into a main memory, the CPU then starts processing, and part of the processing is off-loaded to the GPGPU. The data processed by the GPGPU is stored in the main memory again. However, when data from the external device is transferred to the GPGPU via the main memory, the overhead of the data transfer becomes large.
In this regard, JP 2010-272066 A (Patent Document 1) discloses an example of a tightly coupled multiprocessor system in which the overhead of data exchange between an external device and a coprocessor such as a GPGPU is reduced. The tightly coupled multiprocessor system disclosed in Patent Document 1 includes a main processor having a plurality of processor cores, a main memory, an input/output interface circuit for connecting to an external device, and a processor element (see FIG. 1 of Patent Document 1, for example).
The processor cores included in the main processor are connected via an internal bus or a crossbar switch. Further, the main processor is connected with the main memory via a memory bus, and is connected with the input/output interface circuit and the processor element via external interfaces such as PCI Express.
The processor element is a coprocessor which operates according to instructions from the processor cores. The processor element includes a local memory for processing a large quantity of data. The local memory is directly accessible from the processor element and from each processor core. Further, a large amount of data can be transferred by DMA (Direct Memory Access) between the local memory and the input/output interface circuit, which allows connection with an external device.
In Patent Document 1, in order to further improve the operational performance, a plurality of processor elements are connected with the main processor via an external interface (see FIG. 3 of Patent Document 1, for example).
Patent Document 1: JP 2010-272066 A
As described in Patent Document 1, by transferring data directly between the local memory of a coprocessor such as a GPGPU and the input/output interface circuit used for connection with an external device, without passing through the main memory, it is possible to reduce the latency of data transfer between the external device and the local memory of the coprocessor.
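The latency difference between the two transfer paths can be illustrated with a rough model. The following is a minimal sketch only, not something taken from Patent Document 1: the `Link` class, the `transfer_time` function, and every latency and bandwidth figure are hypothetical assumptions. It also assumes that each hop completes before the next one starts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One hop of a transfer path. All figures are illustrative assumptions."""
    name: str
    setup_latency_s: float  # fixed per-transfer cost in seconds
    bandwidth_bps: float    # sustained bandwidth in bytes per second


def transfer_time(path, size_bytes):
    """Total time when each hop must finish before the next hop starts."""
    return sum(link.setup_latency_s + size_bytes / link.bandwidth_bps
               for link in path)


# Hypothetical hops; identical figures are used so that only the hop count differs.
io_to_main = Link("I/O interface -> main memory", 5e-6, 8e9)
main_to_local = Link("main memory -> local memory", 5e-6, 8e9)
io_to_local = Link("I/O interface -> local memory (DMA)", 5e-6, 8e9)

staged_path = [io_to_main, main_to_local]  # via the main memory
direct_path = [io_to_local]                # bypassing the main memory

size = 64 * 1024 * 1024  # 64 MiB payload, an arbitrary example size
staged = transfer_time(staged_path, size)
direct = transfer_time(direct_path, size)
print(f"staged: {staged * 1e3:.3f} ms, direct: {direct * 1e3:.3f} ms")
```

Under these assumptions the staged path costs one extra hop, so the direct path takes roughly half the time. In a real system the hops may overlap or have different bandwidths, and the ratio would differ accordingly.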
However, when the number of coprocessors is increased in order to improve the performance, a sufficient improvement cannot be expected from simply adding coprocessors as illustrated in FIG. 3 of Patent Document 1. This is because the coprocessors share the same input/output interface circuit, so that the transfer rate available to each coprocessor decreases as the number of coprocessors increases.
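The effect of sharing one input/output interface circuit can be sketched numerically. This is a simplified model under stated assumptions, not an analysis from Patent Document 1: the function names and all figures are hypothetical, the shared bandwidth is assumed to be divided evenly among the coprocessors, the data is assumed to be split evenly, and transfer and computation are assumed not to overlap.

```python
def per_coprocessor_rate(shared_bandwidth_bps, num_coprocessors):
    """Transfer rate seen by each coprocessor when one interface is shared evenly."""
    if num_coprocessors < 1:
        raise ValueError("num_coprocessors must be at least 1")
    return shared_bandwidth_bps / num_coprocessors


def job_time(total_bytes, compute_s_per_byte, shared_bandwidth_bps, n):
    """Time for n coprocessors to receive and process an evenly split data set."""
    share = total_bytes / n
    # Each share is smaller, but each coprocessor's rate drops by the same factor,
    # so the transfer term does not shrink as n grows.
    transfer = share / per_coprocessor_rate(shared_bandwidth_bps, n)
    compute = share * compute_s_per_byte
    return transfer + compute


# Hypothetical workload: 1 GB of data, 1 ns of computation per byte, 8 GB/s interface.
TOTAL_BYTES = 1e9
COMPUTE_S_PER_BYTE = 1e-9
SHARED_BANDWIDTH_BPS = 8e9

baseline = job_time(TOTAL_BYTES, COMPUTE_S_PER_BYTE, SHARED_BANDWIDTH_BPS, 1)
for n in (1, 2, 4, 8, 16):
    t = job_time(TOTAL_BYTES, COMPUTE_S_PER_BYTE, SHARED_BANDWIDTH_BPS, n)
    print(f"n={n:2d}  time={t:.4f} s  speed-up={baseline / t:.2f}")
```

In this model only the computation term scales with the number of coprocessors. The transfer term stays at about `TOTAL_BYTES / SHARED_BANDWIDTH_BPS` regardless of `n`, so the achievable speed-up is bounded by the ratio of the single-coprocessor time to that transfer time.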