Server systems that include a main processor and a sub processor, such as a General-Purpose Graphics Processing Unit (GPGPU), are in widespread use. Such a server system is frequently used to achieve high performance, that is, to shorten the execution time (hereinafter called ‘latency’) in which a program processes a single input datum or a set of input data corresponding to one unit of processing.
To shorten the latency of a program in such a system, a scheme (method) in which the sub processor executes one or more partial programs included in a main processor program is used in some cases (this scheme is called the ‘offload scheme’). A program that is intended to be executed by the main processor is called a ‘main program’. A portion of the main program (a partial program) that is executed by the sub processor according to the offload scheme is called an ‘offload part’ or ‘offload program’. The act of the main processor making the sub processor execute a program is called ‘offloading’. The act of the main processor designating a program to offload, that is, designating an offload part, is called ‘offload-designating’.
In general, the offload scheme is realized by carrying out the following three procedures.
1) The main processor transfers, to the sub processor, the data that the sub processor needs in order to execute the offload part. The program code of the offload part is either transferred at the same time or stored in advance in a predetermined storage apparatus as a sub processor program.
2) The sub processor executes the offload part.
3) The sub processor transfers a result of executing the offload part to the main processor.
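The three procedures above can be illustrated with a minimal sketch. It is a plain-Python simulation, not an implementation for any real accelerator: the class `SubProcessor`, its methods, and the function `square_all` are hypothetical names introduced only for illustration, and list copies stand in for data transfers.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SubProcessor:
    """Hypothetical stand-in for a sub processor with its own memory."""
    memory: Dict[str, List[int]] = field(default_factory=dict)
    programs: Dict[str, Callable[[List[int]], List[int]]] = field(default_factory=dict)

    def store_program(self, name: str, code: Callable[[List[int]], List[int]]) -> None:
        # The offload part's code is stored in advance as a sub processor program.
        self.programs[name] = code

    def receive(self, name: str, data: List[int]) -> None:
        # Procedure 1: the main processor transfers input data (modeled as a copy).
        self.memory[name] = list(data)

    def execute(self, program: str, data_name: str, result_name: str) -> None:
        # Procedure 2: the sub processor executes the offload part.
        self.memory[result_name] = self.programs[program](self.memory[data_name])

    def send(self, name: str) -> List[int]:
        # Procedure 3: the result is transferred back (modeled as a copy).
        return list(self.memory[name])


def offload(sub: SubProcessor, program: str, data: List[int]) -> List[int]:
    """Run one offload part on the sub processor and return its result."""
    sub.receive("in", data)
    sub.execute(program, "in", "out")
    return sub.send("out")


def square_all(values: List[int]) -> List[int]:
    # Hypothetical offload part: square every element.
    return [v * v for v in values]


sub = SubProcessor()
sub.store_program("square_all", square_all)
result = offload(sub, "square_all", [1, 2, 3])
print(result)  # [1, 4, 9]
```

In this sketch the program code is stored in advance; transferring it together with the data in procedure 1 would be the other option mentioned above.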
For the offload scheme to shorten the latency of the whole program, the latency of the offload part when executed by the sub processor must be shorter than its latency when executed by the main processor. In general, the range of the offload part within the main program is designated by the person who develops the main program (hereinafter simply called the ‘program developing person’).
In order to shorten the latency, the program developing person determines the offload part in consideration of both the latency-shortening effect obtained by using the sub processor and the time required for transferring data.
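The trade-off that the program developing person weighs can be written as a simple comparison. The following is a minimal sketch under the assumption that the transfers and the sub processor execution happen one after another without overlap; the function names and the millisecond figures are hypothetical.

```python
def offload_latency(transfer_in_ms: int, sub_exec_ms: int, transfer_out_ms: int) -> int:
    """Latency of an offload part when offloaded, assuming no overlap
    between the data transfers and the sub processor execution."""
    return transfer_in_ms + sub_exec_ms + transfer_out_ms


def should_offload(main_exec_ms: int, transfer_in_ms: int,
                   sub_exec_ms: int, transfer_out_ms: int) -> bool:
    """Offloading shortens latency only if the sub processor execution
    plus the transfers is shorter than execution on the main processor."""
    return offload_latency(transfer_in_ms, sub_exec_ms, transfer_out_ms) < main_exec_ms


# Hypothetical figures: the sub processor is 5x faster in both cases,
# but only the first case has transfers small enough to pay off.
print(should_offload(main_exec_ms=10, transfer_in_ms=3, sub_exec_ms=2, transfer_out_ms=1))  # True
print(should_offload(main_exec_ms=10, transfer_in_ms=6, sub_exec_ms=2, transfer_out_ms=4))  # False
```

The second call shows why a faster sub processor alone is not sufficient: the transfer time can cancel the latency-shortening effect.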
In many cases, the program developing person designates the offload part of the main program by embedding, in the main program, an instruction statement that designates the range of the offload part and the data to be transferred. To designate the data to be transferred, the program developing person must analyze both the data that the sub processor requires to process the offload part and the data that the sub processor transfers to the main processor after processing the offload part. Since such data analysis is generally difficult, it is difficult for the program developing person to designate a desired range of the main program as the offload part. However, for some processes, such as an input process that receives the data on which the program is executed and an output process that outputs the result of executing the program, analysis of the data to be transferred is easy.
On the other hand, in the case that parallel processes are carried out by executing a plurality of programs simultaneously for a plurality of input data with the offload scheme, the system is required to provide not only short latency but also a large amount of data processed per unit time (hereinafter called ‘throughput’). In order to acquire high throughput, it is important to use the resources of the main processor and the sub processor efficiently. However, in order to use the resources of the processors with full efficiency, the ratio between the quantities of resource that the program uses on the respective processors must coincide with the ratio between the quantities of resource available on the respective processors. Therefore, when a plurality of processors simultaneously execute a program that was created without consideration of parallel operation, the resources cannot be used efficiently. Accordingly, high throughput cannot be acquired even if such a program is executed in parallel.
FIG. 19 shows an example of parallel operation in which a processor resource is left unused and, consequently, high throughput cannot be acquired. In this example, a host processor and an accelerator, which supports the processing of the host processor as the sub processor, are provided. It is assumed that the quantity of resource of the accelerator is larger than that of the host processor.
Executing a partial program designated as the offload part requires both the resource of the host processor and the resource of the accelerator. That is, when the offload part is executed for one input datum, the resource of the host processor and the resource of the accelerator are each used in the same quantity. Consequently, if the quantity of resource differs between the host processor and the accelerator, the following problem arises. When a program that uses the same quantity of host processor resource and accelerator resource is executed for a plurality of input data, the resource of the host processor is exhausted while resource of the accelerator is left unused. As a result, even though accelerator resource remains, the program cannot be executed for more input data than are being processed at that time.
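The problem described above can be reproduced with a few lines of arithmetic. The following is a minimal sketch in which resources are counted in abstract integer units; the capacities (4 units on the host processor, 16 on the accelerator) and the function names are hypothetical and are not taken from FIG. 19.

```python
from typing import Tuple


def max_concurrent_inputs(host_capacity: int, accel_capacity: int,
                          host_per_input: int, accel_per_input: int) -> int:
    """Number of input data that can be processed at the same time.
    The scarcer resource (relative to per-input demand) sets the limit."""
    return min(host_capacity // host_per_input, accel_capacity // accel_per_input)


def unused_resources(host_capacity: int, accel_capacity: int,
                     host_per_input: int, accel_per_input: int) -> Tuple[int, int]:
    """Resource units left idle on (host, accelerator) at maximum concurrency."""
    n = max_concurrent_inputs(host_capacity, accel_capacity,
                              host_per_input, accel_per_input)
    return (host_capacity - n * host_per_input,
            accel_capacity - n * accel_per_input)


# Each input uses the same quantity on both processors: the host is
# exhausted after 4 inputs and 12 accelerator units are left unused.
print(max_concurrent_inputs(4, 16, 1, 1))  # 4
print(unused_resources(4, 16, 1, 1))       # (0, 12)

# If the usage ratio matched the availability ratio (1:4), nothing is left.
print(unused_resources(4, 16, 1, 4))       # (0, 0)
```

The last call corresponds to the condition stated earlier, in which the ratio of resource used by the program coincides with the ratio of available resource.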
In general, when a plurality of programs, each of which includes an offload part, are executed in parallel, it is not easy to use the processor resources effectively. Therefore, various arts have been disclosed with respect to selection of the offload part and allotment of the processor resources to the programs.
There is a method of determining which processor should execute each of the loops included in input software (for example, refer to patent literature 1). In the art described in patent literature 1, the time required for transferring data to the accelerator is measured, and a win-loss table, which indicates which of the host processor and the accelerator has the shorter execution time, is generated. Then, a loop to be offloaded is determined based on the win-loss table, and the input software is converted so that the loop is offloaded.
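The idea of a win-loss table can be sketched as follows. This is only an illustration of the general idea and not the procedure of patent literature 1, whose details are not given here: the comparison rule (accelerator time plus transfer time against host time), the loop names, and all timing figures are assumptions.

```python
from typing import Dict, List

# Hypothetical per-loop measurements in milliseconds.
measurements: Dict[str, Dict[str, int]] = {
    "loop_a": {"host_ms": 40, "accel_ms": 5, "transfer_ms": 10},
    "loop_b": {"host_ms": 8, "accel_ms": 2, "transfer_ms": 12},
    "loop_c": {"host_ms": 20, "accel_ms": 10, "transfer_ms": 10},
}


def build_win_loss_table(data: Dict[str, Dict[str, int]]) -> Dict[str, str]:
    """For each loop, record which processor wins on execution time.
    Assumed rule: the accelerator wins only if its execution time plus
    the measured transfer time is strictly shorter than the host time."""
    table = {}
    for name, m in data.items():
        accel_total = m["accel_ms"] + m["transfer_ms"]
        table[name] = "accelerator" if accel_total < m["host_ms"] else "host"
    return table


def select_offload_targets(table: Dict[str, str]) -> List[str]:
    """Loops for which the accelerator wins become offload targets."""
    return [name for name, winner in table.items() if winner == "accelerator"]


table = build_win_loss_table(measurements)
targets = select_offload_targets(table)
print(table)    # {'loop_a': 'accelerator', 'loop_b': 'host', 'loop_c': 'host'}
print(targets)  # ['loop_a']
```

Here `loop_b` runs faster on the accelerator in isolation but loses once the transfer time is added, and `loop_c` is a tie that the assumed rule resolves in favor of the host processor.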
Moreover, there is a method of allotting the processor resources to each of a plurality of programs according to the classification of the program (for example, refer to patent literature 2). In the art described in patent literature 2, real-time programs and non-real-time programs are separated and each is allotted resources, so that the system can execute a plurality of programs without any one program occupying all of the resources.