The present invention relates to an arithmetic and control technique of a parallel processor.
In recent years, from the need to suppress the heating of the processor, there has been a pronounced tendency to improve performance by increasing the number of processor cores (hereinafter referred to merely as “cores”) that conduct processing in parallel, instead of increasing the operating frequency of the processor. Processors each having a plurality of cores are called “multicore processors”, and, among the multicore processors, those each having an especially large number of cores are called “many-core processors”. In the present specification, no particular distinction is made between the multicore processors and the many-core processors, and processors each having a plurality of cores that conduct processing in parallel are generally called “parallel processors”.
The parallel processors have been used in a variety of fields as accelerators. However, a variety of accelerators have been manufactured depending on the manufacturers or the fields, and languages and frameworks for the accelerators have been also variously developed. This makes it difficult to port program codes between the accelerators.
In order to solve this problem, OpenCL (Open Computing Language) has been established as a standard framework for the parallel processor (The OpenCL Specification, Ver: 1.0, Document Revision: 43, Khronos OpenCL Working Group (2009)). An outline of the OpenCL will be described.
FIG. 19 illustrates a platform model of a typical OpenCL system in which reference numerals are added to FIG. 3.1 in The OpenCL Specification, Ver: 1.0, Document Revision: 43, Khronos OpenCL Working Group (2009).
As illustrated in FIG. 19, an OpenCL system 10 includes a host 12 and one or more compute devices (hereinafter referred to as “OpenCL devices”) 14. The OpenCL devices 14 correspond to the above-mentioned accelerators.
Each of the OpenCL devices 14 has one or more compute units (hereinafter referred to as “CUs”) 16, and each of the CUs 16 has one or more processing elements (hereinafter referred to as “PEs”) 18. The PEs 18 correspond to the above-mentioned cores.
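The containment relationship of the platform model described above (host, OpenCL devices, CUs, and PEs) can be sketched as a small data structure. The following is only an illustrative sketch: the class names, the helper `build_device`, and the counts of 2 CUs and 4 PEs per CU are assumptions chosen for the example and do not appear in the OpenCL specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProcessingElement:
    """One PE 18 (a core)."""
    pe_id: int


@dataclass
class ComputeUnit:
    """One CU 16, holding one or more PEs."""
    cu_id: int
    pes: List[ProcessingElement] = field(default_factory=list)


@dataclass
class OpenCLDevice:
    """One OpenCL device 14 (an accelerator), holding one or more CUs."""
    device_id: int
    cus: List[ComputeUnit] = field(default_factory=list)

    def pe_count(self) -> int:
        # Total number of PEs over all CUs of this device.
        return sum(len(cu.pes) for cu in self.cus)


@dataclass
class Host:
    """The host 12, connected to one or more OpenCL devices."""
    devices: List[OpenCLDevice] = field(default_factory=list)


def build_device(device_id: int, num_cus: int, pes_per_cu: int) -> OpenCLDevice:
    # Hypothetical helper: builds a device with a uniform number of PEs per CU.
    cus = [
        ComputeUnit(cu_id, [ProcessingElement(pe_id) for pe_id in range(pes_per_cu)])
        for cu_id in range(num_cus)
    ]
    return OpenCLDevice(device_id, cus)


# Assumed example sizes: one device with 2 CUs of 4 PEs each.
host = Host(devices=[build_device(0, num_cus=2, pes_per_cu=4)])
print(host.devices[0].pe_count())  # 8
```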
An OpenCL application includes a program code that operates on the host 12 side, and program codes that operate in the OpenCL devices 14, that is, on the accelerator side. The program code that operates on the host 12 side is called “host code”, and each of the program codes that operate on the OpenCL devices 14 side is called “Kernel”.
The host 12 calls an API (application program interface) for instruction of arithmetic operation. Each of the OpenCL devices 14 executes the instructed arithmetic operation. The host 12 generates a context for managing resources, and also generates command queues for adjusting device operation through the OpenCL. The “device operation” includes the execution of arithmetic operation, the operation of a memory, and synchronization.
In the OpenCL, the Kernel is executed in an N-dimensional index space (1≤N≤3) as a work-item (hereinafter called “item” for short). For example, if (4, 6) is designated as the two-dimensional index space, 24 items of 4×6 in total are executed.
One PE is used for execution of one item. Accordingly, if the number of PEs that actually exist is identical with the number of items to be executed in parallel, the Kernel in the above example is executed on 24 PEs of 4×6 in total.
If the number of existent PEs is smaller than the number of items to be executed in parallel, the parallel execution of the items is repeated on the existent PEs. If there are, for example, only 6 PEs of 2×3 in total, when the above-described index space of (4, 6) is designated, it is necessary to repeat the parallel execution of 6 items by the 6 PEs four times.
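The arithmetic in the above example can be checked with a short sketch. The function names `enumerate_items` and `passes_needed` are hypothetical and are introduced only for this illustration; the sketch merely counts items and passes and does not model how an actual OpenCL implementation schedules the items.

```python
import itertools
import math


def enumerate_items(index_space):
    """Return every index tuple of an N-dimensional index space (1 <= N <= 3)."""
    return list(itertools.product(*(range(n) for n in index_space)))


def passes_needed(index_space, num_pes):
    """Number of times the parallel execution must be repeated on num_pes PEs."""
    total_items = math.prod(index_space)
    return math.ceil(total_items / num_pes)


items = enumerate_items((4, 6))
print(len(items))                  # 24 items of 4 x 6 in total
print(passes_needed((4, 6), 24))   # 1 pass when 24 PEs exist
print(passes_needed((4, 6), 6))    # 4 passes when only 6 PEs (2 x 3) exist
```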
Also, in the OpenCL, a concept of a work group is introduced. The work group is an assembly of items that are executed on the same CU 16 and associated with each other. The respective items within the same work group execute the same Kernel, and share a local memory of the CU 16, which will be described later.
Unique group IDs are allocated to the respective work groups, and the items within each work group have unique local IDs allocated thereto within the work group. Unique global IDs are also allocated to the items. Each item can be identified either by its global ID or by the combination of its group ID and its local ID.
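The relationship among the three kinds of IDs can be sketched as follows, assuming that the index space starts at zero in every dimension, so that the global ID in each dimension equals the group ID multiplied by the work group size plus the local ID. The work group size of (2, 3) and the function names are assumptions made only for this example.

```python
import itertools


def global_id(group_id, local_id, local_size):
    """Combine a group ID and a local ID into a global ID, per dimension."""
    return tuple(g * s + l for g, l, s in zip(group_id, local_id, local_size))


def split_global_id(gid, local_size):
    """Recover (group ID, local ID) from a global ID, per dimension."""
    group = tuple(g // s for g, s in zip(gid, local_size))
    local = tuple(g % s for g, s in zip(gid, local_size))
    return group, local


# Assumed sizes: index space (4, 6) divided into work groups of (2, 3) items,
# which gives (2, 2) work groups.
local_size = (2, 3)
num_groups = (2, 2)

all_global_ids = {
    global_id(group, local, local_size)
    for group in itertools.product(*(range(n) for n in num_groups))
    for local in itertools.product(*(range(n) for n in local_size))
}

print(global_id((1, 1), (1, 2), local_size))   # (3, 5)
print(split_global_id((3, 5), local_size))     # ((1, 1), (1, 2))
print(len(all_global_ids))                     # 24 distinct global IDs
```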
A process for allowing the OpenCL devices 14 to conduct arithmetic processing is configured by calling the API in the following step order.
<Step 1>: Data to be referred to in the arithmetic processing (hereinafter referred to as “reference data”) and the Kernel are transferred from the host 12 to the OpenCL devices 14.
<Step 2>: The Kernel starts to be executed on each of the OpenCL devices 14 in response to “Kernel start command”.
<Step 3>: After completion of the execution of the Kernel in the OpenCL device 14, result data of the arithmetic processing is transferred to the host 12 side from the memory space of the OpenCL device 14.
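The order of Steps 1 to 3 can be illustrated by the following toy simulation written in plain Python. It is not the OpenCL API: `ToyDevice`, its methods, and `square_kernel` are hypothetical names invented for this sketch, and the items are executed one after another rather than in parallel. In an actual OpenCL host code, the corresponding operations would be issued through API calls such as `clEnqueueWriteBuffer`, `clEnqueueNDRangeKernel`, and `clEnqueueReadBuffer`.

```python
class ToyDevice:
    """Hypothetical stand-in for an OpenCL device 14 with its own memory space."""

    def __init__(self):
        self.memory = {}    # named buffers in the device memory space
        self.kernel = None  # Kernel transferred from the host

    def write_buffer(self, name, data):
        # Step 1 (data): copy reference data from the host into device memory.
        self.memory[name] = list(data)

    def load_kernel(self, kernel):
        # Step 1 (code): transfer the Kernel to the device.
        self.kernel = kernel

    def run_kernel(self, num_items, in_name, out_name):
        # Step 2: execute the Kernel once per item (sequentially in this toy).
        src = self.memory[in_name]
        dst = [None] * num_items
        for item_id in range(num_items):
            self.kernel(item_id, src, dst)
        self.memory[out_name] = dst

    def read_buffer(self, name):
        # Step 3: copy result data from device memory back to the host.
        return list(self.memory[name])


def square_kernel(item_id, src, dst):
    # Each item processes the element selected by its own ID.
    dst[item_id] = src[item_id] * src[item_id]


device = ToyDevice()
device.write_buffer("input", [1, 2, 3, 4])   # Step 1
device.load_kernel(square_kernel)            # Step 1
device.run_kernel(4, "input", "output")      # Step 2
result = device.read_buffer("output")        # Step 3
print(result)  # [1, 4, 9, 16]
```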
A configuration of the OpenCL device 14 including the memory space will be described with reference to FIG. 20. In FIG. 20, reference numerals are added to FIG. 3.3 in “The OpenCL Specification, Ver: 1.0, Document Revision: 43, Khronos OpenCL Working Group (2009)”. As described above, each of the OpenCL devices 14 includes one or more CUs 16, and each of the CUs 16 has one or more PEs 18.
In the execution of the Kernel in the above-described Step 2, four different memories may be accessed in each of the OpenCL devices 14. Those four memories include private memories 20, local memories 22, a global memory 32, and a constant memory 34. Those four memories will be described with reference to FIG. 21 from the viewpoint of the items and the work groups. FIG. 21 illustrates Table 3.1 in “The OpenCL Specification, Ver: 1.0, Document Revision: 43, Khronos OpenCL Working Group (2009)”.
Each of the private memories 20 corresponds to one item, and is used only for execution of that item. A variable defined in the private memory 20 corresponding to one item cannot be used by the other items.
Each of the local memories 22 corresponds to one group, and can be shared by the respective items within the group. For that reason, as an intended purpose of the local memories 22, for example, the variables shared by the respective items within the group are allocated to the local memory 22.
The global memory 32 and the constant memory 34 can be accessed from all of the items within all of the groups. The global memory 32 can be accessed for both of read and write from the items. On the other hand, the constant memory 34 can be accessed for only read from the items. Hereinafter, the global memory 32 and the constant memory 34 are collectively called “device memory 30”.
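The access rules of the four memories as seen from the items can be summarized by the following sketch. The function `can_access` and the representation of an item as a pair of (group ID, local ID) are assumptions made only for this illustration; the sketch encodes only the rules stated above and not the full content of Table 3.1 of the specification.

```python
def can_access(region, accessor, owner=None, write=False):
    """Return True if the item `accessor` may access the given memory region.

    accessor: (group_id, local_id) of the item requesting the access.
    owner:    for "private", the (group_id, local_id) of the owning item;
              for "local", the group_id of the owning work group;
              unused for "global" and "constant".
    write:    True for a write access, False for a read access.
    """
    if region == "private":
        # Usable only by the one item to which the private memory corresponds.
        return accessor == owner
    if region == "local":
        # Shared by all items within the same work group.
        return accessor[0] == owner
    if region == "global":
        # Readable and writable from all items within all groups.
        return True
    if region == "constant":
        # Readable from all items, but never writable from an item.
        return not write
    raise ValueError(f"unknown memory region: {region}")


item_a = (0, 0)  # group 0, local ID 0
item_b = (0, 1)  # same group, different item
item_c = (1, 0)  # different group

print(can_access("private", item_b, owner=item_a))    # False
print(can_access("local", item_b, owner=0))           # True
print(can_access("local", item_c, owner=0))           # False
print(can_access("global", item_c, write=True))       # True
print(can_access("constant", item_c, write=True))     # False
```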
Based on the one-to-one relationship between the items and the PEs 18, the correspondence relationship between the above four different memories and the CUs 16 and the PEs 18 will be described below.
The private memories 20 correspond one-to-one to the PEs 18, and can be accessed by only the corresponding PEs 18.
The local memories 22 correspond one-to-one to the CUs 16, and can be accessed by all of the PEs 18 within the corresponding CUs 16.
The device memory 30 can be accessed by all of the PEs 18 within all of the CUs 16, that is, all of the PEs within the OpenCL devices 14.
Also, depending on the OpenCL device 14, a cache 24 that functions as a cache memory of the device memory 30 may be further provided.
Thus, each of the OpenCL devices 14 is equipped with a plurality of memories different in hierarchy. Those memories can be accessed from the PEs at a higher speed as the hierarchy is higher. The hierarchy becomes higher in the order of the device memory 30 (lowest), the local memories 22 (middle), and the private memories 20 (highest), and the access speed from the PEs becomes higher in the same order.
In order to sufficiently bring out the performance of the OpenCL devices 14, it is necessary to devise data movement between the device memory 30 and the private memories 20 or the local memories 22, for example, such that data higher in use frequency moves to a higher-speed memory space, and then is referred to.
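The benefit of moving frequently used data to a higher-speed memory space can be illustrated with a simple cost model. The access costs below are hypothetical numbers chosen only to show the trend and are not measurements of any actual device; the function names are likewise invented for this sketch. The model assumes that every item within one work group reads the same shared data.

```python
GLOBAL_COST = 100  # assumed cost of one access to the device memory 30
LOCAL_COST = 10    # assumed cost of one access to the local memory 22


def cost_without_staging(num_items, reads_per_item):
    """Every read of the shared data goes to the device memory."""
    return num_items * reads_per_item * GLOBAL_COST


def cost_with_staging(num_items, reads_per_item, shared_elements):
    """Shared data is first copied once into the local memory, then read there."""
    # One device-memory read and one local-memory write per copied element.
    copy_cost = shared_elements * (GLOBAL_COST + LOCAL_COST)
    return copy_cost + num_items * reads_per_item * LOCAL_COST


# Assumed workload: 6 items in one work group, each reading all 16 shared elements.
print(cost_without_staging(6, 16))     # 9600
print(cost_with_staging(6, 16, 16))    # 2720
```

Under these assumed costs, the one-time copy pays off because the same 16 elements are reused by all 6 items; with little or no reuse, the copy itself would dominate and the staging would not help.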
Even in the case of a sequential processor different in the control system from the OpenCL device, the data movement is conducted between a global memory space and a private memory space. The data movement will be described with reference to an example of the sequential processor illustrated in FIG. 22.
A sequential processor 50 illustrated in FIG. 22 includes a PE 52 which is an arithmetic element, a private memory 54, a global memory 56, and a cache control mechanism 58.
As illustrated in FIG. 22, a storage device of the sequential processor 50 is classified into the private memory 54 and the global memory 56. The private memory 54 is physically an on-chip low-capacity memory, and the global memory 56 is physically an off-chip high-capacity memory.
Although the storage device of the sequential processor 50 is classified into the private memory 54 and the global memory 56, the data movement between the private memory 54 and the global memory 56 is automatically conducted by the cache control mechanism 58 disposed between the private memory 54 and the global memory 56, and a user of the sequential processor 50 sees only one large memory space. That is, the user of the sequential processor 50 can easily develop a user program allowing the PE 52 to conduct the arithmetic processing without planning how to move data between the global memory 56 and the private memory 54.
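The automatic data movement can be sketched as follows. The class `CachedMemory` is a hypothetical model in which the user program only calls `read` with an address, while the movement between the two memories happens inside the class. The least-recently-used replacement policy and the capacity of 4 entries are assumptions made for the example; the actual policy of the cache control mechanism 58 is not specified here.

```python
from collections import OrderedDict


class CachedMemory:
    """Toy model of a private memory fronting a global memory via cache control."""

    def __init__(self, global_memory, capacity):
        self.global_memory = global_memory  # large, slow memory
        self.capacity = capacity            # number of entries in the private memory
        self.private = OrderedDict()        # small, fast memory (address -> value)
        self.hits = 0
        self.misses = 0

    def read(self, address):
        # The caller sees one flat address space; movement is handled here.
        if address in self.private:
            self.hits += 1
            self.private.move_to_end(address)      # mark as most recently used
        else:
            self.misses += 1
            if len(self.private) >= self.capacity:
                self.private.popitem(last=False)   # evict least recently used entry
            self.private[address] = self.global_memory[address]
        return self.private[address]


memory = CachedMemory(global_memory=list(range(100, 116)), capacity=4)
values = [memory.read(address) for address in [0, 1, 0, 1, 2, 0]]
print(values)                       # [100, 101, 100, 101, 102, 100]
print(memory.hits, memory.misses)   # 3 3
```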