The present invention relates to a parallel processor and, specifically, to an arithmetic control technique of an OpenCL device.
Recently, there has been a marked trend toward improving performance by increasing the number of processor cores (which are hereinafter referred to simply as “cores”) that perform processing in parallel, rather than by increasing the operating frequency of a processor, because of the need to prevent overheating of the processor. A processor having a plurality of cores is called a multi-core processor, and a multi-core processor having a large number of cores is particularly called a many-core processor. In this specification, the multi-core processor and the many-core processor are not particularly distinguished from each other, and a processor including a plurality of cores that perform processing in parallel is generally referred to as “parallel processor”.
The parallel processor is used in various fields as an accelerator. However, because various types of accelerators are manufactured by various manufacturers and in various fields and further various languages and frameworks for accelerators are developed, sharing of a program code among accelerators is difficult.
To solve this problem, OpenCL (Open Computing Language) is defined as a standard framework for the parallel processor (Non Patent Literature 1: The OpenCL Specification, Ver:1.0, Document Revision:43, Khronos OpenCL Working Group (2009)). The overview of OpenCL is described hereinbelow.
FIG. 17 is a diagram where reference symbols are added to FIG. 3.1 in the above Non Patent Literature 1, which shows a platform model of a typical OpenCL system.
As shown in FIG. 17, an OpenCL system 10 includes a host 12 and one or more compute devices (which are referred to hereinafter as “OpenCL devices”) 14. The OpenCL devices 14 correspond to the accelerator described above.
Each of the OpenCL devices 14 includes one or more compute units (which are hereinafter abbreviated as “CU”) 16, and each of the CU 16 includes one or more processing elements (which are hereinafter abbreviated as “PE”) 18. Note that the processing elements PE18 correspond to the cores described above.
An OpenCL application is composed of a program code that runs on the host 12 and a program code that runs on the OpenCL devices 14, which are accelerators. The program code that runs on the host 12 is called “host code”, and the program code that runs on the OpenCL devices 14 is called “kernel”.
The host 12 calls an API (Application Program Interface) to specify arithmetic processing. The OpenCL devices 14 execute the specified arithmetic processing. The host 12 generates a context for management of resources and further generates a command queue for mediation of device operation through OpenCL. “Device operation” includes executing arithmetic processing, operating memories, achieving synchronization and the like.
In OpenCL, each instance of the kernel executed in an N-dimensional (1≦N≦3) index space is called a work-item (which is also referred to simply as “item”). For example, if (4,6) is specified as the two-dimensional index space, a total of 24 (4×6) items are executed.
To execute one item, one PE is used. Accordingly, in the case where the number of items executed in parallel and the number of available PE are the same, the kernel is executed on a total of 24 PE arranged in four columns by six rows.
Note that, in the case where the number of available PE is smaller than the number of items to be executed, parallel execution of items is repeated on the available PE. For example, in the case where the above-described (4,6) index space is specified when there are only six PE in total, arranged in two columns by three rows, parallel execution of six items needs to be repeated four times by the six PE.
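The arithmetic in the two examples above can be checked with a short sketch. The function name parallel_rounds is hypothetical and is introduced only for illustration; the sketch assumes that every round keeps all PE busy except possibly the last one.

```python
import math

def parallel_rounds(index_space, num_pe):
    """Return (total number of items, number of parallel rounds needed)."""
    total_items = math.prod(index_space)       # e.g. 4 x 6 = 24 items
    rounds = math.ceil(total_items / num_pe)   # each round runs at most num_pe items
    return total_items, rounds

print(parallel_rounds((4, 6), 24))  # (24, 1): one PE per item, a single round
print(parallel_rounds((4, 6), 6))   # (24, 4): six PE, four rounds
```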
Further, in OpenCL, the concept of work-group is introduced. The work group is a group of items that are executed on the same CU 16 and related to one another. The respective items in the same work group execute the same kernel and share a local memory, which is described later, of the CU 16.
A unique group ID is assigned to each work group, and a local ID that is unique in the work group is assigned to the item in each work group. Further, a unique global ID is also assigned to the item. The item can be identified by the global ID or a combination of the group ID and the local ID.
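The relation between the three IDs can be sketched as follows. This is a minimal illustration that assumes the OpenCL 1.0 relation in which, per dimension, the global ID equals the group ID multiplied by the work-group size plus the local ID. The work-group size (2,3) and the function names are assumptions chosen for the example.

```python
def global_id(group_id, local_id, local_size):
    """Combine a group ID and a local ID into a global ID, per dimension."""
    return tuple(g * s + l for g, l, s in zip(group_id, local_id, local_size))

def split_global_id(gid, local_size):
    """Recover (group ID, local ID) from a global ID, per dimension."""
    group = tuple(g // s for g, s in zip(gid, local_size))
    local = tuple(g % s for g, s in zip(gid, local_size))
    return group, local

# A (4,6) index space divided into work groups of size (2,3) (assumed size).
print(global_id((1, 1), (0, 2), (2, 3)))   # (2, 5)
print(split_global_id((2, 5), (2, 3)))     # ((1, 1), (0, 2))
```

Either representation identifies the same item, which is why the text treats the global ID and the (group ID, local ID) pair as interchangeable.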
A process for the OpenCL device 14 to perform arithmetic processing is implemented by calling API in the following sequential steps.
<Step 1>: Transfer reference data for arithmetic processing (which is referred to hereinafter as “reference data”) and a kernel from the host 12 to the OpenCL device 14.
<Step 2>: Start execution of the kernel on the OpenCL device 14 by “kernel start command”.
<Step 3>: After completing execution of the kernel on the OpenCL device 14, transfer result data of the arithmetic processing from the memory space of the OpenCL device 14 to the host 12.
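Steps 1 to 3 can be pictured with the toy model below. It is not the OpenCL API: the class ToyDevice, its methods, and the kernel add_one are hypothetical stand-ins, and the items are run one after another rather than in parallel.

```python
class ToyDevice:
    """Stand-in for an OpenCL device; not the real OpenCL API."""

    def __init__(self):
        self.memory = {}    # models the memory space of the device
        self.kernel = None

    def write(self, name, data):
        """Step 1 (data): copy reference data from the host to the device."""
        self.memory[name] = list(data)

    def load_kernel(self, kernel):
        """Step 1 (kernel): transfer the kernel to the device."""
        self.kernel = kernel

    def start_kernel(self, num_items):
        """Step 2: run the kernel once per item (serially in this toy model)."""
        for item_id in range(num_items):
            self.kernel(item_id, self.memory)

    def read(self, name):
        """Step 3: copy result data from the device back to the host."""
        return list(self.memory[name])

def add_one(item_id, mem):
    """Hypothetical kernel: each item processes one element."""
    mem["out"][item_id] = mem["in"][item_id] + 1

device = ToyDevice()
device.write("in", [1, 2, 3, 4])   # Step 1
device.write("out", [0] * 4)
device.load_kernel(add_one)
device.start_kernel(4)             # Step 2
result = device.read("out")        # Step 3
print(result)  # [2, 3, 4, 5]
```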
A configuration of the OpenCL device 14 including the memory space is described with reference to FIG. 18. FIG. 18 is a diagram where reference symbols are added to FIG. 3.3 in Non Patent Literature 1. As described earlier, the OpenCL device 14 includes one or more CU 16 and each CU 16 includes one or more PE 18.
In the execution of the kernel in Step 2 described above, access to four different memories can be made in the OpenCL device 14. The four memories are: a private memory 20, a local memory 22, a global memory 32, and a constant memory 34. First, those four memories are described on the basis of items and work groups with reference to FIG. 19. Note that FIG. 19 is Table 3.1 in Non Patent Literature 1.
The private memory 20 corresponds to one item and is used only for execution of that item. A variable that is defined for the private memory 20 corresponding to one item cannot be used for another item.
The local memory 22 corresponds to one group and can be shared by the items in the group. Thus, an example of use of the local memory 22 is to allocate a variable shared by the items in the group to the local memory 22.
The global memory 32 and the constant memory 34 can be accessed by all items in all groups. Note that, although the global memory 32 can be accessed for both read and write by items, the constant memory 34 can be accessed only for read by items. Hereinafter, the global memory 32 and the constant memory 34 are referred to collectively as a device memory 30.
From one-to-one correspondence between an item and the PE 18, the correspondence between the above-described four memories and the CU 16 and the PE 18 is as follows.
The private memory 20 corresponds one-to-one with the PE 18 and can be accessed only by the corresponding PE 18.
The local memory 22 corresponds one-to-one with the CU 16 and can be accessed by all PE 18 in the corresponding CU 16.
The device memory 30 can be accessed by all PE 18 in all CU 16, which are all PE in the OpenCL device 14.
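The access rules listed above can be expressed as a small predicate. This is only a sketch: the PE class, the function can_access, and the string labels for the memory kinds are hypothetical names, and the read-only restriction on the constant memory is not modeled.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PE:
    cu: int      # index of the CU that contains this PE
    index: int   # index of the PE inside its CU

def can_access(pe, kind, owner=None):
    """Return True if `pe` may access a memory of the given kind.

    owner is the PE that owns a private memory, or the CU index that
    owns a local memory; it is ignored for the device memory.
    """
    if kind == "private":
        return owner == pe       # one-to-one with a single PE
    if kind == "local":
        return owner == pe.cu    # shared by all PE of one CU
    if kind == "device":
        return True              # visible to all PE in all CU
    raise ValueError(f"unknown memory kind: {kind}")

pe = PE(cu=0, index=1)
print(can_access(pe, "private", owner=PE(0, 0)))  # False
print(can_access(pe, "local", owner=0))           # True
print(can_access(pe, "device"))                   # True
```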
Further, depending on the OpenCL device 14, there is a case where a cache 24 that functions as a cache memory of the device memory 30 is further provided.
As described above, a plurality of memories in different hierarchies are included in the OpenCL device 14. The higher the hierarchical level of a memory, the faster the PE can access it. The hierarchical level increases in order of the device memory 30 (lowest), the local memory 22 (intermediate) and the private memory 20 (highest), and the access speed from the PE becomes higher in this order.
In order to fully bring out the performance of the OpenCL device 14, it is necessary to contrive a scheme for movement of data between the device memory 30 and the private memory 20/the local memory 22, such as referring to frequently used data after moving it to a high-speed memory space, for example.
In the case of a serial processor that is different in control method from the OpenCL device, movement of data between the global memory space and the private memory space is also performed. This is described with reference to an example of a serial processor shown in FIG. 20.
A serial processor 50 shown in FIG. 20 includes a PE 52, which is a processing element, a private memory 54, a global memory 56, and a cache control mechanism 58.
As shown therein, the storage device of the serial processor 50 is divided into the private memory 54 and the global memory 56. The private memory 54 is a low-capacity memory that is physically on-chip, and the global memory 56 is a high-capacity memory that is physically off-chip.
Although the storage device is divided into the private memory 54 and the global memory 56 in the serial processor 50, data movement between the private memory 54 and the global memory 56 is automatically done by the cache control mechanism 58 placed between the private memory 54 and the global memory 56, and a user of the serial processor 50 sees it as one large memory space. In other words, a user of the serial processor 50 can easily develop a user program for the PE 52 to perform arithmetic processing without considering how data moves between the global memory 56 and the private memory 54.
In a parallel processor, particularly one including a large number of cores (PE) such as the OpenCL device 14 shown in FIG. 18, the same number of private memories 20 as the number of cores exist, and further the same number of local memories 22 as the number of CU 16 exist. It is generally not feasible to manage those memories all together by one cache control mechanism due to high hardware costs.
On the other hand, without the cache control mechanism, a plurality of memory spaces are visible to a user of the OpenCL system 10 (which is hereinafter referred to simply as “user”). As described earlier, in order to bring out better performance by a scheme such as referring to frequently used data after moving it to a high-speed memory space (that is, a higher-hierarchy memory space), the user program must explicitly specify the movement of data between memories in different hierarchies that is involved in arithmetic processing. To achieve this correctly, a user needs to have knowledge about differences in speed, capacity, function and the like among the above-described memories. A specific example is described with reference to FIG. 21.
FIG. 21 is a diagram illustrating the case of executing arithmetic processing to obtain data blocks A′ and B′ from a plurality of data blocks (data blocks A to D). Note that, in FIG. 21, illustration of kernel transfer from the host to the device is omitted. Further, the data blocks A to D are reference data transferred from the host 12 to the OpenCL device 14 in the above-described Step 1 and stored in the global memory 32. The data blocks A′ and B′ are a result of arithmetic processing performed on the data blocks A to D in the above-described Step 2 and written into the global memory 32 and then transferred to the host 12 in the above-described Step 3.
The processing of Step 2, which is arithmetic processing to execute the kernel, is described hereinbelow. Note that, in this specification, in the case where there can be a plurality of private memories, they are referred to as “private memory group”.
If the performance of arithmetic processing is not pursued, a technique of using only the global memory 32 without using the private memory group/the local memory 22 in arithmetic processing can be employed. In this case, there is no data transfer between the global memory 32 and the private memory group/the local memory 22.
This technique is simple to control but not good in performance. In order to achieve the better performance of arithmetic processing, a technique of performing arithmetic processing after transferring data to be processed from the global memory 32 to the private memory group/the local memory 22 and then transferring a result of the arithmetic processing to the global memory 32 after storing it into the private memory group/the local memory 22 is employed as described above.
For the case of using this technique, a procedure (Steps A to C) when all items can be simultaneously executed in parallel is described first. Note that “all items can be simultaneously executed in parallel” means that the number of PE is equal to or more than the total number of items and that the private memory group and the local memory have sufficient capacity to store all the data to be processed. In this case, the transfer of data to be processed from the global memory 32 to the private memory group/the local memory 22, the parallel execution of arithmetic processing by each PE 18, and the transfer of a processing result from the private memory group/the local memory 22 to the global memory 32 are each performed only once.
<Step A>: Transfer the data blocks A to D stored in the global memory 32 to the private memory group/the local memory 22.
In this transfer, for example, data used only by one PE 18 among the data to be processed is transferred to the private memory of that PE 18, and data shared by a plurality of PE 18 is transferred to the local memory 22.
Note that the data transfer from the global memory 32 to the private memory group/the local memory 22 is referred to hereinafter as “read transfer”. Further, the data block that is read-transferred such as the data blocks A to D is referred to as “read block RB”.
<Step B>: Execute arithmetic processing on each PE 18 and store a result of the arithmetic processing into the private memory/the local memory 22 that can be accessed by the PE 18.
<Step C>: Transfer the data blocks A′ and B′ obtained by the arithmetic processing in Step B and stored in the private memory group/the local memory 22 to the global memory 32.
Note that the data transfer from the private memory group/the local memory 22 to the global memory 32 is referred to hereinafter as “write transfer”. Further, the data block that is stored in the private memory group/the local memory 22 and write-transferred such as the data blocks A′ and B′ is referred to as “write block WB”.
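Steps A to C can be sketched for the simple case where all items run at once. The arithmetic used here (A′ = A + B and B′ = C × D, element by element) is a hypothetical placeholder, since the actual processing is not specified; the function name and the use of one private memory per PE without a local memory are likewise assumptions.

```python
def run_steps_a_to_c(global_mem, num_pe):
    """Model Steps A to C with one element of each block per PE."""
    # Step A (read transfer): copy each PE's share of the read blocks
    # from the global memory into that PE's private memory.
    private = [{name: global_mem[name][pe] for name in ("A", "B", "C", "D")}
               for pe in range(num_pe)]

    # Step B: each PE computes on its private data and stores the
    # result in its own private memory (placeholder arithmetic).
    for mem in private:
        mem["A'"] = mem["A"] + mem["B"]
        mem["B'"] = mem["C"] * mem["D"]

    # Step C (write transfer): copy the write blocks from the private
    # memories back into the global memory.
    for name in ("A'", "B'"):
        global_mem[name] = [mem[name] for mem in private]
    return global_mem

global_mem = {"A": [1, 2, 3, 4], "B": [10, 20, 30, 40],
              "C": [1, 1, 2, 2], "D": [5, 6, 7, 8]}
run_steps_a_to_c(global_mem, num_pe=4)
print(global_mem["A'"])  # [11, 22, 33, 44]
print(global_mem["B'"])  # [5, 6, 14, 16]
```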
All of the three steps need to be explicitly specified in the kernel created by a user. This specification includes the content of arithmetic processing and the content depending on the configuration of the OpenCL device 14 (the number of PE (=the number of private memories), the capacity of each private memory, the capacity of the local memory etc.).
For example, in the case where there are a plurality of read blocks RB to be processed and each of the read blocks RB needs to be segmented into sub-blocks because all of them cannot be stored in the private memory group/the local memory 22 in one work group, it is necessary to specify an association method between sub-blocks for the plurality of read blocks RB in Step A. The “association method” between sub-blocks of the read blocks RB means which sub-blocks among the sub-blocks of the plurality of read blocks RB are to be transferred to the private memory group or the local memory 22 of the same work group. This depends on the content of arithmetic processing, and the way of segmenting them depends on the configuration of the OpenCL device 14.
Likewise, in the case where there are a plurality of write blocks WB as a result of arithmetic processing, it is also necessary to specify an association method in the meaning of under what combination of the sub-blocks of the read block RB the respective sub-blocks of the plurality of write blocks WB are obtained as a processing result. Note that the content of each sub-block in the write block WB is data stored as a processing result in the private memory group or the local memory 22 of each work group. The transfer of the write blocks WB to the global memory 32 means to write the data into each sub-block position of the write blocks WB in the global memory 32. Just like the association method of the read blocks RB, the association method of the write blocks WB also depends on the content of arithmetic processing and the configuration of the OpenCL device 14.
Besides the case where all of the desired data blocks cannot be stored in the memory in the work group as described above, in the case where the total number of PE is smaller than the size of the index space, for example, all items cannot be simultaneously executed in parallel and therefore parallel execution of items by the PE needs to be repeated a plurality of times. As a matter of course, the read transfer and the write transfer also need to be repeated as the parallel execution is repeated. In this case, it is necessary to specify the segmentation method of the data blocks and the association method of sub-blocks obtained by dividing the data blocks in accordance with the content of arithmetic processing and the configuration of the OpenCL device 14.
The “segmentation method” of a data block means how to segment the data block into sub-blocks. The “sub-block SB” is a unit of read transfer and write transfer. Hereinafter, when it is necessary to distinguish between read and write, the sub-blocks obtained by segmenting the read block RB are referred to as “sub-read blocks SRB”, and the sub-blocks obtained by segmenting the write block WB are referred to as “sub-write blocks SWB”.
The “association method” between sub-blocks SB means which sub-blocks SB included in different read blocks or write blocks are to reside in the same private memory group or in the same local memory 22 at the same time.
The segmentation method of a data block depends on the configuration of the OpenCL device 14, and the association method of sub-blocks depends on the content of arithmetic processing. The specification is more complicated when the segmentation is necessary compared with when a data block is not segmented.
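The two notions can be illustrated with a short sketch. The sub-block size stands in for a limit imposed by the device configuration, and the association rule used here (sub-blocks with the same index reside together) is an assumption that suits element-wise processing; other processing may need a different rule. The function names are hypothetical.

```python
def segment(block, sub_size):
    """Segmentation method: cut a data block into sub-blocks of sub_size."""
    return [block[i:i + sub_size] for i in range(0, len(block), sub_size)]

def associate(read_blocks, sub_size):
    """Association method: group the sub-blocks that must reside in the
    same private memory group or local memory at the same time.

    Assumes all read blocks have the same length and pairs sub-blocks
    that share the same index.
    """
    segmented = {name: segment(data, sub_size)
                 for name, data in read_blocks.items()}
    count = len(next(iter(segmented.values())))
    return [{name: subs[i] for name, subs in segmented.items()}
            for i in range(count)]

read_blocks = {"A": [1, 2, 3, 4, 5, 6], "B": [10, 20, 30, 40, 50, 60]}
groups = associate(read_blocks, sub_size=2)
for group in groups:
    print(group)
# {'A': [1, 2], 'B': [10, 20]}
# {'A': [3, 4], 'B': [30, 40]}
# {'A': [5, 6], 'B': [50, 60]}
```

Changing sub_size changes only the segmentation, while the pairing rule stays the same, which mirrors the split between device-dependent and processing-dependent specifications described above.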
FIG. 22 shows the description that needs to be specified by a user in order to cause the OpenCL device 14 to perform arithmetic processing.
As shown in FIG. 22, the first part is specifications for read transfer, and it includes a part depending on the content of arithmetic processing and the configuration of the OpenCL device 14.
The part depending on the content of arithmetic processing and the configuration of the OpenCL device 14 is a specification whether to segment the read block RB (example 1), a specification of the segmentation method when segmenting the read block RB (example 2), and a specification of the association method between the sub-read blocks SRB (example 3).
The second part is specifications of arithmetic processing on the read block RB or the sub-read block SRB. Because this part specifies arithmetic processing, it depends on the content of arithmetic processing as a matter of course. Further, because this part needs to conform to the specifications for read transfer, it contains the content depending on the configuration of the OpenCL device 14, such as a specification of the number of times of parallel execution of items (example 4).
The third part is specifications for write transfer, and it includes a part depending on the content of arithmetic processing and the configuration of the OpenCL device 14 (example 5) by necessity because it needs to conform to the specifications for read transfer.
As described above, in order to pursue the better performance, a user needs to develop the kernel (user code) in accordance with the content of arithmetic processing and the configuration of the OpenCL device 14.
However, even devices in conformity to OpenCL are different in the capacity of each memory space, the access speed, the access delay, the presence or absence of cache control and the like if manufacturers are different. Therefore, there is a possibility that a user code that is ideally developed for movement of data between memories in different hierarchies for a certain OpenCL device causes degradation of performance for another OpenCL device or an OpenCL device of a different generation in the same series. Thus, the portability in performance of a user code is low.
A certain degree of performance portability can be realized by creating a user code with the configurations of many types of existing OpenCL devices in mind, rather than developing a user code for a specific OpenCL device. However, this work places a large burden on those who design arithmetic processing because it is not essential work, and it further causes a decrease in code legibility, an increase in complexity and the like.
Patent Literature 1 (Japanese Unexamined Patent Application Publication No. 2013-025547) discloses a technique for reducing a burden on a user code developer and enhancing the portability of a user code for movement of data between a plurality of memories in different hierarchies involved in arithmetic processing on the OpenCL device.
In this technique, a developer of a user code sets, as arguments of the kernel, an attribute group containing a plurality of attributes for each data block to be processed (read block) and each data block obtained as a result of processing (write block) in the OpenCL device. A processing control unit of the OpenCL device automatically determines a transfer method based on the attribute group of each of these data blocks, as indicated by the arguments of the kernel transferred from the host, and on parameters indicating the configuration of the OpenCL device. It then controls the transfer of data with the determined transfer method and the arithmetic processing by the OpenCL device.
Note that the transfer method mainly relates to how to transfer data blocks between the device memory and the local memory and the private memory.
FIG. 23 shows the description that needs to be specified by a user when developing the kernel in the OpenCL device to which the technique of Patent Literature 1 is applied. As shown in FIG. 23, the description includes a specification of the attribute group and a specification of user processing only, both of which do not depend on the device configuration.
As shown in FIG. 23, the attribute group is classified into groups of a unique attribute, an arithmetic attribute and a policy attribute, and the attribute of each group is required to determine the transfer method and does not depend on the configuration of the OpenCL device. For example, the policy attribute includes an allocation attribute indicating whether to segment the data block into a plurality of sub-blocks and transfer the sub-blocks and a segmentation method when segmenting it, a margin attribute indicating the size of data neighboring sub-blocks that is transferred together with the sub-blocks when segmenting the data block into a plurality of sub-blocks and transferring the sub-blocks, and a dependence attribute indicating whether the sub-blocks have dependence with other neighboring sub-blocks when segmenting the data block into a plurality of sub-blocks and transferring the sub-blocks and indicating all dependence directions when there is dependence.
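The effect of the margin attribute can be pictured with the sketch below. It does not reproduce the attribute encoding of Patent Literature 1, which is not given here; the function name, the one-dimensional layout, and the clamping of the margin at the ends of the block are all assumptions.

```python
def segment_with_margin(block, sub_size, margin):
    """Cut a block into sub-blocks and attach `margin` neighboring
    elements on each side of every sub-block (clamped at the edges)."""
    sub_blocks = []
    for start in range(0, len(block), sub_size):
        lo = max(0, start - margin)                       # left margin
        hi = min(len(block), start + sub_size + margin)   # right margin
        sub_blocks.append(block[lo:hi])
    return sub_blocks

block = list(range(8))
print(segment_with_margin(block, sub_size=4, margin=0))
# [[0, 1, 2, 3], [4, 5, 6, 7]]
print(segment_with_margin(block, sub_size=4, margin=1))
# [[0, 1, 2, 3, 4], [3, 4, 5, 6, 7]]
```

A margin of this kind is what processing such as a neighborhood filter would need, because each sub-block is transferred together with the adjacent data it reads.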
Note that the attribute group of the write block is set on the assumption that the write block already exists in the local memory or the private memory and is transferred to the device memory.
According to the technique disclosed in Patent Literature 1, a developer of a user code (kernel) can implement a high-performance kernel with high portability simply by specifying the attribute group for each data block used by the kernel, without knowing the configuration of the OpenCL device. An arithmetic control unit of the OpenCL device can perform control to segment the data block into the size most suitable for the device, based on the attribute group of each data block referred to by the kernel and the configuration of the OpenCL device, and to repeat the sequence of “read transfer→arithmetic processing by the kernel→write transfer” a number of times equal to the number of segments. The arithmetic control unit of the device may be designed by an expert having a good knowledge of the configuration of the OpenCL device, which is a developer in the manufacturer of the OpenCL device, for example.
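The control loop described above can be sketched as follows. This is a simplified model, not the implementation of Patent Literature 1: the function run_with_control_unit, the capacity parameter, the rule used to pick the sub-block size, and the element-wise add kernel are all assumptions made for illustration.

```python
def run_with_control_unit(global_mem, read_names, write_name, kernel,
                          fast_capacity):
    """Repeat read transfer -> kernel -> write transfer once per segment.

    fast_capacity is the number of elements the fast (private/local)
    memory can hold; it models the device configuration.
    """
    n = len(global_mem[read_names[0]])
    # Pick a sub-block size so that one sub-block of every read block
    # plus one sub-block of the write block fit in the fast memory.
    sub_size = max(1, fast_capacity // (len(read_names) + 1))
    global_mem[write_name] = [None] * n
    rounds = 0
    for start in range(0, n, sub_size):
        end = min(n, start + sub_size)
        # Read transfer: global memory -> fast memory.
        fast = {name: global_mem[name][start:end] for name in read_names}
        # Arithmetic processing by the kernel on the fast memory.
        fast[write_name] = kernel(fast)
        # Write transfer: fast memory -> global memory.
        global_mem[write_name][start:end] = fast[write_name]
        rounds += 1
    return rounds

def add_kernel(mem):
    """Hypothetical user kernel; it knows nothing about the device."""
    return [a + b for a, b in zip(mem["A"], mem["B"])]

small = {"A": [1, 2, 3, 4, 5, 6], "B": [10, 20, 30, 40, 50, 60]}
large = {"A": [1, 2, 3, 4, 5, 6], "B": [10, 20, 30, 40, 50, 60]}
rounds_small = run_with_control_unit(small, ["A", "B"], "A'", add_kernel, 6)
rounds_large = run_with_control_unit(large, ["A", "B"], "A'", add_kernel, 12)
print(rounds_small, small["A'"])  # 3 [11, 22, 33, 44, 55, 66]
print(rounds_large, large["A'"])  # 2 [11, 22, 33, 44, 55, 66]
```

The same kernel produces the same result on both modeled devices; only the number of repetitions changes with the capacity, which is the portability property the text attributes to this technique.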