Attempts have been made to improve application performance through vectorization and the adoption of SIMD (Single Instruction, Multiple Data) in the arithmetic functions of a processor. By executing operations on a plurality of element data, which are load objects, simultaneously with one instruction, the operation throughput of the processor increases and its performance improves. To apply SIMD to an application, the transfer of data between a main memory and a register must also be adapted to SIMD. Here, “element data” refers to each individual datum that is a load object.
Adapting the transfer of data between the memory and the register to SIMD is easy for data stored in continuous areas in the memory. However, memory access by an application is not necessarily access to continuous areas. For example, many scientific and technical computing applications handle sparse matrix operations or data structures whose elements are not stored contiguously, and there is demand for the adoption of SIMD to accelerate memory access to data stored in non-continuous areas in the memory.
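As an illustration of why sparse matrix operations access non-continuous areas, the following is a minimal sketch of a sparse matrix-vector product. The compressed sparse row (CSR) layout and the names `csr_matvec`, `values`, `col_idx`, and `row_ptr` are assumptions chosen for the example and do not appear in the description above. The point is the indexed read `x[col_idx[k]]`, whose addresses depend on the data and are generally not adjacent.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply a CSR-format sparse matrix by a dense vector x."""
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # Indexed (gather-style) read: the position in x is taken
            # from col_idx, so consecutive k may touch distant addresses.
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


# Hypothetical 3x4 matrix:
# [[1, 0, 0, 2],
#  [0, 0, 3, 0],
#  [4, 0, 0, 5]]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
col_idx = [0, 3, 2, 0, 3]
row_ptr = [0, 2, 3, 5]
x = [1.0, 10.0, 100.0, 1000.0]

y = csr_matvec(values, col_idx, row_ptr, x)  # [2001.0, 300.0, 5004.0]
```

In the first row, the reads from `x` hit positions 0 and 3, which is the kind of access a contiguous SIMD load cannot serve with one instruction.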
Hitherto, the transfer of data between the memory and the register for data stored in non-continuous areas in the memory has been programmed using a plurality of instructions, such as a shift instruction, a data insertion instruction, and a data movement instruction within the register. As a result, the programming becomes complicated and the performance is not high. Accordingly, processors are appearing that have a gather-load instruction, which is an instruction to gather a plurality of data stored in non-continuous areas in the memory and load them into one register.
The gather-load instruction is highly flexible and facilitates programming, but it is difficult for hardware to process at high speed, and sufficient performance is not achieved in practice. The data size and the address range that a single cache access or memory access can cover are restricted by the physical hardware configuration. For example, data in different cache lines generally cannot be accessed simultaneously.
The gather-load instruction may access completely different addresses for each of the plurality of element data that are load objects. Thus, assuming the worst case, one conceivable implementation decomposes the gather-load instruction into processes per element data and loads the respective element data in parallel. However, when processing is performed per element data, the throughput benefit of SIMD adoption is not obtained in processing the gather-load instruction.
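The per-element decomposition can be sketched with a small software model. This is only an illustration under stated assumptions: memory is modeled as a Python list addressed in units of elements, and the names `gather_per_element`, `base`, `offsets`, and `mask` are hypothetical. The model counts one cache request for every element data that needs to be loaded, regardless of where the elements lie.

```python
def gather_per_element(memory, base, offsets, mask):
    """Model a gather-load that issues one request per element data.

    Returns the gathered values (None for masked-off elements) and the
    number of requests issued.
    """
    result = [None] * len(offsets)
    requests = 0
    for i, (off, m) in enumerate(zip(offsets, mask)):
        if m:
            # Each element is handled as its own request, even when it
            # sits next to the previous element in memory.
            result[i] = memory[base + off]
            requests += 1
    return result, requests


# Hypothetical memory image: address a holds the value 100 + a.
memory = list(range(100, 164))
offsets = [0, 1, 2, 3, 40, 41, 8, 9]
mask = [1] * 8

values, requests = gather_per_element(memory, 0, offsets, mask)
# values   == [100, 101, 102, 103, 140, 141, 108, 109]
# requests == 8
```

Eight elements cost eight requests here, so the request count scales with the number of elements rather than with the number of distinct cache lines touched.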
To increase the throughput of the gather-load instruction, it is conceivable to load a plurality of element data together whenever they are simultaneously accessible, thereby decreasing the number of cache accesses. Specifically, it is conceivable to proceed with processing in order, starting from the element data that can be loaded simultaneously, in combination with mask information that indicates, for each element data that is a load object, whether it needs to be loaded.
In this method, a request is first issued for the head element data that needs to be loaded (the first element data whose mask information is 1), and a load process is performed. At this time, the head element data and any subsequent element data on the same cache line are processed simultaneously, and the mask information of each element data that has been processed is set to 0 (zero) to mark it as processed. Next, the process is executed again on the element data that is first, as seen from the head side, among those whose mask information is still 1 (those for which the load process still needs to be executed), so as to perform the subsequent load process.
The above process is repeated as long as element data that need to be loaded (mask information of 1) remain. When no such element data remain (the mask information is all 0), the gather-load instruction as a whole is completed. In this method, the mask information and the element data to be processed next are determined by the result of the previous process. The processing is therefore serial, the overall latency becomes long, and there is a problem of low performance.
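The mask-driven procedure described above can be sketched as a software model. The cache line size of 8 elements (`LINE_ELEMS`), the list-based memory, and the names `gather_by_cache_line`, `base`, `offsets`, and `mask` are assumptions made for illustration, not details taken from any particular processor. Each pass of the loop corresponds to one cache access, and the next pass cannot start until the mask has been updated by the previous one.

```python
LINE_ELEMS = 8  # assumed cache line size, in elements


def gather_by_cache_line(memory, base, offsets, mask):
    """Model a gather-load that serves one cache line per pass.

    Returns the gathered values (None for masked-off elements) and the
    number of passes, i.e. the number of cache accesses.
    """
    mask = list(mask)  # work on a copy; 1 = still needs loading
    result = [None] * len(offsets)
    passes = 0
    while any(mask):
        head = mask.index(1)  # first element whose mask is 1
        line = (base + offsets[head]) // LINE_ELEMS
        for i in range(head, len(offsets)):
            # Serve the head element and every later pending element
            # that lies on the same cache line, then clear their masks.
            if mask[i] and (base + offsets[i]) // LINE_ELEMS == line:
                result[i] = memory[base + offsets[i]]
                mask[i] = 0
        passes += 1
    return result, passes


# Hypothetical memory image: address a holds the value 100 + a.
memory = list(range(100, 164))

values, passes = gather_by_cache_line(
    memory, 0, [0, 1, 2, 3, 40, 41, 8, 9], [1] * 8)
# values == [100, 101, 102, 103, 140, 141, 108, 109], passes == 3

_, worst = gather_by_cache_line(memory, 0, [0, 8, 16, 24], [1] * 4)
# worst == 4: every element is on a different line, one pass each
```

The first call touches three cache lines and finishes in three passes instead of eight per-element requests. The second call shows the worst case, where the pass count equals the element count and the serial dependence on the updated mask dominates the latency.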
As an example of implementing this method in hardware, Patent Document 1 describes a method in which hardware resources for retaining the addresses, masks, and offsets of all element data are provided in a gather control unit, resulting in a large increase in circuit size. It is also conceivable to implement this method in software, by re-executing the gather-load instruction a plurality of times at the program level. However, when the addresses of the load objects are located across plural cache lines, the gather-load instruction is processed sequentially again and again, resulting in quite a large latency.
Further, the following method has been proposed (see Patent Document 2) for a processor in which the update of mask information accompanying the completion of processing the previous element data and the address generation for the next element data are internally divided into plural serial processes at the level of an instruction issuing unit and an instruction operating unit in the hardware. An index table, which holds address offsets converted from the plural vector registers to be processed together with mask information, is provided near an address generator, and the update of mask information accompanying the completion of processing the previous element data and the address generation for the next element data are processed simultaneously. This decreases the number of serial processes into which the operation is internally divided in the hardware, thereby improving performance. Further, when there exist element data whose address offsets are exactly the same, the data read as the head element data is broadcast in advance to all the element data of the vector register, and a plurality of actual write signals are asserted and processed simultaneously, thereby achieving high speed.
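The handling of identical address offsets can be illustrated with a simplified software model. This is a sketch under assumptions, not the design of Patent Document 2: the list-based memory and the names `gather_with_broadcast`, `base`, `offsets`, and `mask` are hypothetical, and the simultaneous assertion of write signals is modeled as an inner loop that writes the single value read for the head element into every pending lane with the same offset.

```python
def gather_with_broadcast(memory, base, offsets, mask):
    """Model a gather-load that reads once per distinct offset.

    Returns the gathered values (None for masked-off elements) and the
    number of memory reads performed.
    """
    mask = list(mask)  # work on a copy; 1 = still needs loading
    result = [None] * len(offsets)
    reads = 0
    while any(mask):
        head = mask.index(1)  # first element whose mask is 1
        data = memory[base + offsets[head]]  # single read for the head
        reads += 1
        for i in range(head, len(offsets)):
            # Lanes whose offset equals the head's offset receive the
            # same data; this stands in for asserting their write
            # signals at the same time.
            if mask[i] and offsets[i] == offsets[head]:
                result[i] = data
                mask[i] = 0
    return result, reads


# Hypothetical memory image: address a holds the value 100 + a.
memory = list(range(100, 164))

same, reads_same = gather_with_broadcast(memory, 0, [5, 5, 5, 5], [1] * 4)
# same == [105, 105, 105, 105], reads_same == 1

mixed, reads_mixed = gather_with_broadcast(memory, 0, [5, 9, 5, 9], [1] * 4)
# mixed == [105, 109, 105, 109], reads_mixed == 2
```

When all four offsets match, one read fills every lane. With two distinct offsets, two reads suffice.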
[Patent Document 1] U.S. Patent Application Publication No. 2012/0254542
[Patent Document 2] U.S. Patent Application Publication No. 2015/0074373