A SIMD (Single Instruction Multiple Data) instruction is an instruction that executes arithmetic processing on multiple data with a single instruction. The SIMD instruction will be described with reference to FIG. 16. A processor that executes the SIMD instruction includes an instruction buffer 1601 storing instructions, a plurality of processing units (PU) 1602-1 to 1602-4 which perform arithmetic processing, and a data buffer 1603 storing data. When one SIMD instruction fetched from the instruction buffer 1601 is executed, the processing units 1602-1 to 1602-4 concurrently apply the arithmetic processing indicated by the instruction to a plurality of data D1 to D4 stored in the data buffer 1603. The SIMD instruction is used when the same arithmetic processing is executed on a plurality of data in parallel, as is done in matrix calculation.
Square matrix multiplication illustrated in FIG. 17 will be described as an example. In this square matrix multiplication, a 4-row×4-column matrix C is calculated by multiplying a 4-row×4-column matrix A by a 4-row×4-column matrix B. The two digits in the name of each matrix element represent its row number and column number. For example, “a12” represents the data element at the first row and second column of the matrix A. An element cij at the i-th row and j-th column of the 4-row×4-column matrix C is calculated by the following multiply-add operation.
cij = ai1×b1j + ai2×b2j + ai3×b3j + ai4×b4j
For example, an element c11 at the first row and first column of the matrix C is calculated by the following multiply-add operation.
c11 = a11×b11 + a12×b21 + a13×b31 + a14×b41  (1)
It is assumed that the processing units of the processor are each capable of executing the multiply-add operation of “C=A×B+C”. This arithmetic processing is generally called FMA (Floating point Multiply Add, Fused Multiply Add, etc.), and recent processors implement FMA instructions. The FMA instruction typically takes four operands in total, namely, three source operands A, B, C, which are the objects of the operation, and one destination operand C, which receives the operation result.
The element c11 at the first row and first column of the matrix C, which is found by the aforesaid expression (1), can be calculated with the following four FMA instructions. Note that, in the description below, operands given to each of the FMA instructions are the source operand A, the source operand B, the source operand C, and the destination operand C in this order. In the first FMA instruction, 0 is given as an initial value of the result of the multiply-add operation.
FMA a11, b11, 0, c11
FMA a12, b21, c11, c11
FMA a13, b31, c11, c11
FMA a14, b41, c11, c11
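The four-instruction sequence above can be mimicked with a minimal Python sketch. The helper `fma` and the sample values are hypothetical and serve only as illustration. Note also that `a * b + c` in Python rounds twice, so this models the data flow of the FMA sequence rather than the single rounding of a hardware fused multiply-add.

```python
def fma(a, b, c):
    """Return a * b + c, modelling the multiply-add "C = A x B + C"."""
    return a * b + c


# Hypothetical sample values (not from the source text).
a11, a12, a13, a14 = 1.0, 2.0, 3.0, 4.0   # first row of matrix A
b11, b21, b31, b41 = 5.0, 6.0, 7.0, 8.0   # first column of matrix B

c11 = fma(a11, b11, 0.0)   # FMA a11, b11, 0,   c11
c11 = fma(a12, b21, c11)   # FMA a12, b21, c11, c11
c11 = fma(a13, b31, c11)   # FMA a13, b31, c11, c11
c11 = fma(a14, b41, c11)   # FMA a14, b41, c11, c11

print(c11)  # 1*5 + 2*6 + 3*7 + 4*8 = 70.0
```

Each call consumes the result of the previous one, which is the data dependency that matters once the operation latency exceeds one cycle.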
The elements of the matrix C can be calculated in parallel because there is no dependency among the arithmetic processing of the elements. Accordingly, if a processor which executes a SIMD instruction includes four processing units and performs FMA operations as the SIMD instruction, it is possible to concurrently calculate four elements of the matrix C. For example, as illustrated in FIG. 18, a processing unit (PU #1) 1801 executes the operation relevant to the element c11 of the matrix C, a processing unit (PU #2) 1802 executes the operation relevant to the element c12, a processing unit (PU #3) 1803 executes the operation relevant to the element c13, and a processing unit (PU #4) 1804 executes the operation relevant to the element c14, so that the four processing units (PU) concurrently calculate the elements c11, c12, c13, c14 of the matrix C. The calculation of the elements on one row of the matrix C is thus completed with four SIMD instructions. Repeating this four times, once per row, completes the calculation of all the sixteen elements of the matrix C with sixteen SIMD instructions.
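The row-by-row scheme can be sketched in Python as follows. This is only an illustrative model: `simd_fma` stands in for one four-lane SIMD FMA instruction, and feeding each lane the same broadcast element of A together with one row of B is an assumption about operand layout that the description above does not specify.

```python
def simd_fma(a, b, c):
    """Model one SIMD FMA: each lane k computes a[k] * b[k] + c[k]."""
    return [x * y + z for x, y, z in zip(a, b, c)]


def matmul_4x4(A, B):
    """Multiply two 4x4 matrices, counting the modelled SIMD FMA instructions."""
    n = 4
    C = []
    issued = 0
    for i in range(n):
        acc = [0.0] * n                   # lanes hold ci1, ci2, ci3, ci4
        for k in range(n):
            a_bcast = [A[i][k]] * n       # assumed layout: aik in every lane
            acc = simd_fma(a_bcast, B[k], acc)
            issued += 1
        C.append(acc)
    return C, issued


# Hypothetical sample data: B is twice the identity, so C should equal 2 * A.
A = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0],
     [9.0, 10.0, 11.0, 12.0],
     [13.0, 14.0, 15.0, 16.0]]
B = [[2.0 if r == c else 0.0 for c in range(4)] for r in range(4)]

C, issued = matmul_4x4(A, B)
print(C[0], issued)
```

Four modelled instructions finish one row, and the outer loop over four rows brings the total to sixteen, matching the count given above.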
If FMA instructions are executed on hardware designed for a high operating frequency, the cycle time, which is the reciprocal of the frequency, becomes shorter, making it difficult to complete the execution of an FMA instruction in one cycle. For example, if the operation latency of the FMA instruction is four cycles, SIMD FMA instructions having data dependency can be executed only every four cycles, with a time lag of three cycles in each interval between them as illustrated in FIG. 19A, resulting in pipeline bubbles during those three cycles. A method called a software pipeline is one method to avoid the pipeline bubbles. The software pipeline improves the operating rate of a processing unit by inserting other instructions having no data dependency into the empty cycles between instructions having data dependency. For example, as illustrated in FIG. 19B, a sequence of instructions for calculating other elements of the matrix is inserted into the empty cycles of a processor calculating a certain element of the matrix.
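The effect of the software pipeline can be illustrated with a toy timing model in Python. The model is an assumption made for illustration, not a description of any particular processor: at most one instruction is issued per cycle in program order, and an instruction that depends on an earlier one cannot be issued until `latency` cycles after that earlier instruction was issued. The function names are hypothetical.

```python
def issue_cycles(deps, latency):
    """Return the issue cycle of each instruction.

    deps[k] is the index of the earlier instruction whose result
    instruction k needs, or None if it has no dependency.
    """
    issue = []
    for dep in deps:
        cycle = issue[-1] + 1 if issue else 0
        if dep is not None:
            cycle = max(cycle, issue[dep] + latency)
        issue.append(cycle)
    return issue


def chain_by_chain(chains, steps):
    """Order in the style of FIG. 19A: finish one element, then the next."""
    return [None if k % steps == 0 else k - 1 for k in range(chains * steps)]


def interleaved(chains, steps):
    """Order in the style of FIG. 19B: rotate through independent elements."""
    return [None if k < chains else k - chains for k in range(chains * steps)]


LATENCY = 4
plain = issue_cycles(chain_by_chain(4, 4), LATENCY)
piped = issue_cycles(interleaved(4, 4), LATENCY)
print("without software pipeline:", plain[-1] + LATENCY, "cycles")
print("with software pipeline:   ", piped[-1] + LATENCY, "cycles")
```

Under these assumptions, sixteen FMA instructions arranged as four dependent chains finish in 55 cycles when issued chain by chain and in 19 cycles when interleaved, because the interleaved order always has an independent instruction ready for the next cycle.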
FIG. 20 and FIG. 21 illustrate timing charts when the instructions are executed as illustrated in FIG. 19A and FIG. 19B respectively. As illustrated in FIG. 20, where the software pipeline is not used, the FMA instructions are each executed in four cycles, namely, the first stage to the fourth stage. After the first instruction is supplied for execution, there is a four-cycle latency until the next instruction having data dependency is supplied, and accordingly the three stages other than the stage where the execution is progressing become idle. On the other hand, as illustrated in FIG. 21, where the software pipeline is used, the cycles in which instructions are supplied and executed are staggered by one cycle each, which makes it possible to execute the instructions in different stages concurrently, enabling highly efficient operation of the arithmetic units.
Patent Document 1: Japanese Laid-open Patent Publication No. 2015-55971
Patent Document 2: Japanese Laid-open Patent Publication No. 2008-3708
However, even if the SIMD instruction for the aforesaid parallel arithmetic processing is used, many instructions are required to execute the same arithmetic processing on a plurality of data a plurality of times as in matrix calculation. For example, the above-described 4-row×4-column square matrix multiplication requires only sixteen instructions, but as the size N of the square matrix becomes larger, the number of instructions increases on the order of O(N²). Further, in a convolution operation often used in deep learning, if the image size is N×N and the kernel size is M×M, the number of instructions increases on the order of O(N²M²).
The convolution operation is processing which uses a small rectangular filter to extract, from an original image, the characteristic structure that the filter represents. As illustrated in FIG. 22, features of small rectangular areas 2202 are extracted from a target image 2201, and then are used to create pixels of an image 2203 of the next layer. The rectangular area at this time is called a kernel and is an image data area used for calculating one element of the image of the next layer. A pixel value is generated by a multiply-add operation on this area using values defining feature quantities. In the convolution operation, the number of instructions increases on the order of O(N²M²), and accordingly a size increase of the kernel results in an explosive increase of the number of instructions. This requires a large amount of resources such as buffers for storing the instructions and also necessitates decoding and issuing of the instructions in each cycle, leading to large power consumption.
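The growth in the number of multiply-add operations can be checked with a small Python sketch. Several choices here are assumptions for illustration: the function name `conv2d_valid` is hypothetical, the kernel is applied without flipping (as is common in deep learning), the stride is one, and no padding is used, so an N×N image and an M×M kernel give an (N−M+1)×(N−M+1) output.

```python
def conv2d_valid(image, kernel):
    """Convolve a square image with a square kernel, counting multiply-adds."""
    n, m = len(image), len(kernel)
    out = n - m + 1                      # output size without padding
    result = [[0.0] * out for _ in range(out)]
    fma_count = 0
    for y in range(out):
        for x in range(out):
            acc = 0.0
            for ky in range(m):
                for kx in range(m):
                    # One multiply-add per kernel element per output pixel.
                    acc = image[y + ky][x + kx] * kernel[ky][kx] + acc
                    fma_count += 1
            result[y][x] = acc
    return result, fma_count


# Hypothetical sample data: all-ones 4x4 image and 3x3 kernel.
image = [[1.0] * 4 for _ in range(4)]
kernel = [[1.0] * 3 for _ in range(3)]
result, fma_count = conv2d_valid(image, kernel)
print(result, fma_count)
```

In this sketch the count equals (N−M+1)²×M², which grows roughly as N²M² when the kernel is small relative to the image. A SIMD implementation divides the count by the number of lanes but does not change the order of growth.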
Further, where the aforesaid software pipeline is employed, a decrease of the operation latency in a successor model or a dynamic extension of the operation latency due to power saving control makes it necessary to rearrange the sequence of instructions optimally by recompiling. Recompiling is difficult in some cases, for example for a library shared by many applications. FIG. 23 illustrates a timing chart when the operation latency becomes two cycles in the aforesaid example. By issuing sets of two instructions having no data dependency one after another, so that each set is issued over two cycles as illustrated in FIG. 23, all the stages (two stages) can work efficiently, but this requires recompiling.
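Why a fixed instruction order is tied to one operation latency can be sketched with a toy Python model. It assumes in-order issue of at most one instruction per cycle and a schedule that rotates through `interleave` independent accumulation chains. The function name and parameters are hypothetical, and the model counts only empty issue slots, not any other cost of a mismatched schedule.

```python
def stall_cycles(interleave, steps, latency):
    """Count empty issue slots for a schedule that rotates through
    `interleave` independent chains of `steps` dependent FMAs each."""
    issue = []
    for k in range(interleave * steps):
        cycle = issue[-1] + 1 if issue else 0
        if k >= interleave:
            # Instruction k needs the result of the same chain's previous FMA.
            cycle = max(cycle, issue[k - interleave] + latency)
        issue.append(cycle)
    return issue[-1] + 1 - len(issue)


# Sets of two independent instructions, in the style of FIG. 23.
print(stall_cycles(interleave=2, steps=4, latency=2))
# The same two-way order on hardware with a four-cycle latency.
print(stall_cycles(interleave=2, steps=4, latency=4))
# A four-way order when the latency is extended to eight cycles.
print(stall_cycles(interleave=4, steps=4, latency=8))
```

In this model the two-way order has no empty slots at a two-cycle latency, but the same order leaves 6 empty slots at a four-cycle latency, and a four-way order leaves 12 empty slots when the latency is extended to eight cycles. Removing those slots means emitting a different instruction order, which corresponds to the recompiling discussed above.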