Owing to the progress of process technology, a large number of transistors can now be integrated on a silicon chip. On the other hand, a processing precision of 32 bits or 64 bits is often sufficient in arithmetic processing. Accordingly, a SIMD (Single Instruction Multiple Data) method, which processes a plurality of pieces of data by driving many arithmetic units in parallel with a single instruction, is widely used as a processing method that makes effective use of the many transistors (for example, refer to Patent Document 1).
In the SIMD method, for example, a plurality of pieces of 32-bit or 64-bit data are stored in a 128-bit or 256-bit vector register. The four basic arithmetic operations (addition, subtraction, multiplication, and division) on the vector data are executed by arranging a plurality of arithmetic units side by side, as illustrated in FIG. 9 as an example, so that each arithmetic unit operates on the corresponding pieces of data. FIG. 9 illustrates, as an example, a processing unit 100 which has four multipliers 101-i (i=1, 2, 3, 4) and calculates element-wise products of vector data each having four elements. The multiplier 101-i receives data a(i−1) and data b(i−1), each being one element of the input vector data a, b, and outputs the product of the data a(i−1) and the data b(i−1) as data c(i−1), which becomes one element of the output vector data c.
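The behavior of the processing unit 100 in FIG. 9 can be sketched in C as follows. This is only an illustrative software model, not the circuit itself: the function name vec_mul4, the constant LANES, and the choice of float as the element type are assumptions made for the example. Each loop iteration stands in for one multiplier 101-i; in the hardware all four multipliers would operate in the same cycle.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4  /* assumed: one lane per multiplier 101-1 .. 101-4 of FIG. 9 */

/* Hypothetical model of processing unit 100: c[i] = a[i] * b[i] for each lane.
   The loop is sequential here; the hardware would evaluate all lanes at once. */
void vec_mul4(const float a[LANES], const float b[LANES], float c[LANES])
{
    assert(a && b && c);  /* precondition: all three vectors must be valid */
    for (size_t i = 0; i < LANES; i++) {
        c[i] = a[i] * b[i];  /* lane i corresponds to multiplier 101-(i+1) */
    }
}
```

For example, with a = (1, 2, 3, 4) and b = (5, 6, 7, 8), the output vector c holds the four element-wise products.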
The currently available SIMD method is often used to supply data to many arithmetic units in one cycle. It is called a short-vector SIMD method because its vector register length is several hundred bits at most, which is shorter than the conventional vector register length of several thousand bits. A vector operation is suitable for efficiently processing matrix operations, which are often used in scientific and technical calculation. Hereinafter, as an example, a description will be given of arithmetic processing in which, for two-dimensional N×N matrices A, B, C (N is an integer equal to or greater than 2), the product of the matrix A and the matrix B is added to the matrix C.
FIG. 10 is a flowchart representing an example of processing in which the product of the matrix A and the matrix B is added to the matrix C by scalar processing. When the processing starts, the value of a variable j is initialized to 0 at step S301. Next, at step S302, the value of the variable j is checked; when it is smaller than N, the processing goes to step S303, and otherwise the processing is ended. At step S303, the value of a variable i is initialized to 0. Next, at step S304, the value of the variable i is checked; when it is smaller than N, the processing goes to step S305, and otherwise 1 is added to the value of the variable j at step S310 and the processing goes to step S302. At step S305, the value of a variable k is initialized to 0. Next, at step S306, the value of the variable k is checked; when it is smaller than N, the processing goes to step S307, and otherwise 1 is added to the value of the variable i at step S309 and the processing goes to step S304. At step S307, the product of data A[j][k] at the (j+1)-th row and the (k+1)-th column of the matrix A and data B[k][i] at the (k+1)-th row and the (i+1)-th column of the matrix B is added to data C[j][i] at the (j+1)-th row and the (i+1)-th column of the matrix C, and the addition result is set as the data at the (j+1)-th row and the (i+1)-th column of the matrix C. Subsequently, at step S308, 1 is added to the value of the variable k, and the processing goes to step S306. When the processing is executed by scalar processing, the product-sum calculation is performed N³ times by the triple loop over the variables i, j, k, as represented in FIG. 10. The number of instructions of the processing represented in FIG. 10 is N³.
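The flow of FIG. 10 corresponds to an ordinary triple loop, which can be sketched in C as follows. The function name matmul_add_scalar, the use of double elements, and the flat row-major storage (element [r][c] stored at index r*n + c) are assumptions made for the example; the step numbers in the comments refer to FIG. 10.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of FIG. 10: C += A * B for n x n matrices, one product-sum per
   innermost iteration, so n*n*n product-sum operations in total.
   Assumed layout: row-major, element [r][c] at index r*n + c. */
void matmul_add_scalar(size_t n, const double *A, const double *B, double *C)
{
    assert(A && B && C);
    for (size_t j = 0; j < n; j++) {          /* S301 init, S302 test, S310 j+1 */
        for (size_t i = 0; i < n; i++) {      /* S303 init, S304 test, S309 i+1 */
            for (size_t k = 0; k < n; k++) {  /* S305 init, S306 test, S308 k+1 */
                /* S307: C[j][i] = C[j][i] + A[j][k] * B[k][i] */
                C[j * n + i] += A[j * n + k] * B[k * n + i];
            }
        }
    }
}
```

The innermost statement runs once for every combination of j, i, and k, which is where the N³ count of product-sum operations comes from.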
FIG. 11 is a flowchart representing an example of processing in which the product of a matrix A and a matrix B is added to a matrix C by vector processing of four elements. The processes at steps S401 to S406 and steps S408 and S409 represented in FIG. 11 correspond to the processes at steps S301 to S306 and steps S308 and S309 represented in FIG. 10. The contents of these processes are the same, and therefore a description thereof will be omitted. At step S407, to which the processing goes when the value of a variable k is smaller than N at step S406, the product of data A[j+x][k] (x=0, 1, 2, 3) at the (j+x+1)-th row and the (k+1)-th column of the matrix A and data B[k][i] at the (k+1)-th row and the (i+1)-th column of the matrix B is added to data C[j+x][i] at the (j+x+1)-th row and the (i+1)-th column of the matrix C, and the addition result is set as the data at the (j+x+1)-th row and the (i+1)-th column of the matrix C. That is, in the example represented in FIG. 11, the vector product operation and the vector sum operation for four consecutive elements are executed with a single instruction. At step S410, to which the processing goes when the value of a variable i is not smaller than N at step S404, 4 is added to the value of the variable j and the processing goes to step S402. When the processing is executed by the vector processing represented in FIG. 11, the vector product operation and the vector sum operation are executed on four consecutive elements at a time, and therefore the N³ product-sum operations are executed by (N³/4) instructions.
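The four-element vector processing of FIG. 11 can be sketched in C as follows. The innermost loop over x is a software stand-in for the single vector instruction of step S407; a real implementation would use the vector instructions of the target processor rather than a loop. The function name matmul_add_vec4, the constant VLEN, the row-major layout, and the requirement that n be a multiple of 4 are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define VLEN 4  /* assumed vector length: four elements per instruction */

/* Sketch of FIG. 11: C += A * B, processing rows j .. j+3 together.
   Assumed layout: row-major, element [r][c] at index r*n + c.
   Assumed precondition: n is a multiple of VLEN. */
void matmul_add_vec4(size_t n, const double *A, const double *B, double *C)
{
    assert(A && B && C);
    assert(n % VLEN == 0);
    for (size_t j = 0; j < n; j += VLEN) {    /* S401 init, S402 test, S410 j+4 */
        for (size_t i = 0; i < n; i++) {      /* S403 init, S404 test, S409 i+1 */
            for (size_t k = 0; k < n; k++) {  /* S405 init, S406 test, S408 k+1 */
                /* S407: models ONE vector instruction covering x = 0..3,
                   C[j+x][i] = C[j+x][i] + A[j+x][k] * B[k][i] */
                for (size_t x = 0; x < VLEN; x++) {
                    C[(j + x) * n + i] += A[(j + x) * n + k] * B[k * n + i];
                }
            }
        }
    }
}
```

Counting the x loop as one instruction, step S407 is reached (N/4)·N·N = N³/4 times while still performing N³ product-sum operations. Note that every lane of step S407 uses the same B[k][i], while A[j+x][k] and C[j+x][i] differ per lane.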
There has been proposed a processor which executes processing by supplying data of different elements of the same vector register to a plurality of vector arithmetic units capable of executing the same processing, thereby effectively using the vector arithmetic unit not in use to increase the number of elements processed per cycle, enabling an improvement in processing power (for example, refer to Patent Document 2).
[Patent Document 1] National Publication of Translated Version of International Patent Application No. 2008-519349
[Patent Document 2] Japanese Laid-open Patent Publication No. 10-312374
In a semiconductor integrated circuit, power consumption has become a problem because the number of integrated transistors keeps increasing while the power supply voltage does not decrease below about 1 V under the current process technology. In particular, moving data across a silicon chip consumes a large amount of power, and it has become important to reduce the number of inputs and outputs to and from a data storage unit such as a register and to dispose the data storage unit and a processing unit close to each other. The matrix product operation, which finds the product of two matrices, is one kind of processing that involves many data moves between the data storage unit where the data of the matrices are stored and the processing unit. In the arithmetic operation for one element, the matrix product operation does not reuse the same data. Therefore, the number of times data is input to the processing unit is not reduced even when, for example, a vector operation is used as the arithmetic operation for one element, and it is not possible to reduce the power consumption of the processor which executes the matrix product processing.