In recent years, artificial intelligence has become able to perform processing such as identification and prediction with high accuracy by means of machine learning. Within machine learning, a technique referred to as “deep learning”, which is a learning method using a neural network having a multilayer structure, is attracting attention. Learning by deep learning involves a large amount of matrix calculation.
A GPU (Graphics Processing Unit), which is used as an arithmetic processing device, is originally a processor for image processing. However, because the GPU includes a plurality of product-sum arithmetic units and is suitable for matrix calculation, it is frequently used as a processor that performs processing for machine learning. A GPU is also commonly used in the processing for performing deep learning.
Deep learning includes processing referred to as “convolutional neural network”, which is mainly used in image recognition. In the convolutional neural network, an operation referred to as “convolution” is frequently used. In the following descriptions, it is referred to as “convolution operation”. For example, when image recognition is performed, a weight frame having predetermined parameters as its respective elements is arranged in a region of an input image. By multiplying the respective elements of the input image in the region in which the weight frame is arranged by the corresponding elements of the weight frame and adding up the products, a feature amount of that region is calculated. The weight frame is arranged over the entire input image by using a predetermined shift width, and the collection of the calculated feature amounts becomes the output image that is output as the result of the convolution operation. The weight frame may be referred to as “filter”.
For example, an image having 8×8 elements, that is, an 8×8-pixel grayscale image, is considered here as an input image. In the following descriptions, the image is referred to as “8×8 input image”. A case where a filter having 4×4 elements is used and the filter is shifted by one column or by one row in the input image is described here. In the following descriptions, the filter is referred to as “4×4 filter”. In this case, if the filter arranged at one end of the input image in the row direction is shifted 8−4=4 times, the filter reaches the other end of the input image, so that the filter takes 8−4+1=5 positions. That is, the output image has five elements in the row direction. Similarly, if the filter arranged at one end of the input image in the column direction is shifted 8−4=4 times, the filter reaches the other end of the input image, again taking five positions. That is, the output image has five elements in the column direction. Therefore, the output image becomes a 5×5 image. Each element of the output image is the total value obtained by multiplying the respective elements of the filter by the elements of the input image at the corresponding positions and adding up the products, in a state in which the filter is arranged in the input image.
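The 8×8 input image and 4×4 filter case can be sketched in a few lines of Python. This is only an illustrative scalar reference, not an implementation taken from the text: the function name `convolve2d`, the `stride` parameter, and the sample pixel and weight values are all assumptions made for the example.

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` and return the map of summed products."""
    in_rows, in_cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    # Number of filter positions along each direction.
    out_rows = (in_rows - k_rows) // stride + 1
    out_cols = (in_cols - k_cols) // stride + 1
    output = []
    for r in range(out_rows):
        out_row = []
        for c in range(out_cols):
            total = 0
            # Multiply each filter element by the image element beneath it
            # and accumulate the products.
            for i in range(k_rows):
                for j in range(k_cols):
                    total += kernel[i][j] * image[r * stride + i][c * stride + j]
            out_row.append(total)
        output.append(out_row)
    return output


# Hypothetical data: consecutive integers as pixels, an all-ones 4x4 filter.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
kernel = [[1] * 4 for _ in range(4)]
result = convolve2d(image, kernel)
print(len(result), len(result[0]))  # 5 5
```

With a shift width of one, the filter fits in 8−4+1=5 positions along each direction, which is why the result is a 5×5 image.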
When carrying out the operation of adding multiplied values as described above, the arithmetic processing device frequently uses a command referred to as “fma (Fused Multiply Add)”. The fma is a command that performs a floating-point product-sum operation expressed in the form (A×B)+C.
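As a rough illustration of how a sum of products reduces to a chain of (A×B)+C steps, the following Python sketch accumulates one output value. The helper `fma` here is a hypothetical stand-in written for this example: it computes the same expression but rounds twice, whereas a hardware fused multiply-add rounds only once.

```python
def fma(a, b, c):
    """Stand-in for a fused multiply-add: returns (a * b) + c.

    Assumption for illustration only: plain Python rounds after the multiply
    and again after the add, unlike a true fused operation.
    """
    return a * b + c


def dot_with_fma(weights, pixels):
    """Accumulate sum(w * x) using one fma step per pair of elements."""
    acc = 0.0
    for w, x in zip(weights, pixels):
        acc = fma(w, x, acc)  # the previous total is fed back in as C
    return acc


# Hypothetical filter weights and pixel values.
print(dot_with_fma([0.5, 2.0, 4.0], [2.0, 3.0, 0.25]))  # 8.0
```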
Further, when such a convolution operation is performed, a method referred to as “SIMD (Single Instruction Multiple Data)” may be used, in which one command is simultaneously applied to a plurality of pieces of data and a plurality of operations are performed in parallel to obtain a plurality of outputs simultaneously. As an example, an operation using SIMD that processes four pieces of data in parallel is described. In the following descriptions, the SIMD that processes n pieces of data in parallel is referred to as “nSIMD”. That is, the arithmetic processing in this case can be referred to as “4SIMD”. In the following descriptions, the operation using the SIMD is referred to as “SIMD operation”.
In the case of the convolution operation using the 8×8 input image and the 4×4 filter described above, the arithmetic processing device can calculate four values at a time, each being the product of one element of the filter and the corresponding element of the input image, for four arrangements of the filter that are each shifted by one column from the previous one. That is, in the case of performing the 4SIMD operation, the arithmetic processing device can calculate in parallel the elements of the output image corresponding to four different arrangements of the filter.
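The 4SIMD case can be modelled in Python by treating a four-element list as one SIMD register. This is a sketch under stated assumptions, not the device's actual command sequence: `simd_fma`, `conv_four_outputs`, and the sample data are hypothetical names and values, and each call to `simd_fma` stands for one command applied to four lanes.

```python
LANES = 4  # 4SIMD: four pieces of data per command


def simd_fma(scalar, vec, acc):
    """Model one 4-lane product-sum: each lane computes scalar * vec[lane] + acc[lane]."""
    return [scalar * v + a for v, a in zip(vec, acc)]


def conv_four_outputs(image, kernel, row, col):
    """Compute four horizontally adjacent output elements starting at (row, col)."""
    acc = [0] * LANES
    for i, kernel_row in enumerate(kernel):
        for j, weight in enumerate(kernel_row):
            # Lane k holds the pixel that this filter element covers when the
            # filter is shifted k columns to the right.
            lanes = image[row + i][col + j : col + j + LANES]
            acc = simd_fma(weight, lanes, acc)
    return acc


# Hypothetical data: consecutive integers as pixels and as filter weights.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
kernel = [[i * 4 + j + 1 for j in range(4)] for i in range(4)]
print(conv_four_outputs(image, kernel, 0, 0))
```

In this model, 16 lane-wise product-sum steps yield four output elements, where a scalar loop would need 64 multiply-add steps for the same four elements.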
In the case of performing the arithmetic operation using the SIMD, the arithmetic processing device performs one operation after storing the data to be used for that operation, among the pieces of input image data stored in a memory, in registers used in the SIMD operation. By repeating this processing, the arithmetic processing device can perform the convolution operation. For example, in the case of the 4SIMD arithmetic processing, four registers are used for one SIMD operation. The set of registers used for one SIMD operation in this manner is collectively referred to as “one SIMD register”. At the time of storing the data in the SIMD register, the arithmetic processing device uses an SIMD load command to store the data in all the registers of the SIMD register at a time.
In the convolution operation, each element of the output image is obtained from the respective elements of the filter and the corresponding elements of the input image. Further, in the convolution operation using the SIMD, a value used in one of the parallel convolution operations is also used in the other convolution operations. Therefore, when performing the convolution operation using the SIMD, it is desirable to share the values stored in the respective registers of the SIMD register among the parallel convolution operations. However, the filter is shifted by predetermined columns and predetermined rows on the input image. Therefore, while the values stored in the respective registers of the SIMD register are shared with the other convolution operations, a value that is no longer needed is discarded and a new value is stored in the register. In the following descriptions, the processing in which a used value is deleted, the remaining values are shared for other convolution operations, and a new value is stored in the register is referred to as “rotate”.
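One simplified reading of “rotate” is a sliding window over a row of the input image, and it can be sketched as follows. The function name `rotate`, the list-based register model, and the sample row are assumptions made for this illustration only.

```python
def rotate(registers, new_value):
    """Discard the oldest value, keep the rest for reuse, and store a new value."""
    return registers[1:] + [new_value]


# Hypothetical row of an 8-element-wide input image.
row = [10, 11, 12, 13, 14, 15, 16, 17]

registers = row[:4]            # values under a 4-wide filter at the first position
windows = [registers]
for new_value in row[4:]:      # each one-column shift brings in one new value
    registers = rotate(registers, new_value)
    windows.append(registers)

print(len(windows))  # 5
print(windows[1])    # [11, 12, 13, 14]
```

Three of the four values survive each shift, so only one new element has to be brought in per filter position.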
SIMD commands include, for example, a shuffle command and a broadcast command. The shuffle command is a command to rearrange the data stored in the registers. The broadcast command is a command to copy data stored in one register and place the copy in the other registers. Conventionally, the shuffle command has been used as the SIMD command to realize rotate.
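The two commands can be modelled on a four-element list to make their effect concrete. The functions `shuffle` and `broadcast` below are hypothetical models written for this sketch, not the interface of any particular instruction set, and the lane overwrite at the end merely stands in for whatever additional data movement a real device would need.

```python
def shuffle(register, order):
    """Return a copy of `register` with lanes rearranged according to `order`."""
    return [register[i] for i in order]


def broadcast(register, lane):
    """Return a register in which every lane holds the value of `register[lane]`."""
    return [register[lane]] * len(register)


reg = [10, 11, 12, 13]
print(shuffle(reg, [1, 2, 3, 0]))  # [11, 12, 13, 10]
print(broadcast(reg, 2))           # [12, 12, 12, 12]

# Rotate built from shuffle: the shuffle only moves data inside the register,
# so bringing in a new value (14 here, hypothetical) takes a separate step.
rotated = shuffle(reg, [1, 2, 3, 0])
rotated[-1] = 14
print(rotated)                     # [11, 12, 13, 14]
```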
For example, as the technique related to the convolution operation, there is a conventional technique in which a multiplier is provided for each line, a shift register that stores weights of the respective lines is provided, multiplication is performed sequentially by shifting a value, and the multiplication results are added. There is another conventional technique in which a multiplier is provided corresponding to each line so as to share the multiplier between adjacent lines to perform a convolution operation. There is also a conventional technique in which a memory is divided into a region for storing line data and a region for storing weight data, and memory regions are circulated to perform an operation. There is also a conventional technique of performing an operation by delivering an output of a multiplier to another multiplier. There is also a conventional technique in which an SIMD register has a bank configuration, and data at an arbitrary position of an arbitrary register is set as data to be supplied to each arithmetic unit, thereby making rearrangement of data in the register unnecessary. There is another conventional technique in which, at the time of loading data into an SIMD register, the data is copied to a buffer register, and data at an arbitrary position of an arbitrary register is set as data to be supplied to respective arithmetic units, thereby making rearrangement of data in the register unnecessary.
However, an SIMD command accesses only the registers within a single SIMD register. That is, with an SIMD command, it is difficult to move data from a register in one SIMD register to a register in another SIMD register. Therefore, for example, when performing rotate, a command to retrieve data from one SIMD register and move the data to another SIMD register must be added to the shuffle command. Because realizing rotate with SIMD commands in this manner requires an additional command, the processing becomes redundant and the arithmetic processing speed decreases.
Further, the SIMD command is not taken into consideration in the conventional technique of sequentially performing multiplication by using a shift register that is provided for each line and stores the weight of that line, or in the conventional technique in which a multiplier is shared between adjacent lines. Likewise, the SIMD command is not taken into consideration in the conventional technique of performing an operation by dividing a memory into a region for storing line data and a region for storing weight data, or in the conventional technique of performing an operation by delivering an output of a multiplier to another multiplier. Therefore, even if these conventional techniques are used, it is difficult to improve the arithmetic processing speed.