A GPU (Graphics Processing Unit) used as an arithmetic processing device was originally a processor for image processing. However, the GPU is often used as a processor for machine learning, since the GPU includes a number of floating-point multiply-add arithmetic circuits, which will be described later, and is optimized for matrix calculations. In addition, the GPU is generally used for the large amount of matrix calculation involved in deep learning.
In many cases, deep learning uses a neural network to perform processing. For example, deep learning for image recognition includes two processes: a forward process for determining what a given image is, and a backward process for updating the parameters of the neural network used for that determination. An arithmetic processing device for deep learning uses the difference between a calculation result of the forward process and an expected value to perform the backward process and thereby update the parameters of the neural network. The arithmetic processing device then uses the updated parameters to improve the accuracy of the forward process.
The neural network is composed of plural layers, and in each layer an arithmetic processing such as feature amount extraction is performed and learning is repeated. In this way, the neural network has a multilayer structure in which a different arithmetic processing is performed in each layer. With such a multilayer structure, in order to update the parameters of each layer, the difference between a calculation result of the subsequent layer and an expected value is obtained, and learning is performed while propagating the difference to the preceding layer and propagating the result of the difference calculation in the preceding layer to the layer previous to the preceding layer. The terms “preceding” and “previous” used herein are based on the direction in which the forward process proceeds.
An arithmetic processing mainly used for image recognition in deep learning is a process called a convolutional neural network, in which an operation called convolution (hereinafter referred to as a “convolution operation”) is frequently used. For example, for image recognition, a weighting frame having predetermined parameters as its elements is disposed over a region of an input image. Then, the results of multiplying each element of the input image on which the weighting frame is disposed by the corresponding element of the weighting frame are added up to calculate the feature amount of the region of the input image over which the weighting frame is disposed. The integration of the feature amounts calculated by disposing the weighting frame over the entire input image, using a predetermined movement width of the weighting frame, becomes the output image that is output as the result of the convolution operation. The weighting frame is sometimes called a “filter”.
For example, an image having 8×8 elements, that is, an 8×8 grayscale image (hereinafter referred to as an 8×8 input image), is considered as an input image. Further, a case will be described where a filter having 4×4 elements (hereinafter referred to as a 4×4 filter) is used and is shifted column by column or row by row over the input image. In the following description, the direction in which a row extends is referred to as the “row direction” and the direction in which a column extends is referred to as the “column direction”. In this case, when the 4×4 filter arranged at one end of the 8×8 input image in the row direction is moved 4 (=8−4) times in the row direction, the 4×4 filter reaches the other end. That is, the output image has 5 (=8−4+1) elements in the row direction. Similarly, when the 4×4 filter arranged at one end in the column direction is moved 4 (=8−4) times in the column direction, it reaches the other end, so the output image has five elements in the column direction as well. Therefore, the output image becomes a 5×5 image. Each element of the output image is the total value obtained by multiplying each element of the filter arranged on the input image by the element of the input image at the corresponding position.
When performing an operation that sums such multiplied values, an arithmetic processing device often uses an instruction (command) called fma (Fused Multiply-Add). The fma is an instruction that performs a floating-point multiply-add operation represented by (A×B)+C.
Further, when performing such a convolution operation, a method called SIMD (Single Instruction Multiple Data) is used in which one instruction is executed to simultaneously perform an arithmetic processing on plural pieces of data and obtain plural operation result outputs at the same time. For example, an operation using SIMD for processing four pieces of data in parallel will be described. In the following description, the SIMD that processes n pieces of data in parallel is called nSIMD. That is, the arithmetic processing in this case may be said to be 4SIMD arithmetic processing. In the following description, the operation using SIMD is referred to as a SIMD operation.
In the case of the convolution operation using the 8×8 input image and the 4×4 filter described above, the arithmetic processing device may calculate four values at a time, each obtained by multiplying one element of the filter, in each of four disposition states in which the filter is shifted column by column within one row, by the corresponding element of the input image. That is, when performing the 4SIMD operation, the arithmetic processing device may calculate, in parallel, the elements of the output image corresponding to four different filter arrangements.
In the case of performing such an arithmetic processing using SIMD, the arithmetic processing device stores the data to be used for the arithmetic operation, out of the data of the input image stored in a memory serving as a storage device, in a register used for the SIMD operation, and then performs one arithmetic operation. By repeating this processing, the arithmetic processing device may perform the convolution operation. For example, in the case of a 4SIMD arithmetic processing, four registers are used for one SIMD arithmetic operation. When storing data in the registers for the SIMD operation, the arithmetic processing device uses a SIMD load instruction to store data in all the element registers of the SIMD register at a time.
Here, in the convolution operation, each element of the filter and a corresponding element of the input image are used to obtain one element of the output image. Further, in the convolution operation using SIMD, since the iterative operation is performed while shifting the range of the filter, the same data is used many times in the concurrent convolution operation.
In a convolution operation in the related art, the multiplication of elements and the summing of the multiplication results are performed collectively for each disposition state of one filter. Therefore, when a parallel calculation is performed in plural arithmetic circuits, as in the case of using SIMD, in order to improve the processing speed it is necessary either to adjust the calculation order so as to avoid using the same data at the same time, or to prepare copies of the same data and use them.
As a technique of the convolution operation, there is a conventional technique of a semiconductor integrated circuit in which the ranges of data lines accessible by adjacent arithmetic circuits overlap with each other. Further, as a technique for parallel processing of operations, there is a conventional technique of performing an arithmetic operation by using data expressions in which intermediate outputs of operation elements are multiplexed.
However, when adjusting the calculation order in order to avoid reading the same data, a multiplication or a division is used to determine the data to be used. The multiplication or the division is high in cost because it consumes a larger number of cycles than an addition or a subtraction. In addition, during a multiplication or division operation, the arithmetic circuit may not operate every cycle. For this reason, adjusting the calculation order may lower the processing speed of the arithmetic operation. In addition, when preparing copies of data in order to avoid reading the same data, the rearrangement of data that is not used at the same time may become complicated, and the amount of data to be copied may increase. For example, when a single movement distance of the filter is two or more rows and two or more columns, the above-described problem arises because the data to be read is split up among the arithmetic circuits. That is, in the case of using a processing method that performs arithmetic operations collectively for each filter arrangement state, a high calculation cost may be incurred to improve the processing speed.
In addition, even when different data is used, depending on a method of moving data to a register, the data may not be read from the register. For example, when two arithmetic circuits attempt to read data from the same register at the same timing, it may be difficult to read the data. Therefore, there is a possibility that the processing speed of the arithmetic operation may be lowered.
In particular, in the forward operation, when plural arithmetic circuits are used, it is difficult to input appropriate data in order to avoid conflicts between the arithmetic circuits, which makes it difficult to improve the arithmetic processing speed.
Further, even when a semiconductor integrated circuit in which the ranges of data lines accessible by adjacent arithmetic circuits overlap with each other is used, it is difficult to suppress the occurrence of conflicts between two or more arithmetic circuits. Furthermore, even with the conventional technique using data expressions in which intermediate outputs are multiplexed, there is a high possibility that conflicts in data input may occur, which makes it difficult to improve the arithmetic processing speed.
Related techniques are disclosed in, for example, Japanese Laid-Open Patent Publication Nos. 07-282237 and 2005-346470.