A graphics processing unit (GPU) used as an arithmetic processing device was originally a processor for image processing; however, because the GPU includes a large number of floating-point product-sum computing units, which will be described later, and is optimized for matrix calculation, the GPU is often used as a processor that performs processes for machine learning. Furthermore, the GPU is also commonly used for deep learning.
In deep learning, processing is usually performed by using neural networks. For example, in a case of deep learning in image recognition, there are two processes, i.e., a forward process of determining what the provided image shows and a backward process of updating the parameters of the neural networks. The arithmetic processing device that performs deep learning performs the backward process by using a difference between each of the calculation results obtained in the forward process and an expected value, and thereby updates the parameters of the neural networks. Then, the arithmetic processing device improves the accuracy of the forward process by using the updated parameters.
The neural networks are constituted by a plurality of layers and, in each of the layers, an arithmetic operation process of, for example, extracting feature values is performed and the learning is repeated. In this way, neural networks have a multilayer structure in which a different arithmetic operation process is performed in each of the layers. Because of this structure, in order to update the parameters for each layer, learning is performed by obtaining a difference between the calculation result obtained in the last layer and an expected value, by propagating the difference to the immediately previous layer, and by further propagating the difference calculated in that layer to the layer immediately previous to it. In the description here, "immediately previous" and "immediately subsequent" are defined with respect to the direction of the forward process.
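The layer-by-layer propagation of the difference can be illustrated with a minimal sketch. The sketch assumes a hypothetical network of two layers, each of which multiplies its input by a single parameter, and assumes a squared-error measure of the difference; the names forward, backward, and learning_rate are illustrative and do not appear in the description above.

```python
def forward(x, w1, w2):
    """Forward process: layer 1 followed by layer 2 (the last layer)."""
    h = w1 * x   # output of layer 1
    y = w2 * h   # output of layer 2
    return h, y


def backward(x, h, y, expected, w1, w2, learning_rate=0.1):
    """Backward process: propagate the difference from the last layer back."""
    diff_last = y - expected      # difference obtained in the last layer
    grad_w2 = diff_last * h       # update amount for the layer 2 parameter
    diff_prev = diff_last * w2    # difference propagated to layer 1
    grad_w1 = diff_prev * x       # update amount for the layer 1 parameter
    return w1 - learning_rate * grad_w1, w2 - learning_rate * grad_w2


x, expected = 1.0, 2.0
w1, w2 = 0.5, 0.5

h, y = forward(x, w1, w2)
error_before = abs(y - expected)

w1, w2 = backward(x, h, y, expected, w1, w2)
_, y_after = forward(x, w1, w2)
error_after = abs(y_after - expected)

print(error_before, error_after)
```

In this sketch, one backward process followed by a second forward process yields a result closer to the expected value, which corresponds to the improvement in accuracy described above.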
Furthermore, a convolutional neural network is the arithmetic operation process that is mainly used for image recognition in deep learning. In the convolutional neural network, the operation referred to as convolution is frequently used. In the description below, this operation is called a “convolution operation”. For example, when image recognition is performed, a weight frame whose elements are previously set parameters is arranged over an area of the input image. Then, by multiplying each of the elements of the input image covered by the weight frame by the associated element of the weight frame and summing the products, the feature value of the area in which the weight frame is arranged in the input image is calculated. The weight frame is arranged over the entire input image by moving the weight frame by a predetermined movement width, and the set of the calculated feature values corresponds to an output image that is output as the result of the convolution operation. The weight frame is sometimes referred to as a “filter”.
For example, consider, as an input image, a grayscale image having 8×8 elements. In the following, this image is referred to as an 8×8 input image. Furthermore, a description will be given of a case of using a filter that has 4×4 elements and in which the filter is shifted by one column or one row at a time in the input image. In the following, this filter is referred to as a 4×4 filter. Furthermore, in the following, the direction in which a row extends is referred to as “the row direction” and the direction in which a column extends is referred to as “the column direction”. In this case, if the 4×4 filter arranged at one of the corners of the 8×8 input image is moved one column at a time in the row direction, the 4×4 filter reaches the other corner after taking 5 (=8−4+1) positions. Namely, an output image has five elements in the row direction. Similarly, if the 4×4 filter arranged at one of the corners of the 8×8 input image is moved one row at a time in the column direction, the 4×4 filter reaches the other corner after taking 5 (=8−4+1) positions. Namely, the output image also has five elements in the column direction. Thus, the output image becomes a 5×5 image. Each of the elements in the output image corresponds to the sum of the products of each of the elements included in the filter, in the state in which the filter is arranged in the input image, and the element of the input image associated with that element of the filter.
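The 8×8 input image and 4×4 filter example can be sketched as follows. This is a minimal illustration and not an implementation taken from any of the cited technologies; the function name conv2d_valid, the stride parameter, and the element values of the image and the filter are assumptions introduced only for the sketch.

```python
def conv2d_valid(image, filt, stride=1):
    """Arrange the filter at every position of the image (no padding)."""
    n_rows, n_cols = len(image), len(image[0])
    f_rows, f_cols = len(filt), len(filt[0])
    # Number of filter positions: (input size - filter size) // stride + 1.
    out_rows = (n_rows - f_rows) // stride + 1
    out_cols = (n_cols - f_cols) // stride + 1
    output = [[0.0] * out_cols for _ in range(out_rows)]
    for r in range(out_rows):
        for c in range(out_cols):
            acc = 0.0
            for i in range(f_rows):
                for j in range(f_cols):
                    # Multiply each filter element by the associated
                    # input element and sum the products.
                    acc += image[r * stride + i][c * stride + j] * filt[i][j]
            output[r][c] = acc
    return output


# Assumed test data: element (r, c) of the input image holds 8*r + c,
# and every element of the filter is 1.
image = [[float(8 * r + c) for c in range(8)] for r in range(8)]
filt = [[1.0] * 4 for _ in range(4)]

out = conv2d_valid(image, filt)
print(len(out), len(out[0]))  # prints: 5 5
```

With a movement width of one column and one row, the sketch yields the 5×5 output image described above.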
When performing the operation of summing the products described above, the arithmetic processing device usually uses an instruction called fused multiply-add (FMA). The FMA is an instruction for a floating-point product-sum operation represented by the expression (A×B)+C.
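The way a sum of products is built up from repeated (A×B)+C operations can be sketched as follows. The function fma here is a hypothetical stand-in written as an ordinary multiplication followed by an addition, so it rounds twice, whereas a hardware FMA instruction rounds only once; the names fma and sum_of_products are illustrative.

```python
def fma(a, b, c):
    """Stand-in for an FMA instruction: returns (a * b) + c.

    Assumption: this is plain Python arithmetic, not a fused operation.
    """
    return a * b + c


def sum_of_products(inputs, weights):
    """Accumulate input * weight pairs with one fma call per pair."""
    acc = 0.0
    for value, weight in zip(inputs, weights):
        acc = fma(value, weight, acc)  # C is the running total
    return acc


print(sum_of_products([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # prints: 32.0
```

Each call consumes one element of the input image as A, one element of the filter as B, and the running total as C, so a 4×4 filter needs sixteen such operations per output element.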
Furthermore, when such a convolution operation is performed, in some cases, the single instruction multiple data (SIMD) method is used, in which a single instruction simultaneously performs arithmetic operation processes on a plurality of pieces of data and thereby simultaneously obtains a plurality of operation results. For example, a description will be given of a case of an operation that uses SIMD that processes four pieces of data in parallel. In the following, the SIMD that processes n pieces of data in parallel is referred to as n-way SIMD. Namely, the arithmetic operation process in this case can be referred to as a 4-way SIMD arithmetic operation process. Furthermore, in the following, the operation performed by using the SIMD is referred to as a SIMD operation.
In a case of the convolution operation performed by using the 8×8 input image and the 4×4 filter described above, the arithmetic processing device can calculate, at a time, four values, each of which is the product of one of the elements in the filter and the associated element in the input image, for four arrangement states of the filter that are shifted from one another by one column. Namely, when performing the 4-way SIMD operation, the arithmetic processing device can calculate, in parallel, the elements in the output image associated with four different arrangement states of the filter.
When performing the arithmetic operation process using the SIMD described above, the arithmetic processing device stores, in the registers that are used in the SIMD operation, the data to be used in the operation from among the pieces of data on the input image stored in a memory that functions as a storage device, and then performs a single operation. By repeating this process, the arithmetic processing device can perform the convolution operation. For example, in a case of the 4-way SIMD arithmetic operation process, the number of registers used for a single SIMD operation is four. When storing data in the registers for the SIMD operation, the arithmetic processing device stores, at a time, the data in all of the registers included in the SIMD registers by using a SIMD load instruction.
Here, in the convolution operation, when a single element in the output image is calculated, each of the elements in the filter and each of the associated elements in the input image are used. Furthermore, in the convolution operation performed by using the SIMD, because the operation is repeatedly performed by shifting the range of the filter, the same data is used many times in the convolution operation performed in parallel.
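How often the same data is used can be counted with a short sketch for the 8×8 input image and the 4×4 filter. The function name reuse_counts and its parameters are illustrative assumptions.

```python
def reuse_counts(size, filter_size, stride=1):
    """Count how many filter arrangement states cover each input element."""
    counts = [[0] * size for _ in range(size)]
    positions = range(0, size - filter_size + 1, stride)
    for top in positions:
        for left in positions:
            for i in range(filter_size):
                for j in range(filter_size):
                    counts[top + i][left + j] += 1
    return counts


counts = reuse_counts(8, 4)
print(counts[0][0], counts[3][3])  # prints: 1 16
```

In this example, an element at a corner of the input image is used once, whereas an element near the center is used in sixteen of the twenty-five arrangement states of the filter.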
Conventionally, in the convolution operation, the multiplication of each of the elements and the summing of the multiplication results are collectively performed for each arrangement state of a single filter. Thus, when calculation is performed in parallel by a plurality of computing units, such as in a case of using the SIMD, in order to improve the processing speed, a method of avoiding simultaneous use of the same data by adjusting the order of calculations or a method of allowing the data to be used simultaneously by preparing copies of the same data is used.
For example, as a technology related to the convolution operation, there is a conventional technology that provides a multiplier for each line, that provides shift registers that store therein the weight of each line, that sequentially performs multiplication by shifting a value, and that adds the multiplication results. Furthermore, there is a conventional technology that provides a multiplier by being associated with each line such that the adjacent lines commonly use the multiplier and that performs the convolution operation. Furthermore, there is a conventional technology that divides line data in a memory into an area used for storing the data and an area used for storing weight data and that performs an operation by circulating the memory area. Furthermore, there is a conventional technology that performs an operation by passing an output of a multiplier to another multiplier. Furthermore, there is a conventional technology that eliminates multipliers and adders by simplifying arithmetic expressions.
Patent Document 1: Japanese Laid-open Patent Publication No. 2010-134697
Patent Document 2: Japanese Laid-open Patent Publication No. 2015-210709
Patent Document 3: Japanese Laid-open Patent Publication No. 2008-310700
Patent Document 4: Japanese Laid-open Patent Publication No. 2012-205298
Patent Document 5: Japanese Laid-open Patent Publication No. 2001-67338
However, when the order of calculations is adjusted in order to avoid reading of the same data, multiplications or divisions are used to decide the data to be used. Because multiplications and divisions consume a greater number of cycles than additions or subtractions, the calculation cost is high. Furthermore, while the multiplications or divisions are being performed, there may be a case in which the computing units are not able to operate in every cycle. Consequently, adjustment of the calculation order may decrease the processing speed of the operation. Furthermore, when copies of data are prepared in order to avoid reading of the same data, the arrangement of the pieces of data such that the same data is not used at the same time may become complicated, or the number of pieces of data to be copied may increase. For example, if the moving distance of the filter at a time is equal to or greater than two columns and two rows, the data to be read varies for each of the computing units; therefore, the problem described above occurs. Namely, when a processing method of collectively performing the operation for each arrangement state of a single filter is used, the calculation cost incurred in order to improve the processing speed may become high.
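The effect of the moving distance of the filter on the data read by each computing unit can be sketched as follows. The sketch assumes that computing unit k handles the arrangement state of the filter shifted k times, so that the column it reads is found by a multiplication; the names lane_columns and is_contiguous are illustrative.

```python
def lane_columns(filter_col, stride, lanes):
    """Input column read by each computing unit for one filter column.

    Assumption: computing unit k handles the filter position shifted k
    times, so its column index needs the multiplication k * stride.
    """
    return [lane * stride + filter_col for lane in range(lanes)]


def is_contiguous(columns):
    """True if the columns could be fetched as one contiguous run."""
    return all(b - a == 1 for a, b in zip(columns, columns[1:]))


one_column = lane_columns(0, 1, 4)   # moving distance of one column
two_columns = lane_columns(0, 2, 3)  # moving distance of two columns

print(one_column, is_contiguous(one_column))    # prints: [0, 1, 2, 3] True
print(two_columns, is_contiguous(two_columns))  # prints: [0, 2, 4] False
```

With a moving distance of one column, the computing units read adjacent elements, whereas with a moving distance of two columns the elements are spread apart, which is one way to see why the data to be read varies for each of the computing units.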
Furthermore, even in a case of using different data, depending on the method of moving data to the registers, there may be a state in which data is not able to be read from the registers. For example, if two computing units attempt to read data from the same register at the same time, it may be difficult to read the data. Thus, the processing speed of the operation may be decreased.
Furthermore, in the backward process, because the size of the input data is small and the number of pieces of the output data is large, the number of operations performed by using the same data is large. Thus, the process could be performed efficiently by using a large number of computing units; however, if the operation is performed by a conventional method while simply increasing the number of computing units, it is difficult to efficiently supply data to the large number of computing units.