Field of the Invention
The present invention relates to a convolution operation apparatus.
Description of the Related Art
Apparatuses that apply neural network techniques have been widely proposed as pattern recognition apparatuses. Among neural networks in particular, an operation processing method called convolutional neural networks (to be referred to as CNNs hereinafter) is known as a method that enables pattern recognition robust against variations of a recognition target. As an example in which such a method is applied, Japanese Patent Laid-Open No. 10-021406 proposes a technique of performing face recognition using image data.
An example of a CNN operation will now be described.
FIG. 23 is a block diagram showing an example in which a CNN operation for image data is implemented by neural networks.
Since FIG. 23 shows a case in which a CNN operation is performed for image data, an input layer 2301 is raster-scanned image data of a predetermined size. Feature planes 2303a to 2303c indicate feature planes in a first layer 2308. The feature plane is a data plane indicating the detection result of a predetermined feature extraction filter (convolution operation and non-linear processing). If, for example, a face is detected, the feature plane is a data plane indicating a detection result of eyes, a mouth, a nose, or the like. Since this data plane indicates the detection result of feature extraction for the image data obtained by a raster scan, the detection result is also expressed by a plane. Each of the feature planes 2303a to 2303c is generated by performing a convolution operation and non-linear processing for the input layer 2301. For example, the feature plane 2303a is obtained by performing a convolution operation schematically represented by a kernel 2311a, and performing non-linear transformation for the operation result. Note that kernels 2311b and 2311c of the filter shown in FIG. 23 are used to generate the feature planes 2303b and 2303c, respectively. Feature planes 2305a and 2305b indicate feature planes in a second layer 2309, and a feature plane 2307 indicates a feature plane in a third layer 2310.
FIG. 24 is a view showing an example of a kernel 2442 of a convolution filter.
Referring to FIG. 24, a data string 2441 is a data string indicating the reference pixels of image data obtained by a raster scan, and the kernel 2442 of the filter indicates an example of a kernel for the reference pixels. This example corresponds to execution of an FIR (Finite Impulse Response) filter operation using the kernel having a size of 5×5. The FIR filter operation is processed by a product-sum calculation given by:
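$$\mathrm{output}(x, y) = \sum_{\mathrm{row}=0}^{\mathrm{rowSize}-1} \; \sum_{\mathrm{column}=0}^{\mathrm{columnSize}-1} \bigl( \mathrm{input}(x + \mathrm{column},\; y + \mathrm{row}) \times \mathrm{weight}(\mathrm{column}, \mathrm{row}) \bigr) \qquad (1)$$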
where “input(x, y)” represents a reference pixel value at coordinates (x, y), “output(x, y)” represents an FIR filter operation result at the coordinates (x, y), “weight(column, row)” represents the FIR filter coefficient applied to the reference pixel at coordinates (x+column, y+row), and “columnSize” and “rowSize” represent the horizontal and vertical sizes of the kernel, each of which is 5 in the example of FIG. 24.
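The product-sum calculation of equation (1) can be illustrated by a minimal Python sketch. The function name `fir_filter`, the nested-list data layout (`input_plane[y][x]` for input(x, y) and `weight[row][column]` for weight(column, row)), and the choice to compute only those output positions at which the kernel lies entirely inside the input are assumptions made for illustration; the text above does not specify how plane boundaries are handled.

```python
def fir_filter(input_plane, weight):
    """Evaluate equation (1) at every position where the kernel fits.

    input_plane[y][x] holds input(x, y); weight[row][column] holds
    weight(column, row). Boundary handling is an assumption: no padding.
    """
    row_size = len(weight)
    column_size = len(weight[0])
    out_height = len(input_plane) - row_size + 1
    out_width = len(input_plane[0]) - column_size + 1

    output = [[0] * out_width for _ in range(out_height)]
    for y in range(out_height):
        for x in range(out_width):
            acc = 0
            # Double sum of equation (1): rows outside, columns inside.
            for row in range(row_size):
                for column in range(column_size):
                    acc += input_plane[y + row][x + column] * weight[row][column]
            output[y][x] = acc
    return output


# Small example with a 2x2 kernel on a 3x3 plane.
plane = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
print(fir_filter(plane, kernel))  # [[6, 8], [12, 14]]
```

Each output value requires rowSize × columnSize multiplications, which is 25 for the 5×5 kernel of FIG. 24.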
When calculating the feature plane 2303a shown in FIG. 23, the data string 2441 corresponds to the input layer 2301, and the kernel 2442 corresponds to the kernel 2311a. In a CNN operation, a product-sum calculation is repeated while scanning the kernels of a plurality of filters for respective pixels, thereby obtaining a final product-sum result. A feature plane is then generated by further performing non-linear transformation for the product-sum result. Note that when calculating the feature plane 2303a, the number of connections to the previous layer is “1”, and thus the number of kernels is 1.
An operation of generating the feature plane 2305a in the second layer 2309 will be described next.
The feature plane 2305a is connected to the three feature planes 2303a to 2303c in the previous first layer 2308. Therefore, when calculating data of the feature plane 2305a, a filter operation is performed for the feature plane 2303a using a kernel schematically indicated by a kernel 2312a, and the result is held in an accumulator. Similarly, a filter operation using a kernel 2313a and a filter operation using a kernel 2314a are performed for the feature planes 2303b and 2303c, respectively, and the results are accumulated in the accumulator. After the end of these three kinds of filter operations, non-linear transformation processing is performed using a logistic function or a hyperbolic tangent function (tanh function). By performing the above processing for the entire image while scanning pixels one by one, the feature plane 2305a is generated.
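The accumulation described above can be sketched as follows, under stated assumptions. The names `filter_at` and `feature_plane`, the nested-list layout, the use of the hyperbolic tangent as the non-linear transformation, and the restriction to positions at which every kernel fits inside its plane are illustrative choices, not details taken from the cited art.

```python
import math


def filter_at(plane, kernel, x, y):
    """Product-sum of one kernel against one plane at position (x, y)."""
    total = 0.0
    for row in range(len(kernel)):
        for column in range(len(kernel[0])):
            total += plane[y + row][x + column] * kernel[row][column]
    return total


def feature_plane(prev_planes, kernels):
    """Generate one feature plane from several connected previous planes.

    prev_planes[i] is filtered with kernels[i]; the per-plane results are
    summed in an accumulator, and the non-linear transformation (tanh is
    assumed here) is applied once per pixel, after all accumulations.
    """
    row_size = len(kernels[0])
    column_size = len(kernels[0][0])
    out_height = len(prev_planes[0]) - row_size + 1
    out_width = len(prev_planes[0][0]) - column_size + 1

    output = [[0.0] * out_width for _ in range(out_height)]
    for y in range(out_height):
        for x in range(out_width):
            accumulator = 0.0
            for plane, kernel in zip(prev_planes, kernels):
                accumulator += filter_at(plane, kernel, x, y)
            output[y][x] = math.tanh(accumulator)
    return output
```

For the feature plane 2305a, `prev_planes` would hold the three feature planes 2303a to 2303c and `kernels` would hold the kernels 2312a, 2313a, and 2314a, in the same order.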
Similarly, when generating the feature plane 2305b, three convolution filter operations are performed using kernels 2312b, 2313b, and 2314b for the feature planes 2303a to 2303c in the previous layer 2308. When generating the feature plane 2307 in the third layer 2310, two convolution filter operations are performed using kernels 2315 and 2316 for the feature planes 2305a and 2305b in the previous layer 2309. Note that respective filter coefficients have been determined in advance by learning using a general method such as back propagation learning or deep learning. In, for example, detection or recognition of an object, a kernel of a size of 10×10 or larger is often used.
In CNN operation processing, a large number of filters with large kernel sizes are used hierarchically, thereby requiring an enormous number of convolution operations. As a method of coping with this problem, for example, a technique of reducing the number of times of execution of a product-sum calculation in a convolution operation by decomposing filter coefficients into one-dimensional base filter coefficients is proposed in Denton et al., “Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation”, CoRR, 2014 (hereinafter referred to as Denton).
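Why a decomposition into one-dimensional filters reduces the number of product-sum calculations can be illustrated with the simplest case, in which a kernel is exactly the outer product of a vertical and a horizontal one-dimensional filter. The sketch below is not the method of Denton; the names `filter_2d` and `filter_separable`, the nested-list layout, and the absence of padding are assumptions made for illustration.

```python
def filter_2d(plane, kernel):
    """Direct two-dimensional product-sum (no padding)."""
    rows, cols = len(kernel), len(kernel[0])
    out_h = len(plane) - rows + 1
    out_w = len(plane[0]) - cols + 1
    return [[sum(plane[y + r][x + c] * kernel[r][c]
                 for r in range(rows) for c in range(cols))
             for x in range(out_w)]
            for y in range(out_h)]


def filter_separable(plane, col_vec, row_vec):
    """Same result when kernel[r][c] == col_vec[r] * row_vec[c].

    A horizontal pass with row_vec is followed by a vertical pass with
    col_vec, so each output costs about len(row_vec) + len(col_vec)
    multiplications instead of len(row_vec) * len(col_vec).
    """
    out_w = len(plane[0]) - len(row_vec) + 1
    out_h = len(plane) - len(col_vec) + 1
    # Horizontal pass over every row of the input plane.
    tmp = [[sum(line[x + c] * row_vec[c] for c in range(len(row_vec)))
            for x in range(out_w)]
           for line in plane]
    # Vertical pass over the intermediate result.
    return [[sum(tmp[y + r][x] * col_vec[r] for r in range(len(col_vec)))
             for x in range(out_w)]
            for y in range(out_h)]


plane = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
col_vec, row_vec = [1, 2], [3, 4]
kernel = [[v * h for h in row_vec] for v in col_vec]  # outer product
assert filter_separable(plane, col_vec, row_vec) == filter_2d(plane, kernel)
```

For a 5×5 kernel of this form, the two one-dimensional passes need about 10 multiplications per output value instead of 25. Kernels obtained by learning are generally not exact outer products, so a practical decomposition involves approximation.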
On the other hand, if a convolution operation in CNNs is implemented as software operating on a processor, the number of convolution operations is enormous, as described above. Thus, a desired operation speed may be difficult to achieve. As a method of coping with this problem, for example, U.S. Patent Application Publication No. 2012/0303932 proposes a technique of implementing a CNN operation by digital hardware.