Field of the Disclosure
The present disclosure relates to a data processing apparatus used for pattern recognition, a method for controlling the data processing apparatus, and a storage medium storing a program.
Description of the Related Art
Neural network techniques have been widely applied to image processing apparatuses such as pattern recognition apparatuses. Among neural networks, operation techniques called Convolutional Neural Networks (hereinafter referred to as CNN) are attracting attention as techniques for achieving pattern recognition that is robust against variations of a recognition target. For example, Yann LeCun, Koray Kavukcuoglu and Clement Farabet: Convolutional Networks and Applications in Vision, Proc. International Symposium on Circuits and Systems (ISCAS'10), IEEE, 2010 discloses diverse examples of applications and implementations of CNN. For CNN processing, diverse network configurations have been proposed according to the signals to be recognized and the recognition functions to be implemented. A configuration of a convolutional neural network indicates the number of layers, the number of feature planes in each layer, and other configurations which can be represented by combination relations between convolution operations.
FIG. 16 illustrates a network configuration representing an example of simple CNN processing. When performing CNN processing on image data, an input layer 1601 is equivalent to raster-scanned image data with a predetermined size. Feature planes 1603a to 1603c indicate feature planes of a first layer 1608. A feature plane refers to a data plane equivalent to a processing result of a predetermined feature extraction operation (convolution operation and nonlinear processing). A feature plane is equivalent to a feature extraction result for recognizing a predetermined target in a higher layer. Since a feature plane is a processing result for raster-scanned image data, the processing result is also represented as a plane.
The feature planes 1603a to 1603c are generated through convolution operations and nonlinear processing performed on the input layer 1601. For example, the feature plane 1603a is calculated through a convolution operation using a (schematically illustrated) two-dimensional filter kernel 16021a and a nonlinear conversion of the operation result.
For example, the convolution operation with a filter kernel (filter coefficient matrix) size of columnSize×rowSize is processed through the following product-sum operation.
$$\mathrm{output}(x, y) = \sum_{\mathrm{row}=0}^{\mathrm{rowSize}} \sum_{\mathrm{column}=0}^{\mathrm{columnSize}} \mathrm{input}(x + \mathrm{column},\ y + \mathrm{row}) \times \mathrm{weight}(\mathrm{column}, \mathrm{row}) \tag{1}$$
where “input (x, y)” denotes the reference pixel value at the coordinates (x, y), “output (x, y)” denotes the operation result at the coordinates (x, y), “weight (column, row)” denotes the weight coefficient at the coordinates (x+column, y+row), and “columnSize” and “rowSize” denote the convolution kernel sizes.
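The product-sum operation of Equation (1) can be illustrated with a minimal Python sketch. The function name `convolve`, the list-of-rows data layout, and the sample values are assumptions made for illustration only and do not appear in the related art. The sketch further assumes that the summation indices run from 0 to rowSize−1 and from 0 to columnSize−1, so that the kernel holds exactly columnSize×rowSize coefficients, and that output positions where the kernel would extend beyond the input are skipped.

```python
def convolve(input_plane, weight):
    """Apply the product-sum of Equation (1) at every position where the kernel fits.

    input_plane[y][x] and weight[row][column] are lists of rows (assumed layout).
    """
    row_size = len(weight)
    column_size = len(weight[0])
    out_height = len(input_plane) - row_size + 1
    out_width = len(input_plane[0]) - column_size + 1
    output = [[0.0] * out_width for _ in range(out_height)]
    for y in range(out_height):
        for x in range(out_width):
            acc = 0.0
            # Inner double loop: one term of the double summation per iteration.
            for row in range(row_size):
                for column in range(column_size):
                    acc += input_plane[y + row][x + column] * weight[row][column]
            output[y][x] = acc
    return output


# Hypothetical 3x3 input and 2x2 kernel, chosen only to keep the example small.
image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
kernel = [
    [1, 0],
    [0, 1],
]
result = convolve(image, kernel)
print(result)
```

Each output value costs columnSize×rowSize multiplications and additions, which is why the total amount of computation grows quickly with the kernel size.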
In CNN processing, the product-sum operation is repeated while a plurality of filter kernels is scanned pixel by pixel, and a nonlinear conversion is performed on the final product-sum result to calculate a feature plane. Since the feature plane 1603a is calculated based on one piece of image data in the preceding layer, the number of combinations is one, and a single kernel 16021a is used to calculate the feature plane 1603a. Filter kernels 16021b and 16021c are used to calculate the feature planes 1603b and 1603c, respectively. A filter kernel may be abbreviated to a filter or a kernel.
FIG. 17 illustrates an example for calculating a feature plane 1605a in the CNN processing. The feature plane 1605a is calculated from the three feature planes 1603a to 1603c of the preceding layer 1608 and is combined with the feature planes 1603a to 1603c. When calculating data of the feature plane 1605a, firstly, a filter operation using a (schematically illustrated) kernel 16041a is performed on the feature plane 1603a, and the result is held in a cumulative adder 1701. Likewise, convolution operations using the kernels 16042a and 16043a are performed on the feature planes 1603b and 1603c, respectively, and the results are cumulatively added in the cumulative adder 1701. Upon completion of the convolution operations using the three different kernels, nonlinear conversion processing 1702 based on a logistic function or a hyperbolic tangent function (tanh function) is performed.
Performing the above-described processing on the entire image data while scanning each pixel enables calculating the feature plane 1605a. Likewise, a feature plane 1605b is calculated by using three different convolution operations indicated by kernels 16041b, 16042b, and 16043b on the three different feature planes 1603a to 1603c of the preceding layer 1608, respectively. Further, a feature plane 1607 is calculated by using two different convolution operations indicated by kernels 16061 and 16062 on the two different feature planes 1605a and 1605b of a preceding layer 1609, respectively.
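The calculation described with reference to FIG. 17 can be sketched as follows. This is only an illustrative sketch: the function names `filter_at` and `feature_plane`, the list-of-rows layout, and the sample planes and kernels are hypothetical, and the local variable `accumulator` merely plays the role attributed to the cumulative adder 1701. The hyperbolic tangent function is used for the nonlinear conversion, which is one of the two options named above.

```python
import math


def filter_at(plane, kernel, x, y):
    """Product-sum of one kernel over one preceding feature plane at position (x, y)."""
    total = 0.0
    for row, kernel_row in enumerate(kernel):
        for column, coefficient in enumerate(kernel_row):
            total += plane[y + row][x + column] * coefficient
    return total


def feature_plane(prev_planes, kernels, activation=math.tanh):
    """Accumulate one filter result per preceding plane, then apply the nonlinearity."""
    row_size = len(kernels[0])
    column_size = len(kernels[0][0])
    out_height = len(prev_planes[0]) - row_size + 1
    out_width = len(prev_planes[0][0]) - column_size + 1
    output = []
    for y in range(out_height):
        out_row = []
        for x in range(out_width):
            accumulator = 0.0  # stands in for the cumulative adder 1701
            for plane, kernel in zip(prev_planes, kernels):
                accumulator += filter_at(plane, kernel, x, y)
            # Nonlinear conversion is applied once, after all kernels are accumulated.
            out_row.append(activation(accumulator))
        output.append(out_row)
    return output


# Three hypothetical 2x2 preceding planes and one 2x2 kernel per plane.
planes = [
    [[1, 0], [0, 1]],
    [[0, 1], [1, 0]],
    [[1, 1], [1, 1]],
]
kernels = [
    [[0.5, 0.0], [0.0, 0.5]],
    [[0.25, 0.0], [0.0, 0.25]],
    [[-0.25, 0.0], [0.0, -0.25]],
]
plane_out = feature_plane(planes, kernels)
print(plane_out)
```

In this sample the three filter results are 1.0, 0.0, and −0.5, so the accumulated value passed to the nonlinear conversion is 0.5.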
It is assumed that each kernel coefficient is predetermined through learning by using a general learning method such as perceptron learning or back propagation learning. For example, in object detection and pattern recognition, a convolution kernel with a size of 10×10 or larger may be used.
In this way, CNN processing requires a huge number of product-sum operations since the CNN processing hierarchically uses a large number of convolution operations with large kernel sizes. To support various recognition tasks by using common hardware, there is a demand for efficiently processing diverse networks with a high degree of parallelism.
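The scale of the computation can be estimated with a short calculation. The sketch below counts product-sum operations for the three-layer network of FIG. 16 (1 to 3, 3 to 2, and 2 to 1 feature planes). The feature plane size of 100×100 is a hypothetical value chosen for illustration, the 10×10 kernel size is taken from the example given above, and all feature planes are assumed to have the same size. The function name `product_sum_count` is likewise an assumption.

```python
def product_sum_count(out_width, out_height, kernel_size, num_inputs, num_outputs):
    """Product-sum operations for one layer: one kernel per input/output plane pair."""
    per_pixel = kernel_size * kernel_size * num_inputs
    return out_width * out_height * per_pixel * num_outputs


# (input planes, output planes) per layer, following the topology of FIG. 16.
layers = [(1, 3), (3, 2), (2, 1)]

# Assumed sizes: 100x100 feature planes (hypothetical) and 10x10 kernels.
per_layer = [product_sum_count(100, 100, 10, n_in, n_out) for n_in, n_out in layers]
total = sum(per_layer)
print(per_layer, total)
```

Under these assumptions even this simple network needs 11 million product-sum operations per input image, and practical networks with more layers and feature planes need far more.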
Japanese Patent Application Laid-Open No. 2010-134697 discusses an apparatus which achieves high-speed processing by using a plurality of product-sum operators to perform, in parallel, convolution operations corresponding to a plurality of receptive field positions (pixel positions of calculation target feature planes). US Patent Application Publication No. 2012/0303932 discusses a CNN processing apparatus configured to assign operators to convolution kernels.
Although the apparatus discussed in Japanese Patent Application Laid-Open No. 2010-134697 processes a plurality of receptive fields in parallel focusing on one calculation target feature plane, the apparatus may be unable to perform parallel processing efficiently depending on the kernel sizes and processing target areas. For example, with small kernel sizes, the time required to supply input data to the product-sum operators becomes a bottleneck, possibly degrading parallelization efficiency.