Signal processing apparatuses using neural networks are widely used in applications such as pattern identification systems, prediction systems, and control systems. In general, a neural network is implemented as software which runs on a microprocessor, and is provided as application software for a personal computer, workstation, or the like.
FIG. 2 is a schematic block diagram showing an example of the arrangement of an image processing apparatus using a general hierarchically coupled neural network. Reference numeral 21 denotes data of a detection target, for example, raster-scanned image data. Reference numeral 22 denotes a calculation unit which detects a predetermined object from the image data 21, and comprises a neural network of three layers in the illustrated example. Reference numeral 23 denotes an output data plane corresponding to the calculation result. The calculation unit 22 executes processing while scanning and referring to a predetermined image area 24 in the image data 21, thereby detecting a detection target which exists in the image. The output data plane 23 is a data plane having the same size as the image data 21 as the detection target. The output data plane 23 stores, in the scan order, detection outputs of the calculation unit 22 which processes all the areas of the image data 21 while scanning them. Since the calculation unit 22 outputs a large value at a position where a target is detected, the position of the target in the image plane can be recognized by scanning the output data plane 23.
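The scanning arrangement of FIG. 2 can be sketched as follows. This is only an illustrative sketch: the names scan_image, find_peak, and detector are hypothetical, the detector stands in for the calculation unit 22, and storing each result at the top-left coordinates of the image area is an assumption not stated in the text.

```python
def scan_image(image, window_h, window_w, detector):
    """Scan a window over the image and store each detection output.

    The output plane has the same size as the image. Positions where the
    window does not fit are left at 0.0 (an assumption of this sketch).
    """
    height, width = len(image), len(image[0])
    output = [[0.0] * width for _ in range(height)]
    for y in range(height - window_h + 1):
        for x in range(width - window_w + 1):
            # Extract the image area referred to at this scan position.
            area = [row[x:x + window_w] for row in image[y:y + window_h]]
            output[y][x] = detector(area)
    return output


def find_peak(output):
    """Return (x, y) of the largest value in the output plane."""
    value, y, x = max(
        (v, y, x) for y, row in enumerate(output) for x, v in enumerate(row)
    )
    return x, y


# Hypothetical detector: responds to bright areas by summing the pixels.
def bright_area(area):
    return float(sum(sum(row) for row in area))


image = [
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
plane = scan_image(image, 2, 2, bright_area)
peak = find_peak(plane)
```

Here the largest output appears where the 2×2 window exactly covers the bright block, so scanning the output plane recovers the target position.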
In the calculation unit 22, reference numerals 25, 26, and 27 denote layers of the neural network, and a predetermined number of neurons 28 exist in each layer. The first layer 25 has the same number of neurons (nodes) 28 as the number of pixels of a reference image. Respective neurons are feedforward-coupled via predetermined weighted coefficients.
FIG. 3 is a block diagram showing an example of the arrangement of one neuron 28. Reference symbols in_1 to in_n denote input values, which are output values of the previous layer neurons in the second and subsequent layers. An accumulation adder 32 accumulates products of the input values and coefficients w_1 to w_n obtained by learning. A non-linear conversion processing unit 33 non-linearly converts the accumulated sum from the accumulation adder 32 using a logistic function, hyperbolic tangent function (tanh function), or the like, and outputs the conversion result as a detection result “out”. Assume that the weighted coefficients w_1 to w_n required for respective neurons in the hierarchical neural network are determined for respective targets to be detected using a generally known learning algorithm such as back propagation.
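The neuron of FIG. 3 can be sketched in a few lines. This is a minimal sketch, not the circuit itself: the function names neuron_output and logistic are hypothetical, and the weights in the usage lines are made-up values rather than learned coefficients.

```python
import math


def logistic(s):
    """Logistic function, one of the non-linear conversions named in the text."""
    return 1.0 / (1.0 + math.exp(-s))


def neuron_output(inputs, weights, activation=math.tanh):
    """Accumulate the products in_i * w_i, then non-linearly convert the sum."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    accumulated = 0.0
    for value, weight in zip(inputs, weights):
        accumulated += value * weight  # role of the accumulation adder 32
    return activation(accumulated)     # role of the conversion unit 33


# Illustrative values only; real coefficients come from learning.
out_tanh = neuron_output([1.0, 2.0], [0.5, -0.25])
out_logistic = neuron_output([0.0], [1.0], activation=logistic)
```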
For the purpose of high-performance and low-cost implementation of such a hierarchically coupled neural network in an embedded device or the like, implementation methods using analog hardware or digital hardware have been proposed.
Japanese Patent No. 2679730 (patent reference 1) discloses an architecture of a hierarchical structure neural network which implements a multilayered structure by using single-layer analog neural network hardware in a time-division multiplexed manner. Also, Japanese Patent Laid-Open No. 3-55658 (patent reference 2) discloses an implementation method using digital hardware.
On the other hand, a calculation method using a neural network called a Convolutional Neural Network is known as a method that allows pattern recognition robust against variations of an identification target. The Convolutional Neural Network will be abbreviated as “CNN” hereinafter. For example, Japanese Patent Laid-Open No. 10-021406 (patent reference 3) and Japanese Patent Laid-Open No. 2002-358500 (patent reference 4) have proposed examples in which CNN calculations are applied to target identification or detection in an image.
FIG. 4 is a block diagram showing the logical network configuration of an example of a simple CNN. FIG. 4 shows an example of a three-layer CNN in which the number of features of a first layer 406 is 3, that of a second layer 410 is 2, and that of a third layer 411 is 1. Reference numeral 401 denotes image data, which corresponds to raster-scanned image data. Reference numerals 403a to 403c denote feature planes of the first layer 406. A feature plane is an image data plane indicating the results obtained by calculations while scanning data of the previous layer using a predetermined feature extraction filter (the accumulated sum of convolution calculations and non-linear processing). The feature plane is expressed as a plane since it is defined by the detection results for the raster-scanned image data. The feature planes 403a to 403c are generated from the image data 401 by corresponding feature extraction filters. For example, the feature planes 403a to 403c are generated by two-dimensional convolution filter calculations typically corresponding to convolution kernels 404a to 404c, and the non-linear conversion of the calculation results. Note that reference numeral 402 denotes a reference image area required for the convolution calculations.
For example, a convolution filter calculation having a kernel size (the length in the horizontal direction and the height in the vertical direction) of 11×11 processes data by a product-sum calculation given by:
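$$\mathrm{output}(x, y) = \sum_{\mathrm{row}=-\mathrm{rowSize}/2}^{\mathrm{rowSize}/2} \; \sum_{\mathrm{column}=-\mathrm{columnSize}/2}^{\mathrm{columnSize}/2} \mathrm{input}(x+\mathrm{column},\, y+\mathrm{row}) \times \mathrm{weight}(\mathrm{column}, \mathrm{row}) \qquad (1)$$
where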
input(x, y): a reference pixel value at coordinates (x, y)
output(x, y): a calculation result at coordinates (x, y)
weight(column, row): a weighted coefficient applied to the reference pixel at coordinates (x+column, y+row)
columnSize=11, rowSize=11: a filter kernel size (the number of filter taps).
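Equation (1) for a single output pixel can be sketched as below. The sketch assumes that rowSize/2 and columnSize/2 denote integer (floor) division, so that an 11-tap kernel runs from -5 to +5, and that the caller only requests pixels whose reference area lies inside the image. The function name convolve_at and the list-of-lists layout (image[y][x], weights stored with an index offset) are hypothetical choices for illustration.

```python
def convolve_at(image, weights, x, y):
    """Product-sum of equation (1) for the output pixel at (x, y).

    image[y][x] holds the reference pixel values. weights is a list of rows
    with odd rowSize and columnSize; weight(column, row) is stored at
    weights[row + rowSize // 2][column + columnSize // 2].
    """
    half_rows = len(weights) // 2
    half_columns = len(weights[0]) // 2
    height, width = len(image), len(image[0])
    # Reject positions whose reference area would leave the image, since
    # negative list indices would otherwise wrap around silently.
    if not (half_columns <= x < width - half_columns
            and half_rows <= y < height - half_rows):
        raise IndexError("reference area falls outside the image")
    total = 0.0
    for row in range(-half_rows, half_rows + 1):
        for column in range(-half_columns, half_columns + 1):
            total += (image[y + row][x + column]
                      * weights[row + half_rows][column + half_columns])
    return total


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
all_ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
center_only = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
sum_result = convolve_at(image, all_ones, 1, 1)
center_result = convolve_at(image, center_only, 1, 1)
```

With columnSize=11 and rowSize=11, each output pixel requires 121 multiply-accumulate operations per coupled feature plane.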
The convolution kernels 404a to 404c respectively have different coefficients. Also, the convolution kernels 404a to 404c have different sizes depending on the feature planes.
The CNN calculations generate the feature plane by repeating the product-sum calculation while scanning a plurality of filter kernels for respective pixels, and by non-linearly converting the final product-sum result. Upon calculating the feature plane 403a, since the number of couplings with the previous layer is 1, one convolution kernel 404a is used. On the other hand, upon calculating each of feature planes 407a and 407b, since the number of couplings with the previous layer is 3, the calculation results of three convolution filters 409a to 409c or 409d to 409f are accumulated. That is, the feature plane 407a can be generated by accumulating the outputs from the convolution filters 409a to 409c, and finally executing the non-linear conversion processing of the sum.
Note that the convolution kernels 409a to 409f respectively have different filter coefficients. The convolution kernels 409a to 409c and convolution kernels 409d to 409f respectively have different kernel sizes, as shown in FIG. 4. The basic arrangement of the accumulation of convolution filters and the non-linear conversion processing is the same as that of the neuron shown in FIG. 3. That is, the coefficients of the convolution kernels correspond to the weighted coefficients w_1 to w_n. Upon coupling to a plurality of feature planes of the previous layers like the feature planes 407a, 407b, and 408, the accumulation adder 32 accumulates calculation results of a plurality of convolution kernels. That is, the total number of couplings corresponds to the convolution kernel size×the number of features of the previous layer.
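The accumulation over several previous-layer feature planes can be sketched as follows. This is a simplified sketch under stated assumptions: all kernels coupled to one output plane are square, odd-sized, and of equal size; only positions whose reference area lies fully inside the previous planes are computed, so the result is smaller than the input; and the names feature_plane and total_couplings are hypothetical.

```python
import math


def feature_plane(prev_planes, kernels, activation=math.tanh):
    """Generate one feature plane from several previous-layer planes.

    The convolution results of all (plane, kernel) pairs are accumulated
    for each pixel, and the non-linear conversion is applied once to the
    final sum.
    """
    height, width = len(prev_planes[0]), len(prev_planes[0][0])
    half = len(kernels[0]) // 2
    result = []
    for y in range(half, height - half):
        result_row = []
        for x in range(half, width - half):
            accumulated = 0.0
            for plane, kernel in zip(prev_planes, kernels):
                for row in range(-half, half + 1):
                    for column in range(-half, half + 1):
                        accumulated += (plane[y + row][x + column]
                                        * kernel[row + half][column + half])
            result_row.append(activation(accumulated))
        result.append(result_row)
    return result


def total_couplings(kernel_size, num_prev_features):
    """Kernel size (in taps) times the number of previous-layer features."""
    return kernel_size * kernel_size * num_prev_features


ones_plane = [[1.0] * 3 for _ in range(3)]
kernel = [[0.1] * 3 for _ in range(3)]
plane = feature_plane([ones_plane] * 3, [kernel] * 3)
couplings = total_couplings(3, 3)
```

In this toy case each output pixel has 27 couplings, matching the 3×3 kernel size multiplied by three previous-layer features.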
FIGS. 5A to 5C are views for explaining graphic detection processing in the CNN calculations. Reference numerals 51a to 51c denote convolution kernels which illustrate feature extraction targets of the first layer 406, and are learned to respectively extract a horizontal edge and oblique edges. Reference numerals 52a and 52b denote graphics which are extracted by the second layer 410 based on a plurality of feature extraction results of the first layer and their spatial allocation relationships. Reference numeral 53 denotes a graphic to be finally extracted. The graphic 53 is extracted by the third layer 411 based on a plurality of feature extraction results of the second layer 410 and their spatial allocation relationships. Assume that the filter coefficients of the convolution kernels are determined in advance for respective features by learning using a prevalent method such as perceptron learning, back propagation learning, or the like. In object detection, recognition, and the like, a filter kernel having a size of 10×10 or more is normally used. In general, convolution kernel sizes are different for respective features.
In this way, the CNN calculations can implement robust pattern detection based on primitive features and their spatial allocation relationships by hierarchically coupling layers while holding the results by respective image planes for respective feature extractions.
As has been described using FIG. 2, in an apparatus for detecting an object in an image using a general hierarchical neural network, the only memory required for calculation processing, apart from input and output image buffers, is a buffer memory used to hold respective neuron outputs. That is, if a memory that holds a predetermined number of bits for each of the neurons is provided, desired calculation processing can be executed.
On the other hand, in the case of the CNN calculations, since feature extraction is made based on the spatial allocation of a plurality of feature extraction results of the previous layer, data buffers of a predetermined size are required between adjacent layers. For example, in the case of the CNN calculation configuration example shown in FIG. 4, five feature plane buffer memories, each of the image size, need to be prepared (feature planes 403a to 403c, and feature planes 407a and 407b) in addition to input and output image buffers. For this reason, the memory size required for processing becomes larger than that of a general hierarchical neural network.
The methods described in patent references 3 and 4 also hold the feature extraction results by image planes, and the memory size required for processing is larger than that of a method using a general hierarchical neural network.
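The buffer requirement for the FIG. 4 configuration can be estimated with a short calculation. This is an illustrative sketch only: the function names are hypothetical, and the 640×480 image size and one byte per feature value are assumed figures that do not appear in the text.

```python
def intermediate_plane_count(features_per_layer):
    """Number of feature planes buffered between layers.

    The planes of the last layer are excluded because they form the
    output image buffer.
    """
    return sum(features_per_layer[:-1])


def feature_buffer_bytes(width, height, features_per_layer, bytes_per_value=1):
    """Memory for all intermediate feature planes, each of the image size."""
    planes = intermediate_plane_count(features_per_layer)
    return width * height * planes * bytes_per_value


# FIG. 4: 3 features in the first layer, 2 in the second, 1 in the third.
fig4_layers = [3, 2, 1]
planes = intermediate_plane_count(fig4_layers)
# Assumed example values: 640x480 image, 1 byte per feature value.
buffer_bytes = feature_buffer_bytes(640, 480, fig4_layers)
```

Under these assumed values, the five planes need about 1.5 MB, and the figure scales linearly with the number of features in the intermediate layers.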
As a result, particularly, in case of hardware implementation, a RAM (Random Access Memory) having a large size needs to be prepared in an LSI, resulting in increases in circuit scale. Even in the case of software implementation, upon implementing in an embedded device, the cost similarly increases due to an increase in memory size required for the system. That is, the memory size that can be used for calculations is a finite value specified by the cost that can be spent for the system.
On the other hand, as a method of avoiding an increase in memory size, a method of dividing input data into areas and inputting the divided data is used. However, when calculations for a broad reference area are to be hierarchically processed, since data to be divisionally input need to overlap each other over a broad range, processing target areas consequently increase, thus lowering the processing efficiency and processing speed.
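The cost of overlapping divided inputs can be illustrated with a rough calculation. This sketch assumes square tiles, odd kernel sizes, and layers that are applied without subsampling, so that each layer widens the required input by (kernel size - 1)/2 pixels on every side. The function names and the 32-pixel tile with three 11×11 layers are hypothetical example values.

```python
def halo_per_side(kernel_sizes):
    """Extra input pixels needed on each side of a tile.

    Each stacked convolution layer with an odd kernel size k adds
    (k - 1) // 2 pixels of context per side.
    """
    return sum((k - 1) // 2 for k in kernel_sizes)


def tile_overhead(tile_size, kernel_sizes):
    """Return (input tile size, ratio of input area read to output area)."""
    halo = halo_per_side(kernel_sizes)
    input_size = tile_size + 2 * halo
    ratio = (input_size * input_size) / (tile_size * tile_size)
    return input_size, ratio


# Assumed example: 32x32 output tiles through three 11x11 layers.
halo = halo_per_side([11, 11, 11])
input_size, ratio = tile_overhead(32, [11, 11, 11])
```

In this assumed case each 32×32 output tile needs a 62×62 input area, so roughly 3.75 times as many input pixels are read as output pixels are produced, which shows how broad reference areas reduce the benefit of dividing the input.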