Signal processing apparatuses using neural networks are widely used in pattern identification systems, prediction systems, control systems, and the like. The neural network is often implemented as software which runs on a microprocessor, and is provided as application software for a personal computer, workstation, and the like.
FIG. 14 is a schematic diagram showing an example of the arrangement of an image processing apparatus using a general layer-interconnected neural network. Referring to FIG. 14, reference numeral 21 denotes detection target data, for example, raster-scanned image data. Reference numeral 22 denotes a calculation unit which detects a predetermined object from an image, and comprises a neural network of three layers in the example of FIG. 14. Reference numeral 23 denotes an output data plane corresponding to the calculation result. The calculation unit 22 executes processing while scanning and referring to a predetermined image area 24, thereby detecting a detection target which exists in the image. The output data plane 23 is an image plane having the same size as the image data 21 as the detection target, and stores detection outputs obtained when the calculation unit 22 processes all the areas of the image data 21 while scanning them. Since the calculation unit 22 outputs a large value at a position where a target is detected, the position of the target in the image plane can be recognized by scanning the output data plane 23. In the calculation unit 22, reference numerals 25, 26, and 27 denote layers of the neural network, and a predetermined number of neurons 28 exist in each layer. The first layer 25 has as many nodes, that is, neurons 28, as the number of pixels of a reference image. The respective neurons are feedforward-interconnected via predetermined weighting coefficients.
FIG. 15 shows the arrangement of one neuron 28. Reference symbols in_1 to in_n denote input values to this processing node, which are detection target image data in the first layer, and neuron output values of the previous layer in the second and subsequent layers. Multipliers 31a, 31b, . . . , 31n output products obtained by multiplying the output values of the respective previous layer neurons by coefficients w_1 to w_n obtained by learning. An accumulation adder 32 accumulates the products from the multipliers 31a, 31b, . . . , 31n. A nonlinear transformation processing unit 33 nonlinearly transforms the accumulated sum of the accumulation adder 32 using a logistic function, hyperbolic tangent function (tanh function), or the like, and outputs the result as a detection result “out”. In the hierarchical neural network, the weighting coefficients w_1 to w_n required for the respective neurons are determined in advance in accordance with a detection target using a generally known learning algorithm such as back propagation.
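The neuron of FIG. 15 can be sketched in a few lines of Python. This is only an illustrative sketch: the names neuron_output and logistic, and the sample input and weight values, are assumptions introduced here and do not appear in the original description. No bias term is included, because none is described.

```python
import math

def logistic(u):
    """Logistic function, one of the nonlinear transformations named in the text."""
    return 1.0 / (1.0 + math.exp(-u))

def neuron_output(inputs, weights, activation=math.tanh):
    """Weighted sum of the inputs followed by a nonlinear transformation."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # Multipliers 31a..31n and accumulation adder 32: accumulate in_i * w_i.
    accumulated = sum(x * w for x, w in zip(inputs, weights))
    # Nonlinear transformation processing unit 33 produces "out".
    return activation(accumulated)

# Hypothetical values, chosen only for illustration.
out_tanh = neuron_output([0.5, -1.0, 2.0], [0.2, 0.4, 0.1])
out_logistic = neuron_output([0.5, -1.0, 2.0], [0.2, 0.4, 0.1], activation=logistic)
```

Passing the activation as a parameter mirrors the statement that either a logistic function or a hyperbolic tangent function may be used.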
For the purpose of low-cost implementation of such a layer-interconnected neural network in an embedded device or the like, implementation methods using analog hardware or digital hardware have been proposed. For example, Japanese Patent No. 2679730 (patent reference 1) discloses an architecture of a hierarchical structure neural network which implements a multilayered structure by using single-layer analog neural network hardware in a time division multiplexed manner. Also, Japanese Patent Laid-Open No. 03-055658 (patent reference 2) discloses an implementation method using digital hardware.
On the other hand, among neural networks, a calculation method called Convolutional Neural Networks (to be abbreviated as CNN hereinafter) is known as a method that allows pattern recognition robust against variations of an identification target. For example, Japanese Patent Laid-Open No. 10-021406 (patent reference 3) and Japanese Patent Laid-Open No. 2002-358500 (patent reference 4) have proposed examples applied to target identification or detection in an image.
FIG. 16 shows the logical network composition of a simple CNN as an example. FIG. 16 shows an example of a three-layer CNN in which the number of features of a first layer 406 is 3, that of a second layer 410 is 2, and that of a third layer 411 is 1. Reference numeral 401 denotes image data, which corresponds to raster-scanned image data. Reference numerals 403a to 403c denote feature planes of the first layer 406. A feature plane is an image data plane indicating the calculation result obtained while scanning data of the previous layer using a predetermined feature extraction filter (the accumulated sum of convolution calculations and nonlinear processing). Since the feature plane is the detection result for the raster-scanned image data, the detection result is also expressed by a plane. The feature planes 403a to 403c are generated from the image data 401 by corresponding feature extraction filters. For example, the feature planes 403a to 403c are generated by two-dimensional convolution filter calculations corresponding to convolution filter kernels 404a to 404c, and the nonlinear transformation of the calculation results. Note that reference numeral 402 denotes a reference image area required for the convolution calculations.
For example, a convolution filter calculation having a kernel size (the length in the horizontal direction and the height in the vertical direction) of 11×11 processes data by a product-sum calculation given by:
$$\mathrm{output}(x, y) = \sum_{\mathrm{row}=0}^{\mathrm{rowSize}} \sum_{\mathrm{column}=0}^{\mathrm{columnSize}} \mathrm{input}(x + \mathrm{column},\, y + \mathrm{row}) \times \mathrm{weight}(\mathrm{column}, \mathrm{row}) \tag{1}$$
where
input(x, y): a reference pixel value at coordinates (x, y)
output(x, y): a calculation result at coordinates (x, y)
weight(column, row): a weighting coefficient applied to the reference pixel at coordinates (x+column, y+row)
columnSize=11, rowSize=11: a convolution filter kernel size (the number of filter taps).
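The product-sum calculation of equation (1) can be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation taken from the original: the function names convolve_at and convolve are hypothetical, the image is assumed to be stored as a list of rows (image[y][x]), kernel[row][column] plays the role of weight(column, row), and the indices are taken to run from 0 to size−1 so that the number of terms equals the number of filter taps. Border handling is not specified in the text, so only positions where the kernel fits entirely inside the image are computed. A 3×3 kernel is used instead of 11×11 to keep the example short.

```python
def convolve_at(image, kernel, x, y):
    """Product-sum of equation (1) at output coordinates (x, y)."""
    total = 0.0
    for row in range(len(kernel)):
        for column in range(len(kernel[0])):
            # input(x + column, y + row) * weight(column, row)
            total += image[y + row][x + column] * kernel[row][column]
    return total

def convolve(image, kernel):
    """Scan the kernel over every position where it fits inside the image."""
    row_size, column_size = len(kernel), len(kernel[0])
    out_h = len(image) - row_size + 1
    out_w = len(image[0]) - column_size + 1
    return [[convolve_at(image, kernel, x, y) for x in range(out_w)]
            for y in range(out_h)]

# Hypothetical 4x4 image with pixel values 0..15 and a 3x3 kernel of ones.
image = [[4 * y + x for x in range(4)] for y in range(4)]
kernel = [[1.0] * 3 for _ in range(3)]
result = convolve(image, kernel)
```

With the all-ones kernel, each output value is simply the sum of the 3×3 reference image area starting at (x, y).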
Reference numerals 404a to 404c denote convolution filter kernels having different coefficients. Also, the convolution filter kernels have different sizes depending on the feature planes. The convolution filter kernels will be referred to as convolution kernels hereinafter.
The CNN calculations generate the feature plane by repeating the product-sum calculation while scanning a plurality of filter kernels for respective pixels, and by nonlinearly transforming the final product-sum result. Upon calculating the feature plane 403a, since the number of interconnections with the previous layer is 1, the number of filter kernels is 1 (404a). On the other hand, upon calculating each of feature planes 407a and 407b, since the number of interconnections with the previous layer (first layer 406) is 3, the calculation results of three convolution filters corresponding to convolution kernels 409a to 409c or 409d to 409f are accumulated. The convolution kernels 409a to 409f have different filter coefficients. The convolution kernels 409a to 409c and the convolution kernels 409d to 409f have different kernel sizes, as shown in FIG. 16. For example, the feature plane 407a can be generated by accumulating the outputs from the convolution kernels 409a to 409c, and finally executing the nonlinear transformation processing of the result.
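The generation of a second-layer feature plane such as 407a, described above, can be sketched as below. The names convolve and feature_plane, the toy 2×2 planes, and the kernel values are assumptions made for illustration only. The sketch also assumes that the kernels applied to one output plane share the same size, so that the partial results align, and that only positions where the kernel fits inside the plane are computed.

```python
import math

def convolve(plane, kernel):
    """2D product-sum over every position where the kernel fits in the plane."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_h = len(plane) - k_rows + 1
    out_w = len(plane[0]) - k_cols + 1
    return [[sum(plane[y + r][x + c] * kernel[r][c]
                 for r in range(k_rows) for c in range(k_cols))
             for x in range(out_w)]
            for y in range(out_h)]

def feature_plane(prev_planes, kernels, activation=math.tanh):
    """Accumulate one convolution per previous-layer plane, then transform."""
    if len(prev_planes) != len(kernels):
        raise ValueError("one kernel is required per previous-layer plane")
    partials = [convolve(p, k) for p, k in zip(prev_planes, kernels)]
    out_h, out_w = len(partials[0]), len(partials[0][0])
    # The nonlinear transformation is applied once, to the accumulated sum.
    return [[activation(sum(part[y][x] for part in partials))
             for x in range(out_w)]
            for y in range(out_h)]

# Toy stand-ins for feature planes 403a to 403c and kernels 409a to 409c.
planes = [[[1.0, 0.0], [0.0, 1.0]],
          [[0.5, 0.5], [0.5, 0.5]],
          [[0.0, 1.0], [1.0, 0.0]]]
kernels = [[[1.0, 0.0], [0.0, 1.0]],
           [[1.0, 1.0], [1.0, 1.0]],
           [[-1.0, -1.0], [-1.0, -1.0]]]
plane_407a = feature_plane(planes, kernels)
```

Here the three partial results are 2, 2, and −2, so the single output value is the nonlinear transformation of their sum, 2.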
The basic arrangement of the accumulation of convolution kernels (convolution filters) and the nonlinear transformation processing is the same as that of the neuron shown in FIG. 15. That is, the coefficients of the convolution kernel correspond to the weighting coefficients w_1 to w_n. When a feature plane is interconnected to a plurality of feature planes of the previous layer, as with the feature planes 407a, 407b, and 408, the accumulation adder 32 accumulates a plurality of convolution kernel calculation results. That is, the total number of interconnections corresponds to the convolution kernel size×the number of features of the previous layer.
FIG. 17 is a view for explaining graphic detection processing in the CNN calculations. Reference numerals 51a to 51c denote convolution kernels which illustrate feature extraction targets of the first layer, and are learned to respectively extract a horizontal edge and oblique edges. Reference numerals 52a and 52b denote graphics determined based on the extraction results of a plurality of first layer features (primary features) and their spatial allocation relationships. Reference numeral 53 denotes a graphic to be finally extracted (a tertiary feature in this example). The graphic 53 is determined based on the extraction results of a plurality of second layer features (secondary features) and their spatial allocation relationships. Assume that the respective filter coefficients of the convolution kernels are determined for respective features by learning using a common method such as perceptron learning, back propagation learning, or the like. In object detection, recognition, and the like, a filter kernel having a size of 10×10 or more is normally used. In general, convolution kernel sizes are different for respective features.
In this way, in the CNN calculations, by hierarchically interconnecting layers while holding the result of each feature extraction as an image plane, robust pattern detection based on primitive features and their spatial allocation relationships can be implemented.
As has been described using FIG. 14, in an apparatus for detecting an object in an image using a general hierarchical neural network, the only memory required for calculation processing, apart from the input and output image buffers, is a buffer memory used to hold neuron outputs. That is, if a memory holding a predetermined number of bits for each neuron is provided, desired calculation processing can be executed.
On the other hand, in the case of the CNN calculations, since feature extraction is made based on the spatial allocation of a plurality of feature extraction results of the previous layer, data buffers of a predetermined size are required between adjacent layers. For example, in the case of the CNN calculation configuration shown in FIG. 16, five feature plane buffer memories, each of the image size, are required in addition to the input and output image buffers. For this reason, the memory size required for processing becomes larger than that of a general hierarchical neural network.
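The buffer requirement for the configuration of FIG. 16 can be illustrated with a short calculation. This is a sketch under assumptions that the original does not state: the function name feature_buffer_bytes is hypothetical, the five buffers are taken to be the three first-layer and two second-layer feature planes (the third-layer plane being held in the output image buffer), each feature plane is assumed to have the same size as the input image, and the 320×240 image size and one byte per value are arbitrary example figures.

```python
def feature_buffer_bytes(width, height, planes_per_layer, bytes_per_value=1):
    """Memory for intermediate feature planes, excluding input/output buffers."""
    # Planes of the last layer are assumed to live in the output image buffer.
    intermediate_planes = sum(planes_per_layer[:-1])
    return width * height * intermediate_planes * bytes_per_value

# FIG. 16: 3, 2, and 1 feature planes in the first to third layers.
# The 320x240 image size and 1 byte per value are assumed example figures.
cnn_bytes = feature_buffer_bytes(320, 240, [3, 2, 1])
```

Under these assumptions the intermediate buffers alone occupy 320 × 240 × 5 = 384,000 bytes, and the requirement grows linearly with both the image size and the number of intermediate feature planes.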
The methods disclosed in patent references 3 and 4 described above also hold the feature extraction results as image planes, and the memory size required for processing is larger than that of a general hierarchical neural network.
In particular, when the CNN calculations are implemented in hardware, a large RAM (Random Access Memory) needs to be provided in an LSI, which increases the circuit scale and cost. Even when the CNN calculations are implemented in software, if the implementation targets an embedded device, the cost similarly increases because the memory size required for the system increases.