Signal processing apparatuses using neural networks are widely used in applications such as pattern identification systems, prediction systems, and control systems. The neural network is often implemented as software which runs on a microprocessor, and is provided as application software for a personal computer, workstation, and the like.
FIG. 2 is a schematic diagram showing an example of the arrangement of an image processing apparatus using a general layer-interconnected neural network. Reference numeral 21 denotes, for example, raster-scanned image data as a detection target. A calculation unit 22 comprises a neural network of three layers and detects a predetermined object from the image data 21. Reference numeral 23 denotes an output data plane corresponding to the calculation result. The calculation unit 22 executes processing while scanning and referring to a predetermined image area 24 in the image data 21, thereby detecting a detection target which exists in the image data 21. The output data plane 23 is a data plane having the same size as the image data 21 as the detection target, and stores, in the scan order, detection outputs obtained when the calculation unit 22 processes all the areas of the image data 21 while scanning them. Since the calculation unit 22 outputs a large value at a position where a target is detected, the position of the target in the image plane can be recognized by scanning the output data plane 23.
In the calculation unit 22, reference numerals 25, 26, and 27 denote layers of the neural network, and a predetermined number of neurons 28 exist in each layer. The first layer 25 has the same number of neurons 28 as the number of pixels of a reference image. Respective neurons are feedforward-interconnected via predetermined weighting coefficients.
FIG. 3 shows the arrangement of one neuron 28. Reference numerals in_1 to in_n denote input values of neurons, which are output values of the previous layer neurons in the second and subsequent layers. Multipliers 31a, 31b, . . . , 31n output products obtained by multiplying the output values of the respective previous layer neurons by coefficients w_1 to w_n obtained by learning. An accumulation adder 32 accumulates the products from the multipliers 31a, 31b, . . . , 31n. A nonlinear transformation processing unit 33 nonlinearly transforms the accumulated sum from the accumulation adder 32 using a logistic function, hyperbolic tangent function (tanh function), or the like, and outputs that transformation result as a detection result “out”. In the hierarchical neural network of this type, the weighting coefficients w_1 to w_n required for respective neurons are determined in advance for respective detection targets using a learning algorithm such as back propagation, or the like, which is generally known.
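The neuron of FIG. 3 can be illustrated with a minimal Python sketch. The function names `neuron_output` and `logistic` and the numeric values are hypothetical and chosen only for illustration. No bias term is included because none appears in the description above.

```python
import math


def logistic(u):
    """Logistic function, one of the nonlinear transformations named in the text."""
    return 1.0 / (1.0 + math.exp(-u))


def neuron_output(inputs, weights, activation=math.tanh):
    """Return "out" for one neuron: nonlinear transform of sum(in_i * w_i).

    inputs  -- in_1 .. in_n (outputs of the previous layer neurons)
    weights -- w_1 .. w_n (coefficients assumed to be obtained by learning)
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # Multipliers 31a..31n followed by the accumulation adder 32.
    accumulated = sum(x * w for x, w in zip(inputs, weights))
    # Nonlinear transformation processing unit 33.
    return activation(accumulated)


# Illustrative values only: the weighted sum is 0.4 - 0.1 - 0.1 = 0.2.
out = neuron_output([0.5, -1.0, 0.25], [0.8, 0.1, -0.4])
```

Passing `activation=logistic` selects the logistic function instead of tanh.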
For the purpose of high-performance and low-cost implementation of such a layer-interconnected neural network in an embedded device or the like, methods of implementing the layer-interconnected neural network using analog hardware or digital hardware have been proposed.
For example, Japanese Patent No. 2679730 (patent reference 1) discloses an architecture of a hierarchical structure neural network which implements a multilayered structure by using single-layer analog neural network hardware in a time-division multiplexed manner. Also, Japanese Patent Laid-Open No. 3-55658 (patent reference 2) discloses a method of implementing a neural network using digital hardware. On the other hand, among neural networks, a calculation method called Convolutional Neural Networks (to be abbreviated as CNN hereinafter) is known as a method that allows pattern recognition robust against variations of an identification target. For example, Japanese Patent Laid-Open No. 10-021406 (patent reference 3) and Japanese Patent Laid-Open No. 2002-358500 (patent reference 4) have proposed examples applied to target identification or detection in an image.
FIG. 4 shows the logical network composition as an example of a simple CNN. FIG. 4 shows an example of a three-layer CNN in which the number of features of a first layer 406 is 3, that of a second layer 410 is 2, and that of a third layer 411 is 1. Reference numeral 401 denotes image data, which corresponds to raster-scanned image data. The image data 401 is input data to the CNN calculations. Reference numerals 403a to 403c denote feature planes of the first layer 406. The feature plane is an image data plane indicating the processing result obtained by calculations while scanning data of the previous layer using a predetermined feature extraction filter (the accumulated sum of convolution calculations and nonlinear processing). Since the feature plane is the detection result for the raster-scanned image data, the detection result is also expressed by a plane. The feature planes 403a to 403c are generated from the image data 401 by corresponding feature extraction filters. For example, the feature planes 403a to 403c are generated by nonlinearly transforming the calculation results of the two-dimensional convolution filters 404a to 404c. Note that reference numeral 402 denotes a reference image area required for the convolution calculations of the convolution filters 404a to 404c.
For example, a convolution filter calculation having a kernel size (the length in the horizontal direction and the height in the vertical direction) of 11×11 processes data by a product-sum calculation given by:
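$$\mathrm{output}(x, y) = \sum_{\mathrm{row}=-\mathrm{rowSize}/2}^{\mathrm{rowSize}/2} \; \sum_{\mathrm{column}=-\mathrm{columnSize}/2}^{\mathrm{columnSize}/2} \mathrm{input}(x+\mathrm{column},\; y+\mathrm{row}) \times \mathrm{weight}(\mathrm{column},\; \mathrm{row}) \tag{1}$$

where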
input (x, y): a reference pixel value at coordinates (x, y)
output (x, y): a calculation result at coordinates (x, y)
weight (column, row): a weighting coefficient applied to the reference pixel at coordinates (x+column, y+row)
columnSize=11, rowSize=11: a filter kernel size (the number of filter taps).
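As an illustration of equation (1), the following Python sketch evaluates the product-sum at one pixel. It assumes that rowSize/2 and columnSize/2 denote integer division (so an 11×11 kernel runs from -5 to +5, giving 11 taps per direction) and that the caller keeps the reference area inside the image. Border handling is not specified above. The function name `convolve_at` and the storage layouts `image[y][x]` and `weights[row][column]` are hypothetical.

```python
def convolve_at(image, weights, x, y):
    """Evaluate the product-sum of equation (1) at pixel (x, y).

    image   -- 2-D list indexed as image[y][x]
    weights -- 2-D list with odd dimensions; weights[0][0] holds the
               coefficient for offset (-columnSize//2, -rowSize//2)
    """
    half_rows = len(weights) // 2       # rowSize / 2, assumed integer division
    half_cols = len(weights[0]) // 2    # columnSize / 2, assumed integer division
    total = 0.0
    for row in range(-half_rows, half_rows + 1):
        for column in range(-half_cols, half_cols + 1):
            pixel = image[y + row][x + column]                  # input(x+column, y+row)
            coeff = weights[row + half_rows][column + half_cols]  # weight(column, row)
            total += pixel * coeff
    return total


# Small 3x3 example (values are illustrative only).
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
ones = [[1, 1, 1]] * 3
center_sum = convolve_at(image, ones, 1, 1)  # sums all nine pixels: 45.0
```

Scanning `x` and `y` over the image and storing each result yields one output plane before the nonlinear transformation.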
The convolution filters 404a to 404c are convolution filter kernels having different coefficients. Also, the convolution kernels have different sizes depending on the feature planes.
The CNN calculations generate the feature plane by repeating the product-sum calculation while scanning a plurality of filter kernels for respective pixels, and by nonlinearly transforming the final product-sum result. Upon calculating the feature plane 403a, since the number of interconnections with the previous layer is 1, the number of filter kernels is 1 (convolution filter 404a). On the other hand, upon calculating each of feature planes 407a and 407b, since the number of interconnections with the previous layer is 3, the calculation results of three convolution filters 409a to 409c or 409d to 409f are accumulated. That is, the feature plane 407a can be generated by accumulating the outputs from the convolution filters 409a to 409c, and finally executing the nonlinear transformation processing of the sum.
Note that the convolution filters 409a to 409f are convolution kernels having different filter coefficients. The convolution filters 409a to 409c and convolution filters 409d to 409f have different kernel sizes, as shown in FIG. 4. The basic arrangement of the accumulation of convolution filters and the nonlinear transformation processing is the same as that of the neuron shown in FIG. 3. That is, the coefficients of the convolution kernels correspond to the weighting coefficients w_1 to w_n. Upon interconnecting to the feature planes of a plurality of previous layers like the feature planes 407a, 407b, and 408, the accumulation adder 32 accumulates calculation results of a plurality of convolution kernels. That is, the total number of interconnections corresponds to the convolution kernel size×the number of features of the previous layer.
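The accumulation over several previous-layer feature planes can be sketched as follows. This is a minimal illustration, not the implementation of any cited reference. It assumes that all kernels feeding one output plane share the same size, that only positions where the kernel fits entirely inside the plane are computed (no padding), and that tanh is the nonlinear transformation. All function names are hypothetical.

```python
import math


def convolve_valid(plane, kernel):
    """Product-sum of `kernel` over every position where it fits inside `plane`."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(plane) - kh + 1
    out_w = len(plane[0]) - kw + 1
    return [[sum(plane[y + r][x + c] * kernel[r][c]
                 for r in range(kh) for c in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]


def feature_plane(prev_planes, kernels, activation=math.tanh):
    """Accumulate one convolution per previous-layer plane, then transform the sum."""
    if not prev_planes or len(prev_planes) != len(kernels):
        raise ValueError("need one kernel per previous-layer feature plane")
    accumulated = None
    for plane, kernel in zip(prev_planes, kernels):
        partial = convolve_valid(plane, kernel)
        if accumulated is None:
            accumulated = partial
        else:
            # Role of the accumulation adder: add this kernel's output to the sum.
            accumulated = [[a + b for a, b in zip(row_a, row_b)]
                           for row_a, row_b in zip(accumulated, partial)]
    # The nonlinear transformation is applied once, to the final sum.
    return [[activation(v) for v in row] for row in accumulated]


def interconnections(kernel_width, kernel_height, previous_features):
    """Total interconnections = kernel size x number of previous-layer features."""
    return kernel_width * kernel_height * previous_features
```

For instance, under the assumption of an 11×11 kernel and three previous-layer features, `interconnections(11, 11, 3)` gives 363 weighting coefficients per output value.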
FIG. 5 is a view for explaining graphic detection processing in the CNN calculations. Reference numerals 51a to 51c denote convolution kernels which illustrate feature extraction targets of the first layer, and are learned to respectively extract a horizontal edge and oblique edges. Reference numerals 52a and 52b denote graphics determined based on a plurality of feature extraction results of the first layer and their spatial allocation relationships. Reference numeral 53 denotes a graphic to be finally extracted. The graphic 53 is determined based on a plurality of second layer feature extraction results and their spatial allocation relationship. Assume that the filter coefficients of the convolution kernels are determined in advance for respective features by learning using a prevalent method such as perceptron learning, back propagation learning, or the like. In object detection, recognition, and the like, a filter kernel having a size as large as 10×10 or more is normally used. In general, convolution kernel sizes are different for respective features.
In this way, in the CNN calculations, by hierarchically interconnecting layers while holding the results by respective image planes for respective feature extractions, robust pattern detection based on primitive features and their spatial allocation relationships can be implemented.
As has been described using FIG. 2, in an apparatus for detecting an object in an image which uses a general hierarchical neural network, the only memory required for calculation processing, apart from the input and output image buffers, is a buffer memory used to hold neuron outputs. That is, if a memory holding a predetermined number of bits for each neuron is provided, the desired calculation processing can be executed.
On the other hand, in the case of the CNN calculations, since feature extraction is made based on the spatial allocation of a plurality of feature extraction results of the previous layer, data buffers of a predetermined size are required between adjacent layers. For example, in the case of the CNN calculation configuration shown in FIG. 4, the five feature planes 403a to 403c, and 407a and 407b (buffer memories) are required. That is, a memory size of an image size×5 is required in addition to input and output image buffers. For this reason, the memory size required for processing becomes larger than that of a general hierarchical neural network.
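The difference in memory size can be made concrete with a short calculation. The image size (320×240), the one byte per value, and the neuron counts of the layered network below are assumptions chosen only for illustration. Only the count of five intermediate feature planes comes from the FIG. 4 configuration. The function names are hypothetical.

```python
def cnn_feature_plane_bytes(width, height, planes_per_layer, bytes_per_value=1):
    """Intermediate buffer size when every feature plane is held as a full image plane."""
    return width * height * sum(planes_per_layer) * bytes_per_value


def layered_network_bytes(neurons_per_layer, bytes_per_value=1):
    """Buffer size when only one output value per neuron has to be held."""
    return sum(neurons_per_layer) * bytes_per_value


# FIG. 4: three planes in the first layer and two in the second (five in total).
# Assumed 320x240 image and 1 byte per value.
cnn_bytes = cnn_feature_plane_bytes(320, 240, [3, 2])  # 384000

# Hypothetical three-layer network with an 11x11 reference image (121 inputs).
layered_bytes = layered_network_bytes([121, 10, 1])    # 132
```

Under these assumptions, the intermediate buffers of the CNN configuration are several thousand times larger than the neuron output buffer of the layered network.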
The methods disclosed in patent references 3 and 4 described above also hold the feature extraction results by image planes, and the memory size required for processing is larger than that of a general hierarchical neural network.
As a result, particularly upon hardware implementation of the CNN calculation configuration, a RAM (Random Access Memory) having a large size needs to be prepared in an LSI, resulting in an increase in circuit scale. Even upon software implementation of the CNN calculation configuration, if it is implemented in an embedded device, the cost similarly increases due to an increase in memory size required for the system.
On the other hand, as a method of avoiding an increase in memory size, a method of dividing input data into areas and inputting the divided data is used. However, when calculations with a broad reference area are to be hierarchically processed, the divided input data must overlap over a broad range. The processing target areas therefore increase, resulting in a drop in processing speed.
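The cost of overlapping the divided input can be estimated with a rough sketch. It assumes square kernels, a scan step of one pixel, and no subsampling between layers, none of which is stated above. The kernel sizes and the tile size are illustrative values, and the function names are hypothetical.

```python
def reference_area_side(kernel_sizes):
    """Side length of the input area that one final output value depends on.

    For stacked convolutions with a one-pixel scan step, each layer widens
    the reference area by (kernel size - 1) pixels.
    """
    return 1 + sum(k - 1 for k in kernel_sizes)


def tiled_input_overhead(tile_side, kernel_sizes):
    """Ratio of input pixels read per tile to output pixels produced per tile."""
    input_side = tile_side + reference_area_side(kernel_sizes) - 1
    return (input_side * input_side) / (tile_side * tile_side)


# Assumed example: three layers of 11x11 kernels and 32x32 output tiles.
side = reference_area_side([11, 11, 11])       # 31
ratio = tiled_input_overhead(32, [11, 11, 11])  # (62 * 62) / (32 * 32)
```

In this assumed case each 32×32 output tile needs a 62×62 input area, so the first layer reads nearly four times as many input pixels as it would without division.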