In the field of information processing, multi-dimensional array information is frequently handled. Partial processing, statistical processing, and similar operations performed in image processing, image recognition, image composition, and the like often calculate and use the sum total of the elements within a specific region.
For this purpose, a spreadsheet application such as Microsoft Excel®, as one example of an application that implements information processing, has a function for calculating the sum of the elements within a designated rectangle of a two-dimensional table. A programming language for numerical calculation such as MathWorks MATLAB® includes a function for calculating the sum of the elements of a matrix.
In the field of computer graphics, F. C. Crow has proposed a form of accumulated image information, called a “summed-area table”, derived from original input image information (F. C. Crow, “Summed-Area Tables For Texture Mapping”, Computer Graphics, 1984; “Crow84” hereinafter). In this reference, assuming that a summed-area table is a two-dimensional array having the same size (the same number of elements) as an input image, and letting I(x, y) be a pixel value at a coordinate position (x, y) of the input image, a component C(x, y) at the same position (x, y) of the summed-area table is defined as:
$$C(x, y) = \sum_{x' \le x,\; y' \le y} I(x', y') \qquad (1)$$
That is, as shown in FIG. 4, the sum total value of pixels in a rectangle defined by an origin point position (0, 0) and a position (x, y) as diagonal positions on an original input image 4a is a value C(x, y) at the position (x, y) of a summed-area table 4b. (Note that an original summed-area table of Crow84 has the lower left corner of an image as an origin point position, but the upper left corner is used as an origin point to interface with the following description.)
According to this definition, the sum of the pixel values I(x, y) in an arbitrary rectangular region with horizontal and vertical sides on the input image can be calculated by referring to only four values of the summed-area table. For example, as shown in FIG. 5, the sum total C(x0, y0; x1, y1) of the pixel values in a rectangular region defined by (x0, y0) and (x1, y1) as diagonal points can be calculated by:
$$C(x_0, y_0; x_1, y_1) = C(x_0 - 1, y_0 - 1) - C(x_0 - 1, y_1) - C(x_1, y_0 - 1) + C(x_1, y_1) \qquad (2)$$
In this way, the sum total of values in an arbitrary rectangular region on an image can be calculated at high speed.
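Equations (1) and (2) can be illustrated by the following minimal Python sketch. The function names `summed_area_table` and `rect_sum` are hypothetical and do not appear in Crow84, and the sketch assumes that a table value at a position with a negative coordinate is treated as 0, which equation (2) needs when the rectangle touches the upper or left edge of the image.

```python
def summed_area_table(image):
    """Return C with C[y][x] = sum of image[y'][x'] for all x' <= x, y' <= y (equation (1))."""
    height = len(image)
    width = len(image[0]) if height else 0
    table = [[0] * width for _ in range(height)]
    for y in range(height):
        row_sum = 0  # running sum of row y from column 0 to column x
        for x in range(width):
            row_sum += image[y][x]
            above = table[y - 1][x] if y > 0 else 0
            table[y][x] = above + row_sum
    return table


def rect_sum(table, x0, y0, x1, y1):
    """Sum of the pixels in the rectangle with diagonal points (x0, y0) and (x1, y1), inclusive."""
    def c(x, y):
        # Assumption: positions outside the image (negative coordinate) contribute 0.
        return table[y][x] if x >= 0 and y >= 0 else 0

    # Equation (2): only four table values are referenced.
    return c(x0 - 1, y0 - 1) - c(x0 - 1, y1) - c(x1, y0 - 1) + c(x1, y1)


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
table = summed_area_table(image)

# Compare the four-point result with a direct summation over the same rectangle.
direct = sum(image[y][x] for y in range(1, 3) for x in range(1, 4))
assert rect_sum(table, 1, 1, 3, 2) == direct
```

The table is built once per image, after which the cost of each rectangle sum is independent of the size of the rectangle.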
On the other hand, in the field of image recognition, accumulated image information equivalent to the summed-area table is called “Integral image”. Also, a face detection apparatus that cascade-connects weak classifiers each including a plurality of rectangular filters has been proposed (for example, P. Viola, M. Jones, “Rapid Object Detection using a Boosted Cascade of Simple Features”, Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Vol. 1, pp. 511-518, December 2001., “Viola01” hereinafter).
Furthermore, based on the idea of Viola01, face extraction from successive frames in real time (for example, see Japanese Patent Laid-Open No. 2004-185611), facial expression recognition (for example, see Japanese Patent Laid-Open No. 2005-44330), and instruction input by means of gestures of a face (for example, see Japanese Patent Laid-Open No. 2005-293061), and the like have been proposed.
A pattern recognition method described in Viola01, which can be applied as subsequent processing in an embodiment of the present invention (to be described later), will be described in detail below.
In Viola01, as shown in FIG. 8, a rectangular region 801 (to be referred to as “processing window” hereinafter) having a specific size is moved in an image 800 to be processed, and it is determined if the processing window 801 at each moving destination includes a human face.
FIG. 9 shows the sequence of face detection processing executed in Viola01 in the processing window 801 at each moving destination. The face detection processing in a certain processing window is executed in a plurality of stages. Different combinations of weak classifiers are assigned to the respective stages. Each weak classifier detects a so-called Haar-like feature, and comprises a combination of rectangular filters.
As shown in FIG. 9, different numbers of weak classifiers are assigned to the respective stages. Each stage determines, using the weak classifier pattern or patterns assigned to it, whether the processing window includes a human face.
An order of execution of the determination processing is assigned to the respective stages, which execute processes in cascade according to that order. That is, in, for example, FIG. 9, the determination processing is executed in the order of the first stage, second stage, and third stage.
If it is determined in a given stage that the processing window at a certain position does not include any human face, the processing is aborted for the processing window at that position to skip the determination processing in the subsequent stages. If it is determined in the final stage that the processing window includes a human face, it is determined that the processing window at that position includes a human face.
FIG. 10 is a flowchart showing the sequence of the face detection processing. A practical sequence of the face detection processing will be described below with reference to FIG. 10.
In the face detection processing, the processing window 801 to be processed is allocated at an initial position on a face detection target image 800 (step S1001). Basically, this processing window 801 is moved in turn from one end of the face detection target image 800 at predetermined intervals in the vertical and horizontal directions, so that the entire image is covered. For example, the processing window 801 is selected by raster-scanning the face detection target image 800.
The determination processing as to whether or not the selected processing window 801 includes a human face is executed. This determination processing is executed in a plurality of stages, as described above using FIG. 9. For this reason, a stage that executes the determination processing is selected in turn from the first stage (step S1002).
The selected stage executes the determination processing (step S1003). In the determination processing of this stage, an accumulated score (to be described later) is calculated, and it is determined if the calculated accumulated score exceeds a threshold set in advance for each stage (step S1004). If the accumulated score does not exceed the threshold (No in step S1004), it is determined that the processing window does not include any human face (step S1008), and the processes in step S1007 and subsequent steps are executed. The processes in step S1007 and subsequent steps will be described later.
On the other hand, if the accumulated score exceeds the threshold (Yes in step S1004), it is determined whether that determination processing (step S1003) was executed by the final stage (step S1005). If it was not executed by the final stage (No in step S1005), the process returns to step S1002 to select the next stage and to execute the determination processing by the newly selected stage. If the determination processing was executed by the final stage (Yes in step S1005), it is finally determined that the current processing window includes a human face (step S1006).
It is then determined whether the processing window that has undergone the determination processing is the one at the last position in the face detection target image (step S1007). If the processing window is not at the last position (No in step S1007), the process returns to step S1001 to move the processing window to the next position and to execute the processes in step S1002 and subsequent steps. If the processing window is at the last position, the face detection processing for this input image ends.
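The flow of FIG. 10 can be outlined by the following minimal Python sketch. It is an illustration under stated assumptions, not the implementation of Viola01: the function `detect`, the square window, and the representation of each stage as a hypothetical pair of a scoring function and a threshold are all introduced here only for the example.

```python
def detect(image_width, image_height, window_size, step, stages):
    """Raster-scan a square processing window and run the stage cascade at each position.

    `stages` is a list of (score_fn, threshold) pairs. score_fn(x, y) returns the
    accumulated score of that stage for the window whose upper left corner is (x, y).
    Both are hypothetical stand-ins for stages obtained by learning.
    """
    hits = []
    for y in range(0, image_height - window_size + 1, step):      # S1001: next window position
        for x in range(0, image_width - window_size + 1, step):
            is_face = True
            for score_fn, threshold in stages:                    # S1002: select stages in order
                score = score_fn(x, y)                            # S1003: stage determination
                if score <= threshold:                            # S1004: No
                    is_face = False                               # S1008: no face, abort cascade
                    break
            if is_face:                                           # S1005: final stage passed
                hits.append((x, y))                               # S1006: window includes a face
    return hits                                                   # S1007: last position processed


# Toy stages whose scores depend only on the window position (assumption for illustration).
toy_stages = [
    (lambda x, y: x + y, 3),
    (lambda x, y: x * y, 5),
]
hits = detect(8, 8, 4, 2, toy_stages)
print(hits)  # [(4, 2), (2, 4), (4, 4)]
```

Because the inner loop breaks at the first stage that fails, later stages are evaluated only for windows that earlier stages could not reject.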
The contents of the determination processing in each stage will be described below. One or more patterns of weak classifiers are assigned to each stage. This assignment is executed by a boosting learning algorithm such as AdaBoost or the like in learning processing. Each stage determines based on the weak classifier patterns assigned to itself if the processing window includes a face.
In each stage, feature amounts in a plurality of rectangular regions in the processing window are calculated based on the weak classifier patterns assigned to that stage. The feature amount used in this case is a value calculated using the sum total of pixel values in each rectangular region such as the total, average, and the like of the pixel values in the rectangular region. The sum total value in the rectangular region can be calculated at high speed using the accumulated image information (Summed-area table or Integral Image) for an input image, as has been described using FIG. 5 in association with Crow84.
A relative value of the calculated feature amounts (for example, a ratio or a difference; a difference is used in this case) is then calculated, and it is determined based on this difference whether the processing window includes a human face. More specifically, it is determined whether the calculated difference is larger or smaller than a threshold set in the weak classifier pattern used in the determination. According to this determination result, whether or not the processing window includes a human face is determined.
However, the determination result at this point is that of an individual weak classifier pattern, not that of the stage. In this manner, in each stage, the determination processes are individually executed based on all the assigned weak classifier patterns, and the respective determination results are obtained.
Next, an accumulated score in that stage is calculated. Individual scores are assigned to the weak classifier patterns. If it is determined that the processing window includes a human face, a score assigned to the weak classifier pattern used at that time is referred to, and is added to the accumulated score of that stage. In this way, the sum total of scores is calculated as the accumulated score of that stage. If the accumulated score in this stage exceeds a specific threshold (accumulated score threshold), it is determined in this stage that the processing window is likely to include a human face, and the process advances to the next stage. On the other hand, if the accumulated score in this stage does not exceed the accumulated score threshold, it is determined in this stage that the processing window does not include any human face, and the cascade processing is aborted.
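The determination in one stage can be sketched in Python as follows. The names `WeakClassifier` and `stage_passes` are hypothetical, each weak classifier is simplified to the difference of two rectangle sums, and the rule that a difference larger than the threshold counts as "includes a face" is an assumption made for the example (the source only states that the difference is compared with a threshold).

```python
from typing import List, NamedTuple, Tuple

Rect = Tuple[int, int, int, int]  # (x0, y0, x1, y1), inclusive, relative to the window


class WeakClassifier(NamedTuple):
    rect_a: Rect      # first rectangular region
    rect_b: Rect      # second rectangular region
    threshold: int    # threshold on the difference of the two sums
    score: float      # score added to the stage when this classifier votes "face"


def summed_area_table(image):
    table = []
    for y, row in enumerate(image):
        run, out = 0, []
        for x, value in enumerate(row):
            run += value
            out.append(run + (table[y - 1][x] if y > 0 else 0))
        table.append(out)
    return table


def rect_sum(table, x0, y0, x1, y1):
    def c(x, y):
        return table[y][x] if x >= 0 and y >= 0 else 0
    return c(x0 - 1, y0 - 1) - c(x0 - 1, y1) - c(x1, y0 - 1) + c(x1, y1)


def stage_passes(table, win_x, win_y, classifiers: List[WeakClassifier], stage_threshold):
    """Return (passed, accumulated_score) for one stage at window position (win_x, win_y)."""
    accumulated = 0.0
    for wc in classifiers:
        ax0, ay0, ax1, ay1 = wc.rect_a
        bx0, by0, bx1, by1 = wc.rect_b
        sum_a = rect_sum(table, win_x + ax0, win_y + ay0, win_x + ax1, win_y + ay1)
        sum_b = rect_sum(table, win_x + bx0, win_y + by0, win_x + bx1, win_y + by1)
        # Assumption: a difference above the threshold is taken as a "face" vote.
        if sum_a - sum_b > wc.threshold:
            accumulated += wc.score
    return accumulated > stage_threshold, accumulated


# 4x4 test image: bright left half, dark right half.
image = [[9, 9, 1, 1] for _ in range(4)]
table = summed_area_table(image)
stage = [
    WeakClassifier((0, 0, 1, 3), (2, 0, 3, 3), threshold=32, score=1.0),  # left vs. right
    WeakClassifier((0, 0, 3, 1), (0, 2, 3, 3), threshold=10, score=0.5),  # top vs. bottom
]
result = stage_passes(table, 0, 0, stage, stage_threshold=0.8)
print(result)  # (True, 1.0)
```

Only the left-versus-right classifier votes here, so the accumulated score is 1.0, which exceeds the stage threshold of 0.8.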
In Viola01, high-speed pattern identification, represented by face detection, is implemented by this sequence. Note that the detector shown in FIGS. 9 and 10 can be used as a pattern identifier for identifying objects other than faces if it has undergone appropriate learning in advance.
When the aforementioned accumulated image information (summed-area table or integral image) is generated from input image information, the bit precision and size of the storage buffer (a temporary holding area) are normally determined based on the worst-case value that may be calculated. That is, let Ximg be the width (the number of pixels in the horizontal direction) of the input image information, and Yimg be the height (the number of pixels in the vertical direction). Also, let the bit precision of each pixel be Nimg bits (Nimg is a positive integer). Then, the worst-case value Cmax is the sum total of all pixels:
$$C_{\max} = \sum_{0 \le x < X_{img},\; 0 \le y < Y_{img}} I(x, y) = I_{\max} X_{img} Y_{img} \qquad (3)$$
when all pixel values assume the maximum value:
$$I_{\max} = 2^{N_{img}} - 1$$
Therefore, the bit precision Nbuf per element of the buffer that stores the accumulated image information needs to be a bit precision Nbuf_max that can store Cmax, and this is considerably larger than Nimg, although it depends on the image size. For example, when an 8-bit grayscale image of VGA size is the input image, Nimg=8, Ximg=640, and Yimg=480. Then Cmax=78336000=4AB5000h, so a buffer having Nbuf=Nbuf_max=27-bit precision (size) needs to be secured. When the accumulated image information for the entire region of the input image information must be held simultaneously, a memory area such as a RAM of Nbuf_max×Ximg×Yimg=8294400 bits must be secured, which strains the processing resources.
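The figures in this example can be checked with the following short Python sketch of equation (3). The function name `buffer_requirements` is hypothetical and is introduced only for this check.

```python
def buffer_requirements(width, height, bits_per_pixel):
    """Return (C_max, N_buf_max, total_bits) for a full-size accumulated image buffer."""
    i_max = (1 << bits_per_pixel) - 1         # I_max = 2^N_img - 1
    c_max = i_max * width * height            # equation (3): every pixel at its maximum
    n_buf_max = c_max.bit_length()            # smallest bit width that can hold C_max
    total_bits = n_buf_max * width * height   # one N_buf_max-bit element per pixel
    return c_max, n_buf_max, total_bits


# 8-bit grayscale VGA input: N_img = 8, X_img = 640, Y_img = 480.
c_max, n_buf_max, total_bits = buffer_requirements(640, 480, 8)
print(c_max, hex(c_max), n_buf_max, total_bits)
# 78336000 0x4ab5000 27 8294400
```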
Hence, the bit precision Nbuf of the buffer needs to be reduced by some method. In particular, reducing Nbuf is a serious problem in a hardware implementation of processing based on such accumulated information, since the work memory size leads directly to the circuit scale. Even in a software implementation of that processing, if Nbuf can be reduced, a smaller memory size can be used, which suppresses resource consumption.
Crow84 describes one method of reducing the bit precision Nbuf of the buffer. That is, input information is divided into blocks each defined by 16 pixels×16 pixels, and a summed-area table is independently calculated for each block. If the input information has the bit precision Nimg=8 bits, the required bit precision of the buffer at that time is 16 bits. In addition, for each block, a 32-bit value of the original summed-area table corresponding to the pixel position diagonally adjacent to the upper left corner of the block is held. In order to recover a value corresponding to a desired position, the 32-bit value held by the block including that position can be added to the 16-bit value of that position. (However, such calculations do not suffice to recover the original summed-area table value in practice.)
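The 16-bit figure and the parenthetical remark above can be checked numerically with the following Python sketch. It is not taken from Crow84: the helper names are hypothetical, and a block size of 2 is used instead of 16 only to keep the example small. The sketch shows that adding the held corner value to the block-local value omits the pixels to the left of and above the block, which is one way to see why the simple addition does not recover the original table value.

```python
# A 16x16 block of 8-bit pixels sums to at most 255 * 16 * 16 = 65280, which fits in 16 bits.
assert (255 * 16 * 16).bit_length() == 16


def summed_area_table(image):
    table = []
    for y, row in enumerate(image):
        run, out = 0, []
        for x, value in enumerate(row):
            run += value
            out.append(run + (table[y - 1][x] if y > 0 else 0))
        table.append(out)
    return table


BLOCK = 2  # assumption: small block size for illustration (the text uses 16)
image = [[1] * 4 for _ in range(4)]
global_table = summed_area_table(image)

x, y = 3, 3                                       # position to recover
bx, by = (x // BLOCK) * BLOCK, (y // BLOCK) * BLOCK  # upper left corner of its block

# Block-local summed-area value at (x, y).
local = sum(image[yy][xx] for yy in range(by, y + 1) for xx in range(bx, x + 1))
# Original table value diagonally adjacent to the block's upper left corner.
corner = global_table[by - 1][bx - 1] if bx > 0 and by > 0 else 0

naive = corner + local
# Pixels that the simple addition misses: left of the block and above the block.
left_strip = sum(image[yy][xx] for yy in range(by, y + 1) for xx in range(0, bx))
top_strip = sum(image[yy][xx] for yy in range(0, by) for xx in range(bx, x + 1))

print(naive, global_table[y][x])  # 8 16
assert naive + left_strip + top_strip == global_table[y][x]
```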
However, whereas the sum total value of a desired region can otherwise be calculated by simple additions and subtractions with reference to four points as in equation (2), this conventional method adds a calculation for recovering the value of each of those points, resulting in a considerably heavier calculation load. In a hardware implementation of the processing, the circuit scale for the calculations increases. Even in a software implementation of the processing, the processing speed is lowered.