1. Field of the Invention
The present invention relates to a data processing apparatus in which a plurality of processing modules are connected in series and cascade processing is performed in which it is determined whether or not a subsequent processing is executed depending on a current processing result, and a control method thereof.
2. Description of the Related Art
Generally, technology has been proposed, for use in digital cameras and printers, to detect a particular object such as a person or a face in an input image and perform processing suitable for the detected object. Face detection processing for performing skin color correction processing on the face is an example of detecting a particular object. A variety of methods have been proposed for human face detection processing, such as the method (called the Viola & Jones method) proposed by P. Viola and M. Jones in “Robust Real-time Object Detection”, SECOND INTERNATIONAL WORKSHOP ON STATISTICAL AND COMPUTATIONAL THEORIES OF VISION, Jul. 13, 2001 (hereinafter referred to as Document 1), and methods that utilize symmetrical features of the human face, template matching, neural networks and the like.
With the Viola & Jones method, a plurality of identification processes are executed based on the results (feature amounts) of Adaboost learning. These identification processes are implemented by cascade processing, and each identification process outputs, as a result of having performed identification, either True when the next identification process is to be performed or False when the next identification process is not to be performed. If the result of an identification process is False, the identification processing ends. FIG. 15A shows example feature amounts obtained as a result of the learning used in such processing. A feature amount 210 exhibits the feature that, when a small rectangular portion around the eyes is compared with a portion beneath the eyes (cheek portion), the portion around the eyes is darker than the portion beneath the eyes. A feature amount 211 exhibits the feature that, in the portion around the eyes, the portion of each eye is dark and the glabellar portion between the eyebrows is lighter than the portion of each eye. Input data are compared to such results of learning (learnt feature amounts), and if True is output for all of the feature amount identification processes, it is determined that the input data indicate a (human) face.
Also, with the Viola & Jones method, identification processing is sectioned into sections (hereinafter referred to as stages), True/False identification is performed for each stage, and thereby identification of face or non-face is performed. Also, earlier stages use only a simple feature so that the probability of a false negative (determination of a face as non-face, or an oversight) is minimized and the probability of a false positive (determination of a non-face as face, or an erroneous detection) is relatively high. Using only simple features enables identification processing with a reduced number of computations, and thus high-speed processing is possible even when the processing is performed using processors. Furthermore, in order to detect a face existing in a part of an entire image, a rectangular region is clipped from the entire image to identify the clipped region. According to the above-described method, more rectangular regions can be efficiently identified as False (non-face) in earlier stages, and thus the face detection processing over the entire image can be completed in a short time.
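The stage-by-stage early termination described above can be illustrated with a minimal Python sketch. The function name run_cascade and the toy threshold stages are hypothetical and merely stand in for comparisons against learnt feature amounts; the sketch is not the implementation of Document 1.

```python
from typing import Callable, Sequence, Tuple

# A stage takes a candidate rectangular region and returns True (continue)
# or False (reject). Reducing a region to one float is an assumption made
# purely for illustration.
Stage = Callable[[float], bool]


def run_cascade(stages: Sequence[Stage], region: float) -> Tuple[bool, int]:
    """Evaluate stages in order and stop at the first stage returning False.

    Returns (accepted, number_of_stages_evaluated).
    """
    evaluated = 0
    for stage in stages:
        evaluated += 1
        if not stage(region):
            return False, evaluated  # rejected: later stages are skipped
    return True, evaluated  # True in every stage: judged to be a face


# Toy stages: later stages apply stricter thresholds to the same score.
# Real stages would compare the region against learnt feature amounts.
toy_stages = [lambda s: s > 0.2, lambda s: s > 0.5, lambda s: s > 0.8]

low = run_cascade(toy_stages, 0.1)   # rejected in stage 0
mid = run_cascade(toy_stages, 0.6)   # rejected in stage 2
high = run_cascade(toy_stages, 0.9)  # passes all three stages
```

In this sketch, a region rejected in the stage 0 costs one stage evaluation, whereas a region that passes every stage costs three, which is why rejecting many regions early shortens the total processing time.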
When rectangular regions are clipped from an entire image to carry out the identification processing sequentially on the rectangular regions, several methods are conceivable for determining the order in which rectangular regions are clipped. A widely used method is one in which scanning is performed pixel by pixel in the main scanning direction (horizontal direction) with the pixel on the upper left of the image set as a starting point. This scanning method will be described with reference to FIG. 14A. Strip-shaped regions created by dividing an input image in the main scanning direction by the height of a rectangular region on which the identification processing is performed are called bands. In FIG. 14A, Band_A is a band whose top corner is the pixel on the upper left of the input image. Band_a is a band whose top corner is the pixel at a position shifted in the sub-scanning direction (vertical direction) by one pixel from the top corner of Band_A. In this scanning method, first, the pixel on the upper left of the input image is set as a starting point, and identification processing is performed on a rectangular region (subwindow) in which the upper left pixel of the rectangular region coincides with the starting point. Next, the identification processing is performed sequentially on rectangular regions at positions each shifted by one pixel in the main scanning direction until the right edge of a rectangular region coincides with the right edge of the input image. The processing on Band_A is completed at this time. Next, the pixel at a position shifted by one pixel in the sub-scanning direction from the starting point used when Band_A was processed is set as a starting point, and the identification processing is performed sequentially on rectangular regions at positions each shifted by one pixel in the main scanning direction until the right edge of a rectangular region coincides with the right edge of the input image. The processing on Band_a is completed at this time. After that, the processing is performed on each band with a shift by one pixel in the sub-scanning direction until the lower edge of a rectangular region coincides with the lower edge of the input image.
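The scanning order described above can be sketched as a simple generator. The function name scan_positions and its parameters are hypothetical; the sketch assumes a one-pixel shift in both the main scanning and sub-scanning directions, as in the description.

```python
from typing import Iterator, Tuple


def scan_positions(image_w: int, image_h: int,
                   win_w: int, win_h: int) -> Iterator[Tuple[int, int]]:
    """Yield the upper-left (x, y) of each rectangular region in scan order.

    Each value of y corresponds to one band; within a band, x advances by
    one pixel until the right edge of the region reaches the image edge.
    """
    for y in range(image_h - win_h + 1):      # sub-scanning direction
        for x in range(image_w - win_w + 1):  # main scanning direction
            yield x, y


# A 4x3 image scanned with a 2x2 region yields two bands of three positions.
positions = list(scan_positions(4, 3, 2, 2))
```

The number of positions grows with the image area, which is why the cost of the stage 0 dominates when most regions are rejected early.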
The transition of the identification result from False to True and then from True to False as scanning proceeds in an initial stage (stage 0) of the identification processing, when sequential identification processing is performed using the scanning method described above, will be described with reference to FIGS. 15A to 15E. It is assumed that in the stage 0, the identification processing is performed using the feature amount 210 shown in FIG. 15A. FIGS. 15B to 15E are diagrams showing relative positions between the feature amount 210 and a face portion when, with respect to a face portion of the input image, rectangular regions are scanned in the main scanning direction. At the rectangular region position shown in FIG. 15C, the face is positioned substantially at the center of the rectangular region, and thus True (likely to be a face) is determined as a result of comparison against the feature amount 210. FIGS. 15B and 15D respectively show the left and right edge rectangular regions that are determined to be True (likely to be a face) as a result of comparison against the feature amount 210. In other words, a rectangular region at the position shifted by one pixel to the left from FIG. 15B is determined as False (non-face), and a rectangular region at the position shifted by one pixel to the right from FIG. 15D is also determined as False (non-face) as a result of identification. FIG. 15E shows the transition of the identification result from False to True and then from True to False as scanning proceeds in FIG. 15B to FIG. 15D.
When the identification processing is performed while shifting the rectangular region little by little, as described above, the identification result repeatedly transitions from False to True and then from True to False as scanning proceeds. On this occasion, the frequency of occurrence of True and False varies according to the density of face portions included in the input image. How the frequency of occurrence of True and False varies in the stage 0 due to the density of face portions will be described with reference to FIGS. 16A to 16C. In FIGS. 16A to 16C, T is shown in the upper left of a rectangular region whose identification result was True, and F is shown in the upper left of a rectangular region whose identification result was False.
FIG. 16A shows an example in which there is only one face portion within one band. Nine Ts (True) are in succession and thereafter 27 Fs (False) are in succession with the progress of scanning. FIG. 16B shows an example in which two face portions are spaced apart from each other within the same band. Nine Ts (True) are followed by 6 Fs (False), and further 9 Ts (True) are followed by 6 Fs (False). FIG. 16C shows an example in which two face portions are adjacent to each other within the same band. Nine Ts (True) are followed by one F (False), and further 9 Ts (True) are followed by one F (False).
It can be seen from the foregoing that whichever of True and False, which are the output results from discriminators used in the face detection processing, has a higher frequency of occurrence depends on the density of face portions included in the input image. In the face detection, the identification processing is sectioned into stages, and True or False is determined for each stage. Hereinafter, the probability of occurrence of True in each stage is referred to as “passage rate”. In the case of FIG. 16A, the passage rate of the stage 0 is calculated from the ratio between T (True) and F (False) to be 1/4. Likewise, the passage rate is 3/5 in the case of FIG. 16B, and the passage rate is 9/10 in the case of FIG. 16C.
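The passage rates quoted above follow directly from the counts of T and F given for FIGS. 16A to 16C. The following Python sketch recomputes them; the function name passage_rate and the string encoding of the identification results are assumptions made for illustration.

```python
from fractions import Fraction


def passage_rate(results: str) -> Fraction:
    """Return the fraction of 'T' results in a sequence of 'T'/'F' results."""
    return Fraction(results.count("T"), len(results))


# Result sequences as described for one band in each figure.
fig_16a = "T" * 9 + "F" * 27        # one face portion in the band
fig_16b = ("T" * 9 + "F" * 6) * 2   # two face portions spaced apart
fig_16c = ("T" * 9 + "F" * 1) * 2   # two adjacent face portions

rate_a = passage_rate(fig_16a)  # 9/36  = 1/4
rate_b = passage_rate(fig_16b)  # 18/30 = 3/5
rate_c = passage_rate(fig_16c)  # 18/20 = 9/10
```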
Next, a description will be given of the relationship between the passage rate of each stage and the probability (accumulated passage rate) that True is returned as an identification result successively from the initial stage to a particular stage in cascade processing in which a plurality of stages are connected in series.
The total number of processes (the number of input rectangular regions) of the first or leading stage (stage 0) of the identification processing is defined as S. Only the rectangular regions identified as True in the first stage of the identification processing, which is the preceding stage, are input to the next second stage of the identification processing. Accordingly, the data amount, or in other words, the number of rectangular regions, processed by the second stage of the identification processing will be the product (S*p[0]) obtained by multiplying the number of rectangular regions processed by the first stage of the identification processing by the passage rate p[0] of the first stage of the identification processing. Furthermore, the data amount, or in other words, the number of rectangular regions, processed by the third stage of the identification processing amounts to the product, (S*p[0])*p[1], obtained by multiplying the number of rectangular regions processed by the second stage of the identification processing by the passage rate p[1] of the second stage of the identification processing. Hereinafter, with the same calculation, the data amount, or in other words, the number of rectangular regions, processed by the identification processing in the stage N can be represented as follows: S*p[0]*p[1]* . . . *p[N−2]*p[N−1].
Hereinafter, p[0]*p[1]* . . . *p[N−1] is referred to as the accumulated passage rate P[N] of the identification processing in the stage N. Also, P[0]=1 because all of the input data is input to a discriminator in the first stage (the data is input to the discriminator in the first stage with a passage rate of 100%).
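The relationship between the per-stage passage rates p[n] and the accumulated passage rate P[N] can be sketched as follows. The function names and the sample passage rates are hypothetical; only the definitions P[0]=1 and P[N]=p[0]*p[1]* . . . *p[N−1] are taken from the description.

```python
from typing import List, Sequence


def accumulated_passage_rates(p: Sequence[float]) -> List[float]:
    """Return [P[0], P[1], ..., P[len(p)]] with P[0] = 1 and
    P[N] = p[0] * p[1] * ... * p[N-1]."""
    acc = [1.0]
    for rate in p:
        acc.append(acc[-1] * rate)
    return acc


def regions_per_stage(total: float, p: Sequence[float]) -> List[float]:
    """Return the expected number of regions input to each stage.

    Stage N receives total * P[N] regions; the final entry of the
    accumulated list (regions passing every stage) is not a stage input.
    """
    acc = accumulated_passage_rates(p)
    return [total * acc[n] for n in range(len(p))]


# Hypothetical passage rates for a three-stage cascade.
sample_p = [0.5, 0.8, 0.25]
sample_P = accumulated_passage_rates(sample_p)    # [1.0, 0.5, 0.4, 0.1]
sample_load = regions_per_stage(1000, sample_p)   # inputs to stages 0..2
```

With these sample values, 1000 input regions result in roughly 500 regions reaching the stage 1 and roughly 400 reaching the stage 2.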
As described earlier, the passage rate varies depending on the type of input image and the processing position within the image (the position of a rectangular region to be processed). In other words, the passage rate of an image having a high face density such as a group photograph generally is higher than the passage rate of an image having a low face density such as a landscape photograph. Also, even in a group photograph, in the case of the input image having a landscape in the upper portion of the photograph and people in the lower portion of the photograph, the face density is higher in the lower portion of the photograph. Accordingly, the passage rate during identification processing on the lower portion (people portion) of the photograph having a high face density is generally higher than the passage rate during identification processing on the upper portion (landscape portion) of the photograph having a low face density.
How the accumulated passage rate varies depending on the type of input image and the processing position within the input image will be described specifically with reference to FIGS. 14A, 14B and 14C. FIG. 14A is an example of a group photograph including a relatively large number of face portions in the input image. The average accumulated passage rate at each stage is plotted in a graph shown in FIG. 14C for Band_A, Band_B, Band_C and Band_D shown in FIG. 14A. In bands having a low face density such as Band_A, almost all of the rectangular regions are determined as non-face by the identification processing of the stage 0, and thus the average accumulated passage rate in the stage 1 is substantially 0%. On the other hand, in bands having a high face density such as Band_D, a large number of rectangular regions are determined as likely to be a face in all of the stages, and thus the average accumulated passage rate in the stage 2 is 50% or more. As can be seen from the foregoing, even in an input image having a high face density overall, the accumulated passage rate varies significantly depending on the processing position.
On the other hand, FIG. 14B is an example of a group photograph including a smaller number of face portions in the input image than the group photograph of FIG. 14A. In the graph of FIG. 14C, the average accumulated passage rate at each stage is also plotted for Band_X, Band_Y and Band_Z shown in FIG. 14B. The average accumulated passage rate in Band_X is similar to that of Band_A of FIG. 14A, but in Band_Z having the highest face density in FIG. 14B, the average accumulated passage rate at the stage 2 is below 50%. In other words, in different input images as shown in FIGS. 14A and 14B, the average accumulated passage rate varies significantly even at the same processing position.
The identification processing as typified by the Viola & Jones method is implemented by the multistage cascade processing composed of a plurality of stages, and by determining more rectangular regions as non-face in earlier stages, high-speed processing is achieved. However, as described above, the probability that non-face is determined in each stage varies significantly depending on the type of input image and the processing position within the input image.
Recently, more and more digital cameras and the like are equipped with a face detection function. In the future, in addition to simply incorporating such a function, demand will also increase for high-speed processing so that the face detection processing can be performed in real-time during image capture. General methods for speeding up not only the identification processing but also data processing include increasing the operating frequency, and internally providing a FIFO or RAM in order to prevent rate-limiting in transfer of input/output data. Also, methods for temporally/spatially parallelizing the processing are widely used. With temporally parallel processing (pipeline processing), dedicated discriminators are mounted and connected in cascade manner for stages executed in series so that the discriminators mounted for the stages can be simultaneously operated in parallel, and therefore high-speed processing can be achieved. However, the longest of the processing times of the stages rate-limits the overall processing time. Accordingly, provided that, in all of the stages, the passage rate is 100% and the processing times are uniform, the processing speed can be increased by an amount corresponding to the number of stages (by 3 times if there are 3 stages).
Spatially parallel processing is a speed-up technique in which, in order to further speed up the above-mentioned pipeline processing, a plurality of pipelines are mounted to simultaneously process a plurality of input data pieces. With the spatially parallel processing, if input data can be supplied smoothly to each pipeline processing, the processing speed can be increased by the amount of spatial parallelization (by 4 times if 4 pipelines are mounted). Accordingly, with a configuration in which 4 pipelines, each having 3 stages, are mounted using 12 discriminators, theoretically, the processing speed can be increased by 12 times.
As described above, in order to speed up the identification processing in face detection, the conventional technology combines the temporally parallel processing and the spatially parallel processing to achieve a performance improvement. For example, by mounting 12 discriminators, the conventional technology aims to improve performance by an amount corresponding to the number of pipeline stages×the degree of spatial parallelism (12 times in the above example) compared to the configuration in which one discriminator is mounted.
However, as described earlier, the average accumulated passage rate varies greatly depending on the type of input image and the processing position within the input image. When the face density is high, it is possible to improve the performance by an amount close to the amount corresponding to the number of pipeline stages×the degree of spatial parallelism, but when the face density is low, the performance improvement does not come close to the amount corresponding to the number of pipeline stages×the degree of spatial parallelism. In other words, the speed-up technique using temporally/spatially parallel processing according to the conventional technology is problematic in that sufficient performance improvement cannot be achieved depending on the passage rate, and also in that the performance varies significantly depending on the type of input image and the processing position within the input image.
The performance degradation and performance variation are caused by a situation in which, when the average accumulated passage rate in a stage decreases due to a variation, data supply to the subsequent stage is interrupted, as a result of which the discriminators mounted for the subsequent stage do not operate. The situation in which the discriminators do not operate will be described in detail, taking as an example the case in which the images of FIGS. 14A and 14B are processed with a configuration in which 4 pipelines, each having 3 stages, are mounted using 12 discriminators. FIGS. 17A to 17D and FIGS. 17F and 17G are schematic diagrams respectively showing the average operation state of the discriminators when the identification processing is performed on Band_A, Band_X, Band_B, Band_C, Band_D, Band_Y and Band_Z. FIGS. 17E and 17H are schematic diagrams respectively showing the average operation state of the discriminators at the average passage rate in the image shown in FIG. 14A and at the average passage rate in the image shown in FIG. 14B. It should be noted that the following description assumes that the processing time is the same in all of the discriminators.
In FIGS. 17A to 17H, non-hatched circles indicate discriminators (modules) that are constantly operated, and hatched circles indicate modules that are operated or shut down depending on the processing result in the preceding stage. Also, cross-hatched circles indicate modules that are constantly shut down. If the average accumulated passage rate P[N] of the identification processing in the stage N is determined for each band from the above-mentioned graph shown in FIG. 14C, in Band_A, P[1]=10% and P[2]=2% are obtained. Accordingly, 4 discriminators are constantly operated in the stage 0, but 3 discriminators are constantly shut down in each of the stage 1 and the stage 2, with the only discriminator in operation in the stage 1 having an operating ratio of 40% (=4×0.1) and the only discriminator in operation in the stage 2 having an operating ratio of 8% (=4×0.02). Therefore, in Band_A, it is only possible to acquire performance approximately 4.5 (=4+0.4+0.08) times greater. On the other hand, in Band_D, high average accumulated passage rates are obtained with P[1]=92% and P[2]=90%, and therefore in both the stage 1 and the stage 2, almost all of the discriminators are constantly operated. As a result, the processing speed can be increased by approximately 11.3 (=4+4×0.92+4×0.9) times, close to the target performance. However, the average accumulated passage rate of the entire image of FIG. 14A is P[1]=50%, P[2]=40%, and therefore the processing speed can be increased only by approximately 7.6 (=4+4×0.5+4×0.4) times.
When each band is analyzed in the manner described above, in Band_D of FIG. 14A, because the average accumulated passage rate in each stage is high, the performance is improved by approximately 11.3 times, which is close to the target value of 12 times. However, the performance is improved by only approximately 8.4 times in Band_C, by only approximately 5.8 times in Band_B, and by only approximately 4.5 times in Band_A. Consequently, in the entire image, the performance is improved by only approximately 7.6 times. Likewise, in the image shown in FIG. 14B, the performance is improved by only approximately 4.5 times in Band_X (the same as Band_A of FIG. 14A), by only approximately 4.9 times in Band_Y, and by only approximately 6.8 times in Band_Z, and in the entire image, the performance is improved by only approximately 4.7 times, which is even lower than in the image of FIG. 14A.
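The estimates above can be reproduced with a short Python sketch of the same additive model: with an equal processing time in every discriminator, each stage N contributes (number of pipelines)×P[N] busy discriminators on average. The function name estimated_speedup is hypothetical; the accumulated passage rates for Band_A, Band_D and the entire image of FIG. 14A are the values stated in the description.

```python
from typing import Sequence


def estimated_speedup(pipelines: int, accumulated: Sequence[float]) -> float:
    """Return the estimated speed-up relative to a single discriminator.

    accumulated holds P[0], P[1], ... for the mounted stages (P[0] = 1).
    Assumes equal processing time in all discriminators.
    """
    return sum(pipelines * P_n for P_n in accumulated)


ideal = estimated_speedup(4, [1.0, 1.0, 1.0])          # 4 pipelines x 3 stages
band_a = estimated_speedup(4, [1.0, 0.10, 0.02])       # 4 + 0.4 + 0.08
band_d = estimated_speedup(4, [1.0, 0.92, 0.90])       # 4 + 3.68 + 3.6
whole_image = estimated_speedup(4, [1.0, 0.50, 0.40])  # 4 + 2.0 + 1.6
```

Under this model, the gap between the ideal value and the value for a low-density band corresponds to the discriminators of the stage 1 and the stage 2 that remain idle because no data is supplied to them.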
The above description was given assuming that the processing time was the same in all of the discriminators, but in practice, each stage has a different processing load. For this reason, there is a possibility that rate-limiting of processing might occur (the longest of the processing times of the stages might rate-limit the overall processing time) in the temporally parallel processing described earlier, and this may cause a further performance degradation.
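The rate-limiting effect of unequal stage processing times can be sketched as follows. The function name pipeline_speedup and the sample processing times are hypothetical, and the sketch assumes a passage rate of 100% in every stage and steady-state operation of a single pipeline.

```python
from typing import Sequence


def pipeline_speedup(stage_times: Sequence[float]) -> float:
    """Return the steady-state speed-up of a pipeline over serial execution.

    Serial execution takes sum(stage_times) per input, whereas a pipeline
    completes one input every max(stage_times), so the slowest stage
    limits the overall rate.
    """
    return sum(stage_times) / max(stage_times)


uniform = pipeline_speedup([1.0, 1.0, 1.0])  # equal loads: 3 times
skewed = pipeline_speedup([1.0, 2.0, 1.0])   # stage 1 twice as slow: 2 times
```

In the skewed case, the discriminators of the stage 0 and the stage 2 are idle half of the time while waiting for the stage 1, which illustrates the further performance degradation mentioned above.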
The identification processing of the discriminators for each stage is defined by the feature amount used for the identification. Therefore, if the feature amounts and the connection relationship among the discriminators can be changed, the assignment of discriminators to each stage can be adjusted to disperse loads. Conventionally, various dynamic load balancing methods have been proposed in order to improve and stabilize the processing performance by making the operating ratios of the processors uniform. For example, Japanese Patent Laid-Open No. 2003-256221 (hereinafter referred to as Document 2) presents the following proposal. Specifically, processes generated by parallel programs are assigned to processing timeslots of a plurality of processors according to the time corresponding to the processor distribution ratio preset for each parallel program. It is then determined whether a plurality of parallel processes generated by a parallel program can be assigned to idle timeslots, to which no process has been assigned, of the processing times of the processors so as to be capable of parallel operation. If parallel operation is possible, other parallel processes are additionally assigned to the idle timeslots, and the processors execute the parallel processes assigned to the processing timeslots of the processors.
According to the technique of Document 2, processes that require a turn-around time guarantee are assigned to predetermined timeslots, and a plurality of parallel processes capable of parallel operation are additionally assigned to idle timeslots, whereby the operating ratios of the processors are improved while the turn-around time is guaranteed. However, Document 2 only gives consideration to the case where processes having predetermined loads are processed. In other words, none of the conventional technology performs control focusing on the passage rate and processing time of each stage. Accordingly, with data processing (so-called cascade processing) in which whether or not to execute the next processing is determined based on a processing result, such as the face detection according to the Viola & Jones method, the conventional technology is disadvantageous in that, when the load (execution time) of a process varies depending on the input data, the effect of suppressing the performance degradation and the performance variation is small.