1. Field of the Invention
The present invention relates to a digital data processing scheme. In particular, the present invention relates to an apparatus that recognizes a specific signal included in digital data, and more specifically to an apparatus that detects a specific object of shooting from digital image data, and to a system and method associated with the apparatus.
2. Description of the Related Art
In recent years, the spread of digital cameras, digital video cameras, and digital camera functions in mobile phones has facilitated creation of digital image data. Along with this, techniques for detecting faces from digital image data have become generally known. In shooting with a digital camera, for example, the face detection technique is applied to adjusting the exposure, the focus position, or the amount of light for stroboscopic light emission. In image printing, the face detection technique is also applied to detecting faces in an image and adjusting the brightness and tones of the entire image so that the brightness and colors of the detected face areas become appropriate. These functions have been incorporated into products and put on the market.
Another application of the face detection technique is roughly classifying images into, for example, human images and landscape images. By classifying images in this manner, the technique can be used as a means for automatically adding bibliographic information (metadata) to images. That is, the face detection technique is applied to each image to obtain information about “how many faces of which sizes are present at which positions in the image,” and based on this information, each image is classified or retrieved.
Image processing methods to automatically detect a particular pattern of an object of shooting from an image are very useful and can be used to determine a human face, for example. Such image processing methods can be used in many fields such as teleconferencing, man-machine interfaces, security, monitor systems for tracking a human face, and image compression.
For example, non-patent document 1 (Yang et al., “Detecting Faces in Images: A Survey”, IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 24, NO. 1, JANUARY 2002) describes various schemes for detecting faces from an image. Among others, the document describes schemes in which human faces are detected by utilizing several noticeable features (such as two eyes, a mouth, and a nose) and the unique geometric positional relationships among those features, or by utilizing the symmetry of human faces, the skin color of human faces, template matching, a neural network, and the like.
A scheme proposed in non-patent document 2 (Rowley et al., “Neural network-based face detection”, IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 20, NO. 1, JANUARY 1998) is a method of detecting facial patterns in an image by using a neural network. The face detection method according to non-patent document 2 will be briefly described below.
First, image data from which faces are to be detected is read into memory, and a predetermined area to be matched against faces is clipped out from the read image. The pixel value distribution in the clipped-out area is input to a neural network, and an output is obtained through the neural network operations. Here, the weights and threshold levels for the neural network are learned in advance from a vast number of facial image patterns and non-facial image patterns. For example, the area is identified as a face if the output of the neural network is not smaller than 0; otherwise, it is identified as a non-face. The position at which the image pattern is clipped out as the input of the neural network is scanned sequentially across the entire image area, horizontally and vertically, so that faces are detected from the image. In order to detect faces of various sizes, the read image is successively scaled down by predetermined factors, and the above face detection scan is performed on each of the scaled-down images.
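The scan described above can be illustrated with a minimal sketch. It is not the implementation of non-patent document 2: the function names (`downscale`, `scan_pyramid`), the nearest-neighbour scaling, the window size, the step, and the scale factor are all assumptions chosen for illustration, and `classify` stands in for the trained neural network, returning a value that is not smaller than 0 for a face.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]  # row-major grayscale image (hypothetical representation)


def downscale(img: Image, factor: float) -> Image:
    """Nearest-neighbour downscale; a factor greater than 1 shrinks the image."""
    h, w = len(img), len(img[0])
    new_h, new_w = int(h / factor), int(w / factor)
    return [
        [img[min(h - 1, int(y * factor))][min(w - 1, int(x * factor))]
         for x in range(new_w)]
        for y in range(new_h)
    ]


def scan_pyramid(
    img: Image,
    classify: Callable[[Image], float],  # stand-in for the neural network
    win: int = 20,
    step: int = 2,
    factor: float = 1.2,
) -> List[Tuple[int, int, int]]:
    """Return (x, y, size) of every window judged to be a face, in original coordinates."""
    detections = []
    scale = 1.0
    cur = img
    # Keep shrinking the image until the fixed-size window no longer fits.
    while len(cur) >= win and len(cur[0]) >= win:
        # Scan the clipping position vertically and horizontally.
        for y in range(0, len(cur) - win + 1, step):
            for x in range(0, len(cur[0]) - win + 1, step):
                patch = [row[x:x + win] for row in cur[y:y + win]]
                if classify(patch) >= 0:  # not smaller than 0 means "face"
                    detections.append(
                        (int(x * scale), int(y * scale), int(win * scale))
                    )
        scale *= factor
        cur = downscale(img, scale)
    return detections
```

Because the window size is fixed and only the image shrinks, a single classifier covers faces of every size, at the cost of one full scan per scale.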
A further example, which focuses on speeding up the processing, is non-patent document 3 (Viola and Jones, “Rapid Object Detection using a Boosted Cascade of Simple Features”, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'01)). In this report, AdaBoost is used to increase the face discrimination accuracy through an effective combination of many weak discriminators. Each weak discriminator is configured with a Haar-type rectangle feature amount, and an integral image is used to calculate the rectangle feature amount at high speed. In addition, the discriminators obtained with AdaBoost learning are serially connected to configure a cascade face detector. This cascade face detector first uses a simple discriminator (that is, one with a smaller amount of computation) at a preceding stage to immediately remove candidate patterns that are obviously not faces. Only for the remaining candidates is a complex discriminator (that is, one with a larger amount of computation) with higher identification performance used at a following stage to determine whether or not each candidate is a face. The processing is therefore fast, because the complex determination need not be performed for all the candidates.
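The following is a minimal sketch of the two speedup ideas mentioned above, the integral image and the early-rejecting cascade. It is an illustration under stated assumptions rather than the detector of non-patent document 3: the function names are hypothetical, only one two-rectangle feature shape is shown, and each cascade stage is an arbitrary callable instead of an AdaBoost-trained discriminator.

```python
from typing import Callable, List, Sequence

Integral = List[List[int]]


def integral_image(img: Sequence[Sequence[int]]) -> Integral:
    """ii[y][x] holds the sum of img over all rows < y and columns < x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii: Integral, x: int, y: int, w: int, h: int) -> int:
    """Sum of the w-by-h rectangle with top-left corner (x, y), using four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def haar_two_rect_horizontal(ii: Integral, x: int, y: int, w: int, h: int) -> int:
    """Left half minus right half of a w-by-h window (w is assumed to be even)."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)


Stage = Callable[[Integral, int, int], bool]  # hypothetical stage signature


def cascade_accepts(stages: Sequence[Stage], ii: Integral, x: int, y: int) -> bool:
    """Run the stages in order and reject at the first stage that fails."""
    for stage in stages:
        if not stage(ii, x, y):
            return False  # later, more expensive stages are never evaluated
    return True
```

Once the integral image is built, any rectangle sum costs four table lookups regardless of the rectangle's size, which is what keeps each weak discriminator cheap. Ordering the stages from cheapest to most expensive means most non-face windows are rejected before the costly stages run.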
However, the above known examples all apply face detection to still images, not to moving images.
Japanese Patent Laid-Open No. 2005-174352 employs a method in which, in order to detect faces from a moving image in real time, a temporally unchanged area is determined and excluded from the face detection process. Although this method is effective for speedup, it does not integrate face identification results across a plurality of frames. Therefore, improvement in accuracy cannot be expected.
The data that a recognizer relies on is generally called a recognition dictionary. In non-patent document 2, this data consists of the weights and threshold levels for the neural network. In non-patent document 3, it consists of the parameters defining the rectangle feature amounts referred to by the weak discriminators, together with the operational coefficients and threshold levels used to perform the discrimination process from the rectangle feature amounts. A recognition dictionary is usually several dozen KB to several hundred KB in size.
A method of adding the metadata is disclosed in Japanese Patent Laid-Open No. 2004-221753. In this method, information about the name of a shooting location and image information are transmitted from a data server placed at the shooting location to a camera via Bluetooth, and the camera stores images in association with the location information.
In photographic images and home video, human faces are indeed important objects of shooting. Precisely for this reason, however, if, for example, 90% of a group of images consists of images that include faces, it is readily understood that whether a face is present or not is insufficient as bibliographic information. Conversely, if a user prefers taking photographs (or video) of landscapes, whether a person is present or not does not matter in the first place. In this case as well, it can be understood that whether a person is present or not is insufficient as information for distinguishing one image from another.
Therefore, objects of recognition should not be limited to faces and human bodies but may expand to various things, for example dogs, cats, cars, and the like.
However, compared with conventional detection limited to faces, such expansion of the objects of recognition multiplies the processing load by the number of expected objects of recognition.
To solve this problem, it may be possible to enlarge the configuration or enhance its performance as the types of objects of recognition increase. However, this would result in a large configuration, leading to an expensive apparatus.
It may also be possible to perform the processing with the configuration unchanged, but the processing time would then increase as the types of objects of recognition increase. In particular, the processing responsiveness of the digital cameras and video cameras mentioned above would decrease, impairing their usability.
From another viewpoint, if the recognition apparatus is configured as a battery-driven mobile device, a problem arises in that the power consumption increases as the processing load increases. That is, an increase in power consumption reduces the operating time of the device. To avoid this, it may be possible to use a large-capacity battery, but the weight of the entire device would then increase, reducing its portability.
Japanese Patent Laid-Open No. 2004-221753 discloses that information about the name of a shooting location and image information are transmitted from a data server placed at the shooting location to a camera via Bluetooth, and the camera stores images in association with the location information. However, what is recorded is the shooting location; information about what is actually captured in the images is not available.