1. Field of the Invention
The present invention relates to a technique for extracting, from moving image data, a representative frame that straightforwardly represents the video contained in that moving image data.
2. Description of the Related Art
The widespread use of digital cameras, digital video camcorders, and the like in recent years has made it possible for even individuals to capture large quantities of still images and moving images. Since moving image data generally involves an enormous amount of data and, unlike still images, includes a time axis, it is difficult to ascertain the content of such data in a simple fashion. Consequently, a user who wishes to obtain an overview of the content or to search for a desired scene must operate the apparatus to perform a fast-forward or fast-rewind operation. Accordingly, in order to allow the content of a moving image to be ascertained in a short period of time, techniques for extracting a representative frame have been proposed. Such techniques select, from among the frames of the moving image, an appropriate frame that well represents the content of the moving image, and handle the selected frame as information indicating that content.
There is a growing tendency for digital cameras purchased for use in the ordinary home to be used for recording domestic events and happenings, such as the growth of children. Often, therefore, the subject shot using such a camera is a person.
Accordingly, when the content of a moving image in home video is to be ascertained, it is very important that the frame selected as a representative image be one that conveys who appears in the moving image. Further, since a domestic event or happening, or the like, is the main theme, if a plurality of persons appear, it is preferable that the representative image also convey that these persons appear simultaneously.
The specification of Japanese Patent Laid-Open No. 2005-101906 (Patent Document 1) discloses a technique which directs attention toward the persons who appear in a moving image and splits the moving image into scenes whenever the makeup of the appearing persons changes. When this is done, information indicating the makeup of the persons is assigned to each scene and this information is indexed. By utilizing this information indicating the makeup of the persons, a scene can be searched for and retrieved.
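The scene splitting described for Patent Document 1 can be pictured with the following minimal Python sketch. It is an illustration only and is not taken from the cited document: the function name split_scenes, the Scene tuple, and the use of a per-frame set of person labels as input are all assumptions made for the example.

```python
from typing import FrozenSet, List, Sequence, Tuple

# Hypothetical scene record: (first frame index, last frame index, persons present).
Scene = Tuple[int, int, FrozenSet[str]]


def split_scenes(frames: Sequence[FrozenSet[str]]) -> List[Scene]:
    """Start a new scene whenever the set of detected persons changes.

    `frames` holds, for each frame, the set of person labels detected in it
    (an assumed input format for this sketch).
    """
    scenes: List[Scene] = []
    start = 0
    for i in range(1, len(frames) + 1):
        # Close the current scene at the end of the input or when the makeup changes.
        if i == len(frames) or frames[i] != frames[start]:
            scenes.append((start, i - 1, frames[start]))
            start = i
    return scenes


a = frozenset({"A"})
ab = frozenset({"A", "B"})
b = frozenset({"B"})
index = split_scenes([a, a, ab, ab, b])
```

Each resulting tuple carries the makeup of persons for its scene, so a scene could be retrieved by filtering the list on that set.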
Further, the specification of Japanese Patent No. 3312105 (Patent Document 2) discloses a representative frame extraction technique that is based upon the statistical features of an object. Specifically, the image of an object desired to be detected is learned in advance, an evaluation value is found using a dictionary on a per-object basis by a prescribed method, and a moving image index is generated based upon the evaluation value of each object obtained frame by frame. Furthermore, the specification of Japanese Patent Laid-Open No. 2006-129480 (Patent Document 3) discloses a technique that correlates and utilizes the face image of a speaking individual, who is the object of image capture during a conference, and the time interval (timeline) of the spoken voice information.
However, with the technique described in Patent Document 1, face detection is performed frame by frame and a representative image is extracted based solely upon the result. In a case where the subject is moving, there is a possibility that the face detection will fail. In other words, since the face is not always pointed toward the camera, there are instances where a person cannot be detected even though the person is present as the subject. Such a phenomenon is especially conspicuous in a case where a child is the subject. Further, this phenomenon occurs comparatively often when there is a plurality of subjects and the subjects are talking to each other. As a result, even if the same person is being captured continuously, when detection of this person fails, it is determined that the makeup of individuals has changed, and a problem which arises is that a representative frame is created a number of times for one and the same appearing person.
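The problem described above can be reproduced with a small Python sketch. It is an illustration under stated assumptions, not the method of any cited document: the helper count_scenes and the per-frame sets of person labels are hypothetical, and a scene is simply taken to be a run of consecutive frames with an identical set of detected persons.

```python
from itertools import groupby


def count_scenes(per_frame_persons):
    """Count runs of consecutive frames whose detected-person sets are identical."""
    return sum(1 for _ in groupby(per_frame_persons))


child = frozenset({"child"})
nobody = frozenset()

# The same child is on screen for all six frames.
ideal = [child] * 6
# Assumed detector output when the child turns away for two frames.
with_miss = [child, child, nobody, nobody, child, child]

scenes_ideal = count_scenes(ideal)
scenes_with_miss = count_scenes(with_miss)
```

With the ideal detections the footage forms a single scene, whereas the two missed frames break it into three, so a purely frame-by-frame approach would yield more than one representative frame for the same person.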
Further, since only an evaluation value per frame is used, if a person who is not the main person appears large in the center of a frame even momentarily, the evaluation value rises and this frame ends up being extracted as the representative image. Further, a large number of similar representative frames are extracted, and redundancy occurs if these are adopted as indices. Furthermore, since a case where shooting intervals involving a plurality of persons overlap each other is not considered, two representative images are extracted from a scene such as one in which two persons are conversing, and the representative images are redundant. Moreover, since only a face region is cut from the image, the fact that two people are conversing cannot be ascertained. As a result, there are cases where the user can no longer grasp the content of a moving image appropriately.