1. Field of the Invention
The present invention relates to a technique for extracting a representative frame, which straightforwardly represents video contained in moving image data, from the moving image data.
2. Description of the Related Art
The widespread use of digital cameras, digital video camcorders and the like in recent years has made it possible for even individuals to capture large quantities of moving images. Since video generally involves an enormous amount of data, a user who wishes to ascertain an overview of the content or to search for a desired scene must fast-forward or fast-rewind the video. Accordingly, in order to allow the content of a moving image to be ascertained in a short period of time, representative frame extraction techniques have been proposed for selecting and presenting a frame that well represents the content of the moving image.
For example, the specification of Japanese Patent Laid-Open No. 2002-223412 (Patent Document 1) discloses a technique in which a series of images obtained by uninterrupted shooting using a single camera is adopted as one series of shots and a key frame is selected from the series of shots based upon a moment of playback time, such as the beginning, end or middle of the shot. It is arranged so that a plurality of shots are connected together as a single scene based upon similarity of the key frames and a prescribed number of key frames are selected from each scene. Further, in order that a frame containing a person who appears in the frame will be selected as a key frame, it is arranged so that a frame that includes a face region is selected preferentially.
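By way of illustration only, the position-based key frame selection with face preference described above might be sketched as follows. This is not the implementation of Patent Document 1. The `Frame` type, the `has_face` flag (assumed to come from some face detector) and the function `select_key_frame` are hypothetical names introduced for this sketch, and choosing the face-containing frame nearest the target position is an assumed interpretation of "selected preferentially".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    index: int      # frame number within the moving image
    has_face: bool  # hypothetical flag: True if a face region was detected


def select_key_frame(shot: List[Frame], position: str = "middle") -> Frame:
    """Pick one key frame from a shot by playback position.

    Frames containing a face are preferred; if the shot has none,
    every frame in the shot is a candidate.
    """
    if not shot:
        raise ValueError("shot must contain at least one frame")

    # Target offset within the shot for each supported playback position.
    targets = {"beginning": 0, "middle": len(shot) // 2, "end": len(shot) - 1}
    target = targets[position]

    # Offsets of face-containing frames, falling back to all offsets.
    candidates = [i for i, f in enumerate(shot) if f.has_face]
    if not candidates:
        candidates = list(range(len(shot)))

    # Choose the candidate closest to the target playback position.
    best = min(candidates, key=lambda i: abs(i - target))
    return shot[best]
```

Note that a frame selected in this way is only known to contain a face; nothing in the sketch considers the size or orientation of that face.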
The specification of Japanese Patent Laid-Open No. 2005-101906 (Patent Document 2) discloses a technique in which, if the makeup of persons that appear in a moving image has changed, the scenes of the moving image are split, an index indicating the makeup of the persons is assigned to each scene and the scenes are searched. Furthermore, the specification of Japanese Patent Laid-Open No. 2001-167110 (Patent Document 3) discloses a technique for performing detection tailored to faces. Specifically, the persons that appear in video are distinguished by identifying detected faces, and a representative frame is selected based upon face orientation, face size, number of faces, and the like in an interval in which a face has been detected.
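As a rough sketch of face-based selection of this kind, a representative frame might be chosen by scoring each frame of a face-detected interval on face size, face orientation and number of faces. The types `Face` and `FaceFrame`, the functions `frame_score` and `select_representative`, the weights and the linear scoring formula are all assumptions made for illustration; Patent Document 3 is not stated here to use this formula.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Face:
    area_ratio: float  # face area divided by frame area, in the range 0..1
    yaw_deg: float     # horizontal face angle in degrees; 0 means frontal


@dataclass
class FaceFrame:
    index: int         # frame number within the moving image
    faces: List[Face]  # faces detected in this frame (may be empty)


def frame_score(frame: FaceFrame, w_size: float = 1.0,
                w_front: float = 1.0, w_count: float = 0.1) -> float:
    """Score a frame from face size, orientation and count (assumed weights)."""
    if not frame.faces:
        return 0.0
    largest = max(f.area_ratio for f in frame.faces)
    # 1.0 for a frontal face, falling linearly to 0.0 at 90 degrees or more.
    frontal = max(1.0 - min(abs(f.yaw_deg), 90.0) / 90.0 for f in frame.faces)
    return w_size * largest + w_front * frontal + w_count * len(frame.faces)


def select_representative(interval: List[FaceFrame]) -> Optional[FaceFrame]:
    """Return the best-scoring frame that contains a face, or None."""
    with_faces = [f for f in interval if f.faces]
    if not with_faces:
        return None
    return max(with_faces, key=frame_score)
```

In this sketch an interval in which no face is detected yields no representative frame at all, whatever else the frames may show.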
Furthermore, the specification of Japanese Patent No. 3312105 (Patent Document 4) discloses a representative frame extraction technique that is based upon the statistical features of an object. Specifically, the image of an object desired to be detected is learned in advance, an evaluation value is found using a dictionary on a per-object basis by a prescribed method, and an index is generated based upon the evaluation value of each object obtained frame by frame.
However, with the technique described in Patent Document 1, even if a key frame containing a face has been selected, there are instances where the frame image is not in a state favorable to the user, as when the face region is too small or the face is turned sideways. Further, there are many instances where the subject shot with a digital camera is a family, as when the goal is to record the growth of a child. In such cases, the techniques described in Patent Documents 2 and 3, which extract a representative frame by using a person as the object of interest, yield only a line-up of representative frames in which the faces of family members appear. In other words, since representative frames are selected by focusing upon the moving image intervals in which persons and faces could be detected, frames that include scenery or subjects that leave an impression, but in which no persons or faces could be detected, are not selected as representative frames. Furthermore, with the technique described in Patent Document 4, an evaluation value is obtained on a per-frame basis. Consequently, in a case where the purpose is to ascertain the content of home video, a large number of similar frames become indices, which is redundant.
In other words, if a representative frame is selected by taking a specific subject (a face, for example) as the object of interest, “who” appears in the image can be ascertained but information as to “where” the image was shot is missing. Consequently, a problem arises in that a suitable representative frame cannot necessarily be extracted from personal content such as that shot with a home video camera.