As an example of technology related to methods for finding correspondence between image segments and sound segments, Patent Document 1 (Japanese Patent Kokai Publication No. JP2004-56286A) discloses an image display method intended to improve accuracy in finding correspondence between a person in an image and speech, and to display text data converted from voice in accurate correspondence with the speaker. This method extracts a person area from image data, calculates an image feature, and estimates a person based on the image feature; it also calculates a voice feature from voice data and estimates a person from the voice feature. It then collects statistics on the situations in which a person estimated from the image data and a person estimated from the voice data appear in the same scene at the same time and, based on the statistical result, identifies the person who has generated the voice data and displays the text data converted from voice near the identified person on the display screen.
The method disclosed in Patent Document 1 finds correspondence between the person segments, generated by classifying image segments in an input video according to the persons that appear, and the voice segment groups, generated by classifying the sound segments according to the persons that appear, based on statistics on the situations in which person segments and voice segments appear at the same time in the same scene.
For example, between the person segments of N persons estimated from the image data and the voice segments of M persons estimated from the sound data, the number of times that person segments and voice segments appear in the same scene at the same time is counted and, based on the counted result, correspondence is established between the person segments and voice segments that have a high appearance correlation. As shown in FIG. 17, the configuration that establishes correspondence in this way comprises a person extraction means 600 that detects a person from the image data of an input video and extracts the image feature, a voice extraction means 601 that detects human voices from the sound data and extracts the voice feature, a person segment classification means 602 that classifies the segments in which persons are detected based on the image feature, a voice segment classification means 603 that classifies the segments in which human voices are detected based on the voice feature, a simultaneous occurrence statistic means 604 that collects statistics on the situations in which person segments and voice segments occur simultaneously, and a person identification means 605 that finds correspondence between person segments and voice segments based on the statistical quantity.
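The simultaneous-occurrence statistics described above can be illustrated with a minimal sketch. It is not the implementation of Patent Document 1: the function names, the scene representation (a pair of label sets per scene), and the rule of picking the person label with the highest count for each voice label are all assumptions made for illustration.

```python
from collections import Counter


def cooccurrence_counts(scenes):
    """Count how often each (person, voice) label pair shares a scene.

    `scenes` is an iterable of (person_labels, voice_labels) pairs, one per
    scene. This representation is hypothetical, chosen for the sketch.
    """
    counts = Counter()
    for person_labels, voice_labels in scenes:
        for p in set(person_labels):
            for v in set(voice_labels):
                counts[(p, v)] += 1
    return counts


def match_voices_to_persons(scenes):
    """Assign each voice label to the person label it co-occurs with most."""
    best = {}
    # Sorting makes tie-breaking deterministic (first pair in sorted order wins).
    for (p, v), n in sorted(cooccurrence_counts(scenes).items()):
        if v not in best or n > best[v][1]:
            best[v] = (p, n)
    return {v: p for v, (p, _) in best.items()}


# Toy input: N = 2 person classes and M = 2 voice classes over five scenes.
scenes = [
    ({"P1", "P2"}, {"V1"}),
    ({"P1"}, {"V1"}),
    ({"P2"}, {"V2"}),
    ({"P1", "P2"}, {"V2"}),
    ({"P2"}, {"V2"}),
]
counts = cooccurrence_counts(scenes)
mapping = match_voices_to_persons(scenes)
print(mapping)
```

In this toy input, V1 shares two scenes with P1 and one with P2, while V2 shares three scenes with P2 and one with P1, so the sketch pairs V1 with P1 and V2 with P2.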
An example of related technology for a system that individually establishes correspondence between an image segment and a sound segment is disclosed in Non-Patent Document 1. Non-Patent Document 1 discloses a method for establishing correspondence between voices and videos via DP matching, with the aim of synchronizing multimedia data (the text in the scenario document, video, and voice) that is not temporally synchronized in advance. The method disclosed in Non-Patent Document 1 takes image segments and sound segments that have been extracted in advance as observation results of the same object in the input video, and establishes correspondence between them by non-linearly expanding and compressing the appearance patterns of the segments to find an optimal match. For example, based on the image feature and the sound feature of a particular person acquired in advance, the disclosed method obtains a “pattern regarding presence/absence of a particular person” from the image data of the input video and another “pattern regarding presence/absence of a particular person” from the voice data, and establishes correspondence between image segments and sound segments by non-linearly expanding and compressing these patterns to find the optimal match. As shown in FIG. 16, the configuration comprises an image segment extraction means 500 that extracts the image segments in which a particular person appears from the image data in the input video, a voice segment extraction means 501 that extracts the voice segments of a particular person from the voice data, and a DP matching means 502 that establishes correspondence between image segments and sound segments by non-linearly expanding and compressing the appearance patterns of the image segments and the appearance patterns of the sound segments to find an optimal match.
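The non-linear expansion and compression of presence/absence patterns can be sketched with a textbook dynamic-programming (dynamic time warping style) alignment. This is only an illustration under stated assumptions, not the procedure of Non-Patent Document 1: the binary encoding (1 = the particular person is present), the absolute-difference local cost, the three allowed step directions, and the name `dp_match` are all hypothetical choices.

```python
def dp_match(image_pattern, voice_pattern):
    """Align two presence/absence sequences by dynamic programming.

    Returns (total_cost, path), where path is a list of (i, j) index pairs
    mapping image_pattern[i] to voice_pattern[j].
    """
    n, m = len(image_pattern), len(voice_pattern)
    inf = float("inf")
    # cost[i][j]: best cost of aligning the first i image samples
    # with the first j voice samples.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(image_pattern[i - 1] - voice_pattern[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j - 1],  # advance both sequences
                cost[i - 1][j],      # stretch the voice sample
                cost[i][j - 1],      # stretch the image sample
            )
    # Trace back from the end to recover the alignment path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# Same appearance order, but the voice pattern is stretched in time.
image_pattern = [0, 1, 1, 0, 1]
voice_pattern = [0, 0, 1, 0, 0, 1, 1]
total_cost, path = dp_match(image_pattern, voice_pattern)
print(total_cost, path)
```

Because the two toy patterns differ only in how long each presence or absence run lasts, the warping absorbs the difference and the total cost is zero; patterns with a different order of appearances would incur a positive cost.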
Patent Document 1:
Japanese Patent Kokai Publication No. JP2004-56286A
Non-Patent Document 1:
Yoshitomo Yaginuma and Masao Sakauchi, “A Proposal of an Approach for Making Correspondence for Drama Video, Sound and Scenario Document Using DP Matching”, IEICE (Institute of Electronics, Information and Communication Engineers) Transactions D-II, Vol. J79-D-II, No. 5, May 1996, pp. 747-755
Non-Patent Document 2:
Iwai, Lao, Yamaguchi, and Hirayama, “A Survey on Face Detection and Face Recognition”, Study Report from Information Processing Society of Japan (CVIM-149), 2005, pp. 343-368
Non-Patent Document 3:
Shigeru Akamatsu, “Computer Recognition of Human Face—A Survey—” IEICE Transactions Vol. J80-A No. 8 pp. 1215-1230, August 1997