This invention relates to a signal processing method for measuring the similarity between mutually different arbitrary segments constituting signals, and to an image-voice processing apparatus for measuring the similarity between mutually different arbitrary image and/or voice segments constituting video signals.
There are cases where it is desirable to search for and reproduce interesting or otherwise desired parts of an image application composed of a large amount of varied image data, for example a TV program recorded as video data.
In searching video data and other multimedia data, unlike the data used in many computer applications, one cannot expect to find exactly identical data, so similar data are searched for instead. Therefore almost all the technologies relating to searches of multimedia data are based on similarity-based search, as described in “G. Ahanger and T. D. C. Little, A survey of technologies for parsing and indexing digital video, J. of Visual Communication and Image Representation 7:28-4. 1996.”
In such similarity-based search technologies, the similarity of the contents is first measured numerically. The similarity measurements are then used to rank the data in descending order of similarity to the subject item, according to the standard of measuring similarity. In the list obtained thereby, the most similar data appear near the top.
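The ranking described above can be sketched as follows. This is a minimal illustration, not a method taken from the cited survey: the function name `rank_by_similarity` and the toy similarity measure (negative absolute difference between numbers) are assumptions chosen only to keep the example self-contained.

```python
def rank_by_similarity(query, items, similarity):
    """Return (score, item) pairs sorted from most to least similar to query.

    `similarity` is any function returning a larger number for more
    similar inputs; its definition is left to the caller.
    """
    scored = [(similarity(query, item), item) for item in items]
    # Descending order of similarity, so the best matches come first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


def toy_similarity(a, b):
    # Hypothetical measure for illustration: closer numbers are more similar.
    return -abs(a - b)


ranked = rank_by_similarity(5, [1, 4, 10], toy_similarity)
ranked_items = [item for _, item in ranked]
```

Here `ranked_items` lists the candidates with the item closest to the query first.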
In such a search method based on the contents of multimedia data, image processing, voice processing, and video processing technologies, which are essentially based on signal processing, are first used to extract low-level features of the multimedia data. The extracted low-level features are then used to find the standard of measuring similarity required for similarity-based searches.
Studies on searches based on the contents of multimedia data often focus first on searches of images (still images). In such studies, the similarity among images is measured using a large number of low-level image features such as color, texture, and shape.
Lately, studies on searches based on the contents of video data have also been conducted. In the case of video data, identical parts within long video data are usually searched for. Therefore, in most technologies related to CBR (Content-Based Retrieval), video data are first divided into streams of frames called segments. Those segments are the subject of similarity-based searches. As an existing method for dividing video data into segments, a shot detection algorithm is usually used to divide video data into so-called shots, as described in “G. Ahanger and T. D. C. Little, A survey of technologies for parsing and indexing digital video, J. of Visual Communication and Image Representation 7:28-4. 1996.” In such a search, features that enable comparison based on similarity are extracted from the shots obtained.
However, it is difficult to identify the salient features of shots and to detect features that allow shots to be compared on the basis of similarity. Therefore, the existing approach to searches based on the contents of video data has usually been, in place of the above-mentioned method, to extract representative frames from each shot and to search using those representative frames. These representative frames are generally called “key frames.” In other words, search technologies based on the contents of shots are reduced to search technologies based on the contents of images by comparing the key frames of shots. For example, when color histograms are extracted from the key frames of each shot, the histograms of these key frames can be used to measure the similarity of two shots. This approach is also effective for selecting the key frame.
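The key-frame comparison described above can be sketched as follows. The text does not say how the histograms are compared, so the use of histogram intersection here is an assumption, as are the function names, the representation of a key frame as a list of (R, G, B) tuples, and the choice of four bins per channel.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Normalized color histogram of a key frame.

    `pixels` is a non-empty list of (r, g, b) tuples with 8-bit channels.
    Each channel is quantized into `bins_per_channel` bins (an assumption).
    """
    if not pixels:
        raise ValueError("a key frame must contain at least one pixel")
    n = bins_per_channel
    hist = [0.0] * (n ** 3)
    for r, g, b in pixels:
        # Map each 0..255 channel value to a bin index in 0..n-1.
        idx = (r * n // 256) * n * n + (g * n // 256) * n + (b * n // 256)
        hist[idx] += 1.0
    return [count / len(pixels) for count in hist]


def histogram_intersection(h1, h2):
    """Similarity in [0, 1] for two normalized histograms of equal length."""
    return sum(min(a, b) for a, b in zip(h1, h2))


# Toy key frames: uniformly red, uniformly blue, and half red / half blue.
red_frame = [(255, 0, 0)] * 4
blue_frame = [(0, 0, 255)] * 4
mixed_frame = [(255, 0, 0)] * 2 + [(0, 0, 255)] * 2

sim_same = histogram_intersection(color_histogram(red_frame), color_histogram(red_frame))
sim_diff = histogram_intersection(color_histogram(red_frame), color_histogram(blue_frame))
sim_half = histogram_intersection(color_histogram(red_frame), color_histogram(mixed_frame))
```

Under these assumptions, two shots whose key frames share no colors score 0, identical key frames score 1, and partially overlapping key frames fall in between.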
A simple approach is to always select a fixed frame from each shot. Other methods, which select a larger number of frames, use the frame difference described in “B. L. Yeo and B. Liu, Rapid scene analysis on compressed video, IEEE Transactions on Circuits and Systems for Video Technology, vol.5, no.6, pp.533, December 1995”, the motion analysis described in “W. Wolf, Key frame selection by motion analysis, Proceedings of IEEE Int'l Conference on Acoustics, Speech and Signal Processing, 1996”, and the clustering technology described in “Y. Zhuang, Y. Rui, T. Huang and S. Mehrotra, Adaptive key frame extraction using unsupervised clustering, Proceedings of IEEE Int'l Conference on Image Processing, Chicago, Ill. Oct. 4-7 1998.”
Incidentally, the above-mentioned search technology based on key frames is limited to searches based on the similarity of shots. However, since a typical 30-minute TV program, for example, contains hundreds of shots, the above-mentioned prior search technology requires a tremendous number of extracted shots to be checked, and searching such a huge amount of data is a considerable burden.
Therefore, it has been necessary to mitigate this burden by comparing the similarities among image and voice segments longer than shots, for example scenes and programs, in which segments are grouped together based on a certain correlation.
However, the prior search technologies have not met requirements such as searching for segments similar to a specific commercial, or searching for scenes similar to a scene consisting of a group of related shots depicting the same performance in a TV program.
As mentioned above, almost no published studies devoted to comparisons based on the similarity of segments at higher levels than shots have been found. The only study of this kind is “J. Kender and B. L. Yeo, Video Scene Segmentation via Continuous Video Coherence, IBM Research Report, RC21061, Dec. 18, 1997”. This study provides a method for comparing the similarities between two scenes. The search technology in this study classifies all the shots of video data into categories and then counts the number of shots in every scene attributed to each category. The result obtained is a histogram that can be compared by the standard criteria of comparing similarity. It is reported that the study was successful to some extent in comparing similarity among similar scenes.
However, this method requires that all the shots of the video data be classified. Classifying all the shots is a difficult task and usually requires a technology involving an enormous amount of computation.
Even if this method could classify all the shots exactly, it does not take into account the similarity between categories, and therefore it can give confusing results. For example, suppose that the shots of video data are divided into three categories A, B, and C, that a scene X has no shots of categories B and C but has two shots of category A, and that another scene Y has no shots of categories A and C but has two shots of category B. In this case, according to the method, no similarity is found between scene X and scene Y. However, if the shots in category A and category B are mutually similar, the similarity value should not be zero. In other words, the fact that this method does not take into account the similarities of the shots themselves sometimes leads to such a misjudgment.
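The example of scenes X and Y above can be worked through numerically. The sketch below is illustrative only: histogram intersection is assumed as the histogram comparison, and the cross-category measure with its similarity matrix `CATEGORY_SIM` (including the value 0.8 between A and B) is a hypothetical construction used to show what "taking category similarity into account" could mean. It is not the method of the cited report or of the present invention.

```python
from collections import Counter

CATEGORIES = ["A", "B", "C"]


def category_histogram(shot_categories):
    """Fraction of a scene's shots falling in each category."""
    counts = Counter(shot_categories)
    total = len(shot_categories)
    return [counts[c] / total for c in CATEGORIES]


def intersection(h1, h2):
    """Compares matching categories only (assumed comparison criterion)."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def cross_category_similarity(h1, h2, category_sim):
    """Hypothetical measure that also credits similar but distinct categories."""
    return sum(
        h1[i] * category_sim[i][j] * h2[j]
        for i in range(len(h1))
        for j in range(len(h2))
    )


# Scene X: two shots of category A. Scene Y: two shots of category B.
scene_x = category_histogram(["A", "A"])
scene_y = category_histogram(["B", "B"])

# Assumed similarities between categories; A and B are taken to be close.
CATEGORY_SIM = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

plain = intersection(scene_x, scene_y)
aware = cross_category_similarity(scene_x, scene_y, CATEGORY_SIM)
```

With these assumed values, the per-category comparison yields zero for X and Y, while the measure that considers the closeness of A and B yields a non-zero value.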
This invention was made in view of this situation. Its object is to solve the above-mentioned problems of the prior search technologies and to provide a signal processing method and an image-voice processing apparatus for searches based on the similarity of segments at various levels in various video data.
The signal processing method according to the present invention, designed to attain the above object, extracts a signature defined by representative segments and a weighting function. The representative segments are sub-segments that represent the contents of a segment constituting a supplied signal and are chosen from among the sub-segments contained in that segment; the weighting function allocates a weight to each of these representative segments. The method includes a group selection step that selects the groups to be used for the signature from among the groups obtained by a classification based on an arbitrary attribute of the sub-segments, a representative segment selection step that selects a representative segment from each group selected in the group selection step, and a weight computing step that computes the weight for each representative segment obtained in the representative segment selection step.
The signal processing method related to the present invention extracts the signature related to the segment.
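The three steps named above (group selection, representative segment selection, weight computation) can be sketched as follows. The text at this point does not fix how groups are chosen, which member represents a group, or how weights are computed, so the choices below are assumptions made only for illustration: the largest groups are kept, the first member of each group serves as its representative, and each weight is the group's share of the retained sub-segments. The names `extract_signature`, `attribute`, and `max_groups` are hypothetical.

```python
from collections import defaultdict


def extract_signature(sub_segments, attribute, max_groups=2):
    """Return a list of (representative_segment, weight) pairs.

    `attribute` maps a sub-segment to the value used to classify it.
    """
    # Classification: group sub-segments by an arbitrary attribute.
    groups = defaultdict(list)
    for sub_segment in sub_segments:
        groups[attribute(sub_segment)].append(sub_segment)

    # Group selection step (assumption: keep the largest groups).
    selected = sorted(groups.values(), key=len, reverse=True)[:max_groups]
    total = sum(len(group) for group in selected)

    signature = []
    for group in selected:
        # Representative segment selection step (assumption: first member).
        representative = group[0]
        # Weight computing step (assumption: relative group size).
        weight = len(group) / total
        signature.append((representative, weight))
    return signature


# Toy sub-segments labelled by a one-letter class prefix.
shots = ["a1", "a2", "a3", "b1", "c1", "c2"]
signature = extract_signature(shots, attribute=lambda s: s[0], max_groups=2)
```

In this toy run the signature consists of one representative from class "a" and one from class "c", with weights reflecting how many sub-segments each class contributed.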
The image-voice processing apparatus according to the present invention, designed to attain the above object, extracts a signature defined by representative segments and a weighting function. The representative segments are image and/or voice sub-segments that represent the contents of an image and/or voice segment constituting a supplied video signal and are chosen from among the image and/or voice sub-segments contained in that segment; the weighting function allocates a weight to each of these representative segments. The apparatus includes an execution means that selects the groups to be used for the signature from among the groups obtained by a classification based on an arbitrary attribute of the image and/or voice sub-segments, selects a representative segment from each of the selected groups, and computes a weight for each representative segment obtained thereby.
The image-voice processing apparatus related to this invention thus configured extracts signatures relating to the image and/or voice segment.