1. Field of the Invention
The present invention relates to a scheme for detecting captions in video data, and more particularly, to a scheme for detecting captions in coded video data as well as a video retrieval, a video content indication display, and a video display based on the coded video caption detection.
2. Description of the Background Art
As a method for extracting information indicative of the video content from the video, for the purpose of carrying out processing based on the video content such as video retrieval or video editing, a method for extracting caption regions from the video has been known. Here, the captions generally include texts, photographs, symbols, patterns, markings, icons, etc., which are made to appear in the video by using a technique such as superimposition, and the caption region is a pixel or a set of pixels which contains such a caption.
The conventionally known methods for automatically extracting caption regions from the video include a method which utilizes the property that the caption region has a relatively high intensity compared with the background region so that its edge can be easily detected (see R. Lienhart et al.: "Automatic text recognition in digital videos", Image and Video Processing IV, Proc. SPIE 2660-20, January 1996, for example), and a method which utilizes the fact that the caption region has large intensity differences at its periphery (see M. A. Smith et al.: "Video Skimming for Quick Browsing based on Audio and Image Characterization", Carnegie Mellon University, Technical Report CMU-CS-95-186, July 1995, for example).
In Lienhart et al., the frame image is segmented by the split and merge algorithm, and a caption region is detected according to a size of a region and its motion between frames. In this method, the segmentation utilizes the fact that the caption has a uniform pixel value so that the caption and the background are effectively separated according to a difference in intensities.
In Smith et al., a caption region is detected by obtaining and smoothing an edge of the image. This method utilizes the fact that the caption has a relatively high contrast compared with the background so that the edge of the caption becomes sharp.
As a modification of the latter type of the conventionally known method, there is also a proposal for improving the precision of the caption extraction by averaging several frames that contain the caption so as to emphasize the caption while reducing an influence of background fluctuations.
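The frame averaging described above can be illustrated by the following minimal sketch. It assumes grayscale frames held as nested lists of equal size; the function name `average_frames` and the sample pixel values are hypothetical and are not taken from the cited references.

```python
def average_frames(frames):
    """Return the pixel-wise mean of equally sized grayscale frames.

    Each frame is a list of rows, and each row is a list of intensities.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    height = len(frames[0])
    width = len(frames[0][0])
    count = len(frames)
    return [
        [sum(frame[y][x] for frame in frames) / count for x in range(width)]
        for y in range(height)
    ]


# Hypothetical 1x3 frames: the middle pixel belongs to a static caption (255),
# while the outer pixels are background that fluctuates from frame to frame.
frames = [
    [[10, 255, 200]],
    [[200, 255, 10]],
    [[90, 255, 90]],
]
averaged = average_frames(frames)
```

In this example the caption pixel keeps its full intensity after averaging, while the fluctuating background pixels settle toward a common intermediate value, which is the effect that the averaging is intended to produce.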
Now, in order to extract caption regions from the coded video which is coded by utilizing the inter-frame correlation, if any of the conventionally known methods as described above is to be used, it would be necessary to decode the coded video completely once so as to restore the original frame images, and then carry out the extraction processing as described above with respect to the restored original frame images. However, this provision requires the image decoding processing in addition to the caption region extraction processing, so that the processing cost would be high and the high speed caption region extraction would be difficult.
In addition, in a case of applying the above described method for averaging a plurality of frames to the coded video, it is necessary to carry out the averaging after a plurality of frame images are all decoded, so that the processing cost would be even higher.
Now, the conventional methods for detecting captions from the video have been based on local characteristics obtained from one to several frame images.
For instance, there is a conventional method which utilizes the fact that the caption region has large intensity differences on its edge, in which the caption is detected by finding a frame in which the caption appears, and taking differences of intensity and color with respect to frames before and after the caption appearance.
Also, there is a conventional method which utilizes the property that the caption region has a relatively high intensity compared with the background region so that its edge can be easily detected, in which the caption is detected by using the edge detection based on the first order derivative of the image and the projections of the edge image into vertical and horizontal directions.
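The edge detection and projection described above can be sketched as follows. This is only an illustration under stated assumptions: the first order derivative is approximated by a horizontal difference between adjacent pixels, and the function names, the threshold value, and the sample image are all hypothetical.

```python
def horizontal_gradient(image):
    """Approximate the first order derivative by adjacent-pixel differences."""
    return [
        [abs(row[x + 1] - row[x]) for x in range(len(row) - 1)]
        for row in image
    ]


def projections(edge, threshold):
    """Count strong edge pixels per row and per column of the edge image."""
    binary = [[1 if value >= threshold else 0 for value in row] for row in edge]
    row_profile = [sum(row) for row in binary]
    col_profile = [sum(col) for col in zip(*binary)]
    return row_profile, col_profile


def bounding_box(row_profile, col_profile):
    """Return (top, bottom, left, right) of the edge-bearing rectangle, or None."""
    rows = [i for i, count in enumerate(row_profile) if count > 0]
    cols = [i for i, count in enumerate(col_profile) if count > 0]
    if not rows or not cols:
        return None
    return (rows[0], rows[-1], cols[0], cols[-1])


# Hypothetical 4x6 image with two bright vertical strokes in rows 1 and 2.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 255, 0, 255, 0],
    [0, 0, 255, 0, 255, 0],
    [0, 0, 0, 0, 0, 0],
]
edge = horizontal_gradient(image)
row_profile, col_profile = projections(edge, threshold=128)
box = bounding_box(row_profile, col_profile)
```

The returned rectangle is expressed in the coordinates of the edge image, which is one column narrower than the input image because of the difference operation.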
Also, there is a conventional method which utilizes the fact that the caption is stationary and has a high intensity, in which a text portion is detected by obtaining a portion which has no motion between two frames and an intensity greater than or equal to a prescribed value (see Japanese Patent Application No. 8-331456 (1996)).
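The stationary and high intensity criterion described above can be sketched per pixel as follows. The function name, the motion tolerance, and the intensity threshold are hypothetical values chosen for illustration, and are not taken from the cited application.

```python
def stationary_bright_mask(prev_frame, curr_frame, motion_tol, min_intensity):
    """Mark pixels that barely change between two frames and are bright.

    A pixel is marked 1 when its intensity difference between the frames is
    at most motion_tol and its current intensity is at least min_intensity.
    """
    return [
        [
            1 if abs(curr - prev) <= motion_tol and curr >= min_intensity else 0
            for prev, curr in zip(prev_row, curr_row)
        ]
        for prev_row, curr_row in zip(prev_frame, curr_frame)
    ]


# Hypothetical 1x3 frames: pixel 0 is a bright static caption pixel, pixel 1
# is static but dark, and pixel 2 is bright in the first frame but changes.
prev_frame = [[250, 40, 250]]
curr_frame = [[250, 42, 90]]
mask = stationary_bright_mask(prev_frame, curr_frame, motion_tol=5, min_intensity=200)
```

Because the decision uses only two frames, any bright stationary object satisfies the same test, so such an object would be marked in exactly the same way as a caption pixel.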
As such, the conventional methods for detecting captions from the video utilize temporally localized information such as one or two frame images. For this reason, these conventional methods have been associated with a problem that an imaged object other than the caption which has characteristics similar to those of the caption, such as being stationary, having a high intensity, and having large high frequency components, could be erroneously detected as the caption.
On the other hand, there has also been a problem that a caption which appears in the video for a long period of time would not be correctly detected as the caption when there is a temporal movement or a contour blurring due to an influence of image degradation, noises, etc. As a consequence, there have been cases in which a single continuous caption is erroneously detected multiple times as different captions over a plurality of time sections.
In other words, the conventional methods judge the existence of the caption according to a certain short time section, so that it is difficult to avoid an erroneous detection of an imaged object other than the caption or an erroneous overlooking of the caption due to noises. Consequently, when any of the conventional methods is used for the purpose of obtaining a list of captions from the video, there are cases in which an imaged object other than the caption is erroneously displayed or a single caption is displayed redundantly more than once.
Now, in conjunction with increasing activities in video distributions such as the television broadcasting, the digital satellite broadcasting, the laser disks, the digital video disks, and the video-on-demand, etc., there are increasing demands for flexible handling of video data. To this end, there have been proposals of techniques which attach various kinds of contents or index information to the video so as to enable the retrieval of and/or the random access to the video. As information which characterizes the video, the captions which generally include texts, photographs, symbols, patterns, markings, icons, etc., are important as they reflect the meanings or the contents of the video. For this reason, there have been proposals of a method for automatically detecting captions from the video.
For example, there is a conventional method disclosed in Japanese Patent Application No. 8-331456 (1996) mentioned above, which utilizes the fact that the caption is stationary and has a high intensity, in which a text portion is detected by obtaining a portion which has no motion between two frames and an intensity greater than or equal to a prescribed value.
Also, there is a conventional method which utilizes the property that the caption has a sharp edge and a high intensity, in which a text portion is detected by obtaining a block for which both the edge sharpness and the intensity of the frame image are greater than prescribed thresholds (see Japanese Patent Application No. 8-212231 (1996)).
As such, the conventional methods for detecting captions from the video detect the caption by utilizing the property of the caption itself such as its edge sharpness or its intensity, so that there has been a problem that an ability for detecting a switching point between captions has been low.
For instance, in Japanese Patent Application No. 8-212231 mentioned above, the frame image is segmented into blocks and the text region data corresponding to the blocks are provided. In the text region data, a value "1" is stored for each block at which the caption exists while a value "0" is stored for each block at which the caption does not exist. Then, a number of blocks with different values in the text region data between two frame images is counted, and when this counted number exceeds a prescribed value, it is judged that a caption is switched to another caption.
However, in this conventional method, no change appears in the text region data when the captions are switched without a break and without a change in their areas, so that it is still impossible to detect a switching point between the captions in such a case.
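The block counting described above, and the case in which it fails, can be illustrated by the following minimal sketch. The function names, the block maps, and the threshold are hypothetical; only the "1"/"0" text region data and the counting of differing blocks follow the description.

```python
def count_changed_blocks(prev_map, curr_map):
    """Count blocks whose text region value differs between two frames."""
    return sum(
        prev != curr
        for prev_row, curr_row in zip(prev_map, curr_map)
        for prev, curr in zip(prev_row, curr_row)
    )


def caption_switched(prev_map, curr_map, threshold):
    """Judge a caption switch when the changed block count exceeds threshold."""
    return count_changed_blocks(prev_map, curr_map) > threshold


# Hypothetical 2x4 text region data: "1" marks a block containing a caption.
bottom_caption = [[0, 0, 0, 0], [1, 1, 1, 0]]
top_caption = [[1, 1, 0, 0], [0, 0, 0, 0]]

# A caption that moves to another area changes five blocks and is detected.
moved = caption_switched(bottom_caption, top_caption, threshold=2)

# A different caption occupying exactly the same blocks yields identical
# text region data, so the switch is missed.
same_area = caption_switched(bottom_caption, bottom_caption, threshold=2)
```

The second case shows why the text region data alone cannot reveal a switch: the data records only where a caption exists, not what the caption contains.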
Now, there are various video retrieval methods based on the video content for the purpose of detecting a desired video portion from a huge amount of video data, and among them, a method which utilizes the caption contained in the video as a retrieval key has been attracting much attention because the caption is usually formed by characters and symbols which have clear meanings, and there are typical appearance patterns for a position of a caption, so that the caption can reflect the video content quite well.
In the conventional video retrieval method, the desired video portion is retrieved by extracting an image of the caption region from the video, recognizing the characters contained in the caption, and comparing the recognized character information with the retrieval key. In this conventional video retrieval method, the edge extraction based on the first order derivative of the image is carried out, the edge image is projected into vertical and horizontal directions, and a rectangular region in which the caption exists is extracted. Then, the character recognition is carried out by using the feature vector classification techniques.
However, in the conventional video retrieval method described above, it has been impossible to realize the video retrieval based on a position of appearance of the caption. In addition, for the purpose of interpreting the caption by utilizing the character recognition, it has been necessary to carry out a high cost character recognition processing. Moreover, the character recognition rate has not been very high so that the retrieval efficiency has not been very good. Furthermore, the character recognition target image is required to have a high quality so that a high processing cost is also required for extracting the image of the caption region at high quality.
Now, there is a conventionally known system for generating and displaying video content indications, which uses video content indications based on shot boundaries in the video. For example, Japanese Patent Application Laid Open No. 4-237284 (1992) discloses a system in which the shot boundaries in the video are detected by using the inter-frame correlation and utilized as the video content indications. Moreover, in this conventional system, the video is segmented into short sections called shots according to the detected shot boundaries, and a representative frame image of each shot is displayed as a video content indication display.
However, this conventional video content indication display system has been associated with a problem that the generated video content indication is in an excessively fine granularity so that the video is cut into pieces too minutely because the video is handled according to the shot boundaries.
On the other hand, M. Mills et al.: "A Magnifier Tool for Video Data", Proceedings of CHI '92, ACM, pp. 93-98, May 1992, disclose a method in which images obtained by sampling the video at constant time intervals hierarchically according to the temporal resolution of the outlines are displayed in parallel on a video display. In this method, the coarsely sampled images are displayed first, and a specified section is displayed with more finely sampled images in a case of viewing a particular section in further detail.
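The coarse-to-fine sampling described above can be sketched as follows. This is only an illustration under stated assumptions and not the implementation of Mills et al.: the function name `sample_indices`, the total of 1000 frames, and the choice of five samples per level are hypothetical.

```python
def sample_indices(start, end, count):
    """Return count frame indices at a constant interval over [start, end)."""
    if count <= 0:
        raise ValueError("count must be positive")
    step = (end - start) / count
    return [start + int(i * step) for i in range(count)]


# Coarse level: five frames sampled across a hypothetical 1000-frame video.
coarse = sample_indices(0, 1000, 5)

# Finer level: the section between the second and third coarse samples is
# sampled again with the same number of frames at a shorter interval.
fine = sample_indices(coarse[1], coarse[2], 5)
```

The sampling positions depend only on time, so nothing in this procedure takes the video content into account when a section is summarized by its sampled frames.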
However, this conventional video content indication display method has been associated with a problem that, when a plurality of shots are integrated into a coarse video section, there is no guarantee that the integrated coarse video section actually reflects the video content well.
Now, in displaying or editing the video by reusing the already used video which contains captions, there can be a case in which the original captions are no longer desirable as their contents are not suitable for a newly intended use of that video. In such a case, the reusability of the video can be increased by displaying the video while obscuring the captions contained in the video.
The conventionally available methods for obscuring a part of the video include various video processing methods such as video tessellation, smoothing, pixel interchanges, noise application, etc. In these video processing methods, the video processing is carried out by specifying a portion to be obscured. Consequently, in order to display the video by obscuring the captions, it is necessary to carry out the video processing by specifying the caption regions.
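As an illustration of one of the video processing methods listed above, the following minimal sketch applies tessellation (a mosaic) to a specified rectangular portion of a grayscale frame. The function name, the cell size, and the sample frame are hypothetical, and the rectangle to be obscured must be supplied by the caller.

```python
def tessellate_region(image, top, left, bottom, right, cell):
    """Replace each cell-by-cell tile inside the rectangle with its mean value.

    The rectangle covers rows top..bottom-1 and columns left..right-1.
    Pixels outside the rectangle are copied unchanged.
    """
    out = [row[:] for row in image]
    for y0 in range(top, bottom, cell):
        for x0 in range(left, right, cell):
            ys = range(y0, min(y0 + cell, bottom))
            xs = range(x0, min(x0 + cell, right))
            total = sum(image[y][x] for y in ys for x in xs)
            mean = total // (len(ys) * len(xs))
            for y in ys:
                for x in xs:
                    out[y][x] = mean
    return out


# Hypothetical 2x4 frame: the left 2x2 portion is specified as the portion
# to be obscured, and the right half must remain untouched.
frame = [
    [0, 100, 7, 9],
    [200, 100, 7, 9],
]
obscured = tessellate_region(frame, top=0, left=0, bottom=2, right=2, cell=2)
```

The processing itself is simple, but the rectangle coordinates are an input, which corresponds to the need to specify the portion to be obscured.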
However, in order to display the video while obscuring the captions by using any of these conventionally known methods for obscuring a part of the video, it is necessary for human workers to manually specify the caption regions to be obscured one by one, and for this reason, the work required for the purpose of increasing the reusability of the video by obscuring the captions becomes quite tedious and it is difficult to carry out such work at high speed.