1. Field of the Invention
The present invention generally relates to a multimedia browsing system, and more particularly, to a method for summarizing a news video stream using a synthetic key frame.
2. Description of the Related Art
The development of digital video and of image, video and audio recognition techniques allows users to search, filter and browse desired portions of a video stream at any desired time.
The most basic techniques for non-linear video content browsing and searching are shot segmentation and shot clustering, both of which are critical for structurally analyzing multimedia contents.
FIG. 1 illustrates an example of structural information of a video stream.
Referring to FIG. 1, a video stream, which is temporally continuous, carries structural information. In general, a video stream has a hierarchical structure regardless of genre. The video stream is divided into several scenes as logical units, and each scene is composed of a number of sub-scenes or shots. A sub-scene is itself a scene and therefore has all the attributes of a scene. A shot is a sequence of video frames taken by one camera without interruption.
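By way of illustration only, the hierarchy described above may be sketched as a small recursive data structure. The class names `Shot` and `Scene`, the frame-number fields and the example frame ranges are hypothetical and are not taken from FIG. 1; the sketch merely shows that a scene may hold sub-scenes or shots and that the shots can be recovered by walking the tree.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Shot:
    """A run of frames taken by one camera without interruption."""
    start_frame: int  # inclusive (hypothetical field)
    end_frame: int    # inclusive (hypothetical field)


@dataclass
class Scene:
    """A logical unit whose children are sub-scenes (themselves scenes) or shots."""
    label: str
    children: List[Union["Scene", Shot]] = field(default_factory=list)

    def shots(self) -> List[Shot]:
        """Collect every shot under this scene, depth first, in stored order."""
        found: List[Shot] = []
        for child in self.children:
            if isinstance(child, Shot):
                found.append(child)
            else:
                found.extend(child.shots())
        return found


# Made-up example: a stream with two scenes, one of which has a sub-scene.
video = Scene("video", [
    Scene("scene 1", [Shot(0, 120), Shot(121, 300)]),
    Scene("scene 2", [
        Scene("sub-scene 2a", [Shot(301, 450)]),
        Shot(451, 600),
    ]),
])
shot_spans = [(s.start_frame, s.end_frame) for s in video.shots()]
```

Because a sub-scene is represented by the same class as a scene, any operation defined for a scene applies to a sub-scene without change.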
Most multimedia indexing systems extract the shots from the video stream and then, using other information derived from the extracted shots, detect the scenes as logical units in order to index the structural information of the multimedia stream.
As described above, the shots are the most basic units for analyzing or constructing the video. In general, the scene is a meaningful component existing in the video stream as well as a meaningful discriminating element in story development or construction of the video stream. One scene may include several shots in general.
Conventional video indexing techniques structurally analyze the video stream to detect the shots and scenes as unit segments and extract key frames based upon the shots and scenes. The key frames represent the shots and scenes, and are utilized as material for summarizing the video stream or as a means of moving to desired positions.
Research is also in progress on extracting principal text regions, news icons, human face regions and the like that express meaningful information in the video stream, for efficient video searching and browsing. Various methods have been introduced for synthesizing such key regions to generate new key frames. Synthetic key frame generation is a technique for synthesizing the contents of the video stream in logical or physical units by using the key regions extracted from the scene or shot units. Using the synthetic key frame, a large amount of information can be expressed in a small display space. A user can readily understand specific portions of the contents and selectively watch the portions the user wants.
An application utilizing the synthetic key frame of the video text can readily operate in any system having a browsing interface for video searching and for summarization of a specific range of the video stream.
Most video indexing systems extract key frames to represent the scenes and shots as the structural components of the video stream, and use them for searching or browsing. In order to carry out this process efficiently, a method of extracting a synthetic key frame has been presented.
FIG. 2 shows a concept of synthetic key frame generation.
Referring to FIG. 2, key frames are detected from scenes as logical units or shots as physical units in a video stream, and the detected key frames are then logically or physically synthesized to provide a user with synthetic key frames. Using the synthetic key frames, the user can readily understand the video contents and rapidly access desired positions.
Meanwhile, principal text regions expressing meaningful information in the video stream can be extracted for efficient video searching and browsing. This technique extracts a minimum block range (MBR) of the text displayed in a video image so that the user can readily understand and index the contents of the video. In addition, remote information searching can be executed over a network based upon flexible information searching and the indexed information. To describe the text extraction method in detail, candidate regions are first extracted based upon two properties: the edges of text are concentrated in the horizontal and vertical edge histograms, and the edge histogram varies repeatedly in magnitude as the spacing between characters varies. From the candidate regions, a region is extracted as a text region if it has an aspect ratio consistent with that of text, a small amount of motion, and a color whose brightness differs greatly from that of the background.
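By way of illustration only, the final selection step described above may be sketched as a filter over candidate regions. The edge-histogram stage that produces the candidates is not shown. The `Candidate` fields, the way motion and brightness are measured, and the three threshold values are all assumptions made for this sketch; the related art does not specify them.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    width: int                    # pixels
    height: int                   # pixels
    motion: float                 # mean displacement between frames, in pixels (assumed measure)
    text_brightness: float        # 0..255 (assumed measure)
    background_brightness: float  # 0..255 (assumed measure)


def is_text_region(c, min_aspect=2.0, max_motion=1.0, min_contrast=80.0):
    """Apply the three criteria: aspect ratio, small motion, strong brightness contrast.

    All thresholds are hypothetical placeholders.
    """
    if c.height <= 0:
        return False
    aspect = c.width / c.height
    contrast = abs(c.text_brightness - c.background_brightness)
    return aspect >= min_aspect and c.motion <= max_motion and contrast >= min_contrast


# Made-up candidates: a static caption, a moving blob, and a low-contrast strip.
caption = Candidate(300, 40, 0.2, 240.0, 30.0)
moving_blob = Candidate(120, 100, 6.0, 200.0, 60.0)
faint_strip = Candidate(300, 40, 0.2, 120.0, 100.0)

text_regions = [c for c in (caption, moving_blob, faint_strip) if is_text_region(c)]
```

Only the static, wide, high-contrast caption passes all three tests in this example.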
In general, among multimedia contents, a news video stream is formalized, structured video data and corresponds to a formalized model having a spatial/temporal structure. In other words, unlike general multimedia streams, one news video stream is composed of several articles, and each article is composed of a summary section, in which a news anchor explains the article, and an episode section supporting the contents of the article.
That is, one news video stream includes several articles, and one article includes the summary section explained by the news anchor, i.e., an anchor shot, and a content screen supporting the contents of the article, i.e., an episode shot. In terms of contents, general news covers articles about politics, the economy, social matters, sports, weather and the like. Further, the news video has a formalized structure unlike video contents of other genres, and each viewer is interested in articles that clearly differ from those of other viewers. In practice, the viewer or user generally wants to rapidly search for a desired news article only.
To meet, from the standpoint of video indexing, the user's demand to rapidly search for only the desired news article, various studies are under way to index the news video stream in units of articles by using structural/semantic information of the news video.
For example, a method has been proposed for generating a synthetic key frame representing an article, in which importance measures are calculated for a plurality of text regions extracted from a video stream, and the synthetic key frame is generated using the text regions whose importance measures are at least a certain value.
As shown in FIG. 3, the synthetic key frame is generated by extracting text regions, which are frequently used as elements for comprehensively delivering the video contents; determining weights using information such as the size of each text region, the mean text size in the text region, the display duration of the text and the like; calculating importance measures for the text regions based upon the determined weights; and synthesizing the text regions whose importance measures are at least a certain value. Search and browsing of the video contents based upon text can therefore be implemented by providing the synthetic key frame to the user. Also, the text-based synthetic key frame using the importance measures has the advantage of aiding the user's understanding and comprehensively delivering the video contents, because it summarizes the key contents of the video using the text of high importance.
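By way of illustration only, the importance calculation described above may be sketched as follows. The related art states only that weights are determined from the region size, the mean text size and the display duration; the weighted-sum form, the normalization of each feature to the range 0 to 1, the weight values and the threshold are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class TextRegion:
    text: str
    area: float         # region size as a fraction of the frame area, 0..1 (assumed)
    char_height: float  # mean text size as a fraction of the frame height, 0..1 (assumed)
    duration: float     # display duration as a fraction of the article length, 0..1 (assumed)


def importance(r, w_area=0.3, w_char=0.3, w_duration=0.4):
    """Weighted sum of the three features; the weights are hypothetical."""
    return w_area * r.area + w_char * r.char_height + w_duration * r.duration


def select_regions(regions, threshold=0.2):
    """Keep regions whose importance is at least the threshold, highest first."""
    scored = [(importance(r), r) for r in regions]
    kept = [(s, r) for s, r in scored if s >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in kept]


# Made-up text regions from one news article.
regions = [
    TextRegion("ticker", 0.05, 0.03, 0.1),
    TextRegion("anchor name", 0.08, 0.05, 0.5),
    TextRegion("headline", 0.20, 0.10, 0.8),
]
selected = select_regions(regions)
```

The regions returned by `select_regions` are those that would be composed into the synthetic key frame; the composition step itself is outside the scope of this sketch.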
Therefore, various non-linear news video browsing techniques are under continuous development, in which an interface such as Table Of Contents (TOC) or a storyboard is incorporated into the conventional news video data indexed in the unit of article using the temporal structure of the news video.
However, it is very difficult to select a key frame capable of representing each article or scene of the news video. The simple storyboard-type summarizing method is very inefficient for summarizing a news article because it cannot efficiently deliver information about the scene, as the actual story unit, to the user. Accordingly, it has the disadvantage that the contents of the entire news are barely delivered to the user in a direct manner.
Further, if the synthetic key frame is generated via simple importance calculation as implemented in the related art, the characteristics of the genre and the semantic/structural information of the video contents are hardly utilized, so that text regions carrying important meaning may occasionally be excluded in the process of calculation.