This invention generally relates to image processing and, more particularly, to a media segmentation system and related methods.
With recent improvements in processing, storage and networking technologies, many personal computing systems have the capacity to receive, process and render multimedia objects (e.g., audio, graphical and video content). For example, it is now possible to “stream” video content from a remote server over a network to an appropriately configured computing system for rendering on the computing system. Many of the rendering systems provide functionality akin to that of a typical video cassette player/recorder (VCR). However, with the increased computing power comes an increased expectation by consumers for even more advanced capabilities. A prime example of just such an expectation is the ability to rapidly access relevant (i.e., of particular interest to the user) media content. Prior art systems fail to meet this expectation.
To accommodate and access the vast amount of media, a variety of image database and visual information systems have become available recently. Such systems have been used in a wide variety of applications, including medical image management, CAD/CAM systems, criminal identification systems, clip-art galleries and the like. Prior art systems may employ any of a number of search techniques to access and retrieve relevant information. By and large, such prior art systems utilize a text-based, keyword approach for indexing and retrieving such media content. In accordance with such an approach, each frame, shot or scene (each comprised of one or more of the former) is stored as a database object, wherein each image (e.g., frame, shot, scene) in the database is associated with a manually generated text description of that object. These keyword descriptors may then be searched by standard Boolean queries, where the retrieval is based on either exact or probabilistic matches of the query text.
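The text-based, keyword approach described above can be illustrated with a minimal sketch. It assumes each database object (frame, shot or scene) carries a manually written description and that queries are exact-match Boolean AND/OR over whitespace-separated keywords. The names `build_index`, `boolean_and` and `boolean_or`, and the sample object identifiers, are hypothetical and are not drawn from any particular prior art system.

```python
from typing import Dict, Iterable, Set


def build_index(descriptions: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each keyword to the set of object ids whose description contains it."""
    index: Dict[str, Set[str]] = {}
    for object_id, text in descriptions.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(object_id)
    return index


def boolean_and(index: Dict[str, Set[str]], terms: Iterable[str]) -> Set[str]:
    """Return the objects whose descriptions contain every query term."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()


def boolean_or(index: Dict[str, Set[str]], terms: Iterable[str]) -> Set[str]:
    """Return the objects whose descriptions contain at least one query term."""
    return set().union(*(index.get(term.lower(), set()) for term in terms))


# Hypothetical, manually generated descriptions (assumed data for illustration).
descriptions = {
    "frame_1": "red car street",
    "shot_2": "red sunset beach",
    "scene_3": "car chase street",
}
index = build_index(descriptions)
matches_all = boolean_and(index, ["red", "car"])
matches_any = boolean_or(index, ["sunset", "chase"])
```

In such a scheme, retrieval can be no better than the manually entered descriptors: an object is found only if its annotator happened to use the query terms.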
While such prior art systems have served to whet the appetite for such technology, none of the prior art systems facilitates true content-based media searching and, thus, they fail to fully address the need to accurately access and retrieve specific media content. There are several problems inherent in systems that are exclusively text-based. Automatic generation of the descriptive keywords, or extraction of the semantic information required to build classification hierarchies, is beyond the current capability of computer vision and intelligence technologies. Consequently, the text descriptions of such images must be manually generated. It is to be appreciated that the manual input of keyword descriptors is a tedious, time-consuming process prone to inaccuracies and descriptive limitations. Moreover, certain visual properties, such as textures and patterns, are often difficult, if not impossible, to adequately or accurately describe with a few textual descriptors, especially for general-purpose indexing and retrieval applications.
While other approaches have been discussed which attempt to qualitatively segment media based on content, all are computationally expensive and, as a result, are not appropriate for near real-time consumer application. These prior art approaches typically attempt to identify similar material between frames to detect shot boundaries. Those skilled in the art will appreciate that a shot boundary often denotes an editing point, e.g., a camera fade, and not a semantic boundary. Moreover, because of the computational complexities involved, such shots are often defined as a static, or fixed number of frames preceding or succeeding an edit point (e.g., three frames prior, and three frames subsequent). In this regard, such prior art systems typically utilize a fixed window of frames to define a shot.
In contrast, scenes are comprised of semantically similar shots and, thus, may contain a number of shot boundaries. Accordingly, the prior art approaches based on visual similarity of frames between two shots often do not produce good results, and what is needed is a quantitative measure of semantic correlation between shots to identify and segment scenes.
Thus, a media segmentation system and related methods are presented, unencumbered by the inherent limitations commonly associated with prior art systems.
This invention concerns a media segmentation system and related methods, facilitating the rapid access and retrieval of media content at a semantic level. According to an exemplary implementation of the present invention, a method is presented comprising receiving media content and analyzing one or more attributes of successive shots of the received media. The method further comprises generating, based at least in part on the analysis of the one or more attributes, a correlation score for each of the successive shots, wherein scene segmentation is performed to group semantically cohesive shots.
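By way of illustration only, the general flow of scoring successive shots and grouping them into scenes might be sketched as follows. The sketch assumes that the analyzed attribute of each shot is a normalized color histogram, that the correlation score is a histogram intersection between a shot and its predecessor, and that a fixed threshold decides whether a shot joins the current scene. None of these choices is specified in the summary above; the function names and the threshold value are hypothetical.

```python
from typing import List, Sequence


def histogram_intersection(a: Sequence[float], b: Sequence[float]) -> float:
    """Similarity in [0, 1] between two normalized histograms (assumed attribute)."""
    return sum(min(x, y) for x, y in zip(a, b))


def correlation_scores(shots: List[Sequence[float]]) -> List[float]:
    """Score each shot against its predecessor; yields len(shots) - 1 scores."""
    return [
        histogram_intersection(shots[i - 1], shots[i])
        for i in range(1, len(shots))
    ]


def segment_scenes(
    shots: List[Sequence[float]], threshold: float = 0.5
) -> List[List[int]]:
    """Group shot indices into scenes; a low score starts a new scene.

    The threshold of 0.5 is an arbitrary assumption for this sketch.
    """
    if not shots:
        return []
    scenes: List[List[int]] = [[0]]
    for index, score in enumerate(correlation_scores(shots), start=1):
        if score >= threshold:
            scenes[-1].append(index)  # correlated with previous shot: same scene
        else:
            scenes.append([index])  # weak correlation: begin a new scene
    return scenes


# Hypothetical three-bin histograms for four successive shots.
shots = [
    [0.5, 0.5, 0.0],
    [0.4, 0.6, 0.0],
    [0.0, 0.1, 0.9],
    [0.0, 0.2, 0.8],
]
scores = correlation_scores(shots)
scenes = segment_scenes(shots)
```

With this sample data the first two shots and the last two shots fall into separate scenes, since only the middle pair of successive shots correlates weakly. Comparing each shot only to its immediate predecessor is the simplest possible choice and is used here purely to keep the sketch short.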