The present invention relates to a technique for detecting and indexing semantically significant events in a video sequence.
There has been a dramatic increase in the quantity of video available to the public. This trend is expected to continue or accelerate in the future with the convergence of personal computers and digital television. To improve the value of this information to users, tools are needed to help a user navigate through the available video information and find content that is relevant. For "consumer" users, such tools should be easy to understand, easy to use, and should provide reliable and predictable behavior.
Generally, there are three categories of known content-based video indexing and retrieval systems. A first category includes methods directed to the syntactic structure of video. This category includes methods of shot boundary detection, key frame extraction, shot clustering, table of contents creation, video summarizing, and video skimming. These methods are generally computationally conservative and produce relatively reliable results. However, the results may not be semantically relevant since the methods do not attempt to model or estimate the meaning of the video content. As a result, searching or browsing video may be frustrating for users seeking video content of particular interest.
A second category of video indexing and retrieval systems attempts to classify video sequences into categories such as news, sports, action movies, close-ups, or crowds. These classifications may facilitate browsing video sequences at a coarse level but are of limited usefulness in aiding the user to find content of interest. Users often express the object of their searches in terms of labels with more exact meanings, such as keywords describing objects, actions, or events. Video content analysis at a finer level than is available with most classification systems is desirable to more effectively aid users in finding content of interest.
The third category of techniques for analyzing video content applies rules relating the content to features of a specific video domain or content subject area. For example, methods have been proposed to detect events in football games, soccer games, baseball games and basketball games. The events detected by these methods are likely to be semantically relevant to users, but these methods are heavily dependent on the specific artifacts related to the particular domain, such as editing patterns in broadcast programs. This makes it difficult to extend these methods to more general analysis of video from a broad variety of domains.
What is desired, therefore, is a method of video content analysis which is adaptable to reliably detect semantically significant events in video from a wide range of content domains.
The present invention overcomes the aforementioned drawbacks of the prior art by providing a method of detecting an event in a video comprising the steps of analyzing the content of the video; summarizing the analysis; and inferring an event from the summary. The video event detection process is, thus, decomposed into three modular levels. Visual analysis of the video content, including shot detection, texture and color analysis, and object detection, occurs at the lowest level of the technique. At the second level, each shot is summarized based on the results produced by the visual analysis. At the highest level of the technique, events are inferred from spatial and temporal phenomena disclosed in the shot summaries. As a result, the technique of the present invention detects events which are meaningful to a video user. The technique may also be extended to a broad spectrum of video domains by incorporating shot summarization and event inference modules that are relatively specific to the domain or subject area of the video, and which operate on data generated by visual analysis processes that are not domain specific.
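The three-level decomposition described above can be illustrated with a minimal sketch. It is not the implementation of the invention: every name in it (`Shot`, `ShotSummary`, `Event`, `analyze`, `summarize_by_brightness`, `infer_transitions`, `detect_events`) is hypothetical, each frame is reduced to a single number standing in for real visual features, and the "dark to bright" event is an invented toy rule. The only point it is meant to show is that the lowest level is domain independent, while the summarization and inference modules are passed in and can be swapped per domain.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Shot:
    start: int         # index of the first frame (inclusive)
    end: int           # index one past the last frame
    mean_level: float  # toy stand-in for texture/color/object features


@dataclass
class ShotSummary:
    shot: Shot
    label: str         # domain-specific description of the shot


@dataclass
class Event:
    name: str
    start: int
    end: int


def analyze(frames: List[float], cut_threshold: float = 0.5) -> List[Shot]:
    """Level 1 (domain independent): cut a new shot wherever the
    frame-to-frame change exceeds cut_threshold (an assumed value)."""
    shots: List[Shot] = []
    start = 0
    for i in range(1, len(frames) + 1):
        at_end = i == len(frames)
        if at_end or abs(frames[i] - frames[i - 1]) > cut_threshold:
            segment = frames[start:i]
            shots.append(Shot(start, i, sum(segment) / len(segment)))
            start = i
    return shots


def summarize_by_brightness(shot: Shot) -> ShotSummary:
    """Level 2 (domain specific, hypothetical rule): label each shot."""
    return ShotSummary(shot, "bright" if shot.mean_level >= 0.5 else "dark")


def infer_transitions(summaries: List[ShotSummary]) -> List[Event]:
    """Level 3 (domain specific, hypothetical rule): report an event
    when a dark shot is immediately followed by a bright shot."""
    events: List[Event] = []
    for prev, cur in zip(summaries, summaries[1:]):
        if prev.label == "dark" and cur.label == "bright":
            events.append(Event("dark_to_bright", prev.shot.start, cur.shot.end))
    return events


def detect_events(
    frames: List[float],
    summarize: Callable[[Shot], ShotSummary],
    infer: Callable[[List[ShotSummary]], List[Event]],
) -> List[Event]:
    """Run the three levels in order; only the last two are swappable."""
    shots = analyze(frames)
    summaries = [summarize(s) for s in shots]
    return infer(summaries)


if __name__ == "__main__":
    toy_frames = [0.1, 0.1, 0.9, 0.9, 0.9]
    print(detect_events(toy_frames, summarize_by_brightness, infer_transitions))
```

Under these assumptions, adapting the sketch to another domain would mean supplying different `summarize` and `infer` callables to `detect_events` while leaving `analyze` unchanged.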