With recent improvements in processing, storage and networking technologies, many personal computing systems have the capacity to receive, process and render multimedia objects (e.g., audio, graphical and video content). For example, it is now possible to “stream” media content from a remote server over a data network to an appropriately configured computing system for rendering on that system. Many such rendering systems provide functionality akin to that of a typical video cassette player/recorder (VCR). However, with the increased computing power comes an increased expectation by consumers for even more advanced capabilities. A prime example of such an expectation is the ability to rapidly identify, store and access relevant media content (i.e., content of particular interest to the user). Conventional media processing systems fail to meet this expectation.
In order to store and/or access a vast amount of media efficiently, the media must be parsed into uniquely identifiable segments of content. Many systems attempt to parse video content, for example, into shots. A shot is defined as an uninterrupted temporal segment in a video sequence, and often serves as the low-level syntactical building block of video content. Shots, in turn, are composed of a number of frames (e.g., at 24 frames per second). In parsing the video into shots, conventional media processing systems attempt to identify shot boundaries by analyzing consecutive frames for deviations in content from one another. A common approach to distinguishing content involves the use of color histogram based segmentation, that is, generating a color histogram for each of a number of consecutive frames and analyzing the histogram difference of consecutive frames to detect a significant deviation. A deviation at a single frame that exceeds a deviation threshold is taken to signal a shot boundary.
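By way of illustration only, the histogram-difference approach described above may be sketched as follows. The sketch makes several assumptions that are not drawn from any particular conventional system: frames are flat sequences of 8-bit grayscale intensities rather than full color images, the histogram uses 8 bins, the difference measure is the sum of absolute bin differences, and the threshold of 0.5 is arbitrary. The function names are hypothetical.

```python
from typing import List, Sequence


def histogram(frame: Sequence[int], bins: int = 8, max_value: int = 256) -> List[float]:
    """Normalized intensity histogram of one frame (assumed: flat 8-bit pixel values)."""
    counts = [0] * bins
    for value in frame:
        counts[value * bins // max_value] += 1
    total = len(frame)
    return [count / total for count in counts]


def histogram_difference(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Sum of absolute bin differences: 0.0 for identical histograms, 2.0 for disjoint ones."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def detect_boundaries(frames: Sequence[Sequence[int]], threshold: float = 0.5) -> List[int]:
    """Indices i at which the difference between frame i-1 and frame i exceeds the threshold."""
    hists = [histogram(frame) for frame in frames]
    return [
        i
        for i in range(1, len(hists))
        if histogram_difference(hists[i - 1], hists[i]) > threshold
    ]


# Synthetic example: three dark frames followed by two bright frames.
dark = [20] * 16
bright = [200] * 16
print(detect_boundaries([dark, dark, dark, bright, bright]))  # [3]
```

In this toy sequence the only large histogram difference falls between the third and fourth frames, so a single boundary is reported at index 3.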
While the use of color histograms may prove acceptable in certain limited circumstances, it is prone to false shot boundary detection in many applications. Take, for example, news footage. News footage often includes light flashes from camera flash bulbs, emergency vehicle lights, lightning from a storm, bright stage lights for the video camera equipment and the like. The result is that one shot of such news footage may include a number of light flashes (flashlight phenomena) which conventional shot boundary detection schemes mistake for shot boundaries. Other examples of media that include flashlight phenomena are action and science fiction movies, sporting events, and a host of other media for which conventional shot detection schemes are ill-suited.
The challenge of distinguishing flashlight phenomena from actual shot boundaries is not trivial. One limitation of conventional shot boundary detection schemes is that they assume a flashlight occurs across only a single frame. In the real world, not only can flashlights span multiple frames, they can also span a shot boundary.
Another limitation of such conventional shot boundary detection schemes is that of threshold selection, i.e., selecting the threshold of, for example, color histogram deviation that signals a shot boundary. Many conventional shot boundary detection schemes use global, pre-defined thresholds, or simple local window based adaptive thresholds. Global thresholds generally provide the worst performance because video properties often vary widely and, quite simply, one size (threshold) does not fit all. The local window based adaptive threshold selection method also has its limitations insofar as, in certain situations, the local statistics are polluted with strong noise such as, for example, loud noises and/or flashlight effects.
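The two threshold selection strategies discussed above may be contrasted with a minimal sketch. It operates on a precomputed sequence of frame-to-frame histogram differences. The specific adaptive rule (mean plus k standard deviations of the neighboring differences), the window size, the value of k, the sample data and the function names are all assumptions chosen for illustration, not a description of any particular conventional scheme.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def global_boundaries(diffs: Sequence[float], threshold: float) -> List[int]:
    """Flag every position whose difference exceeds one fixed, pre-defined threshold."""
    return [i for i, d in enumerate(diffs) if d > threshold]


def local_adaptive_boundaries(
    diffs: Sequence[float], half_window: int = 3, k: float = 3.0
) -> List[int]:
    """Flag positions exceeding mean + k * std of their neighbors in a local window."""
    diffs = list(diffs)
    boundaries = []
    for i, d in enumerate(diffs):
        lo = max(0, i - half_window)
        hi = min(len(diffs), i + half_window + 1)
        neighbors = diffs[lo:i] + diffs[i + 1:hi]  # exclude the candidate itself
        if neighbors and d > mean(neighbors) + k * pstdev(neighbors):
            boundaries.append(i)
    return boundaries


# Assumed data: a low-contrast cut at index 3 in otherwise quiet footage.
quiet = [0.02, 0.03, 0.02, 0.40, 0.02, 0.03, 0.02]
# Same cut, but a flash turns on at index 2 and off at index 4.
polluted = [0.02, 0.03, 0.90, 0.40, 0.90, 0.03, 0.02]

print(global_boundaries(quiet, 0.5))          # []      cut missed: threshold too high
print(local_adaptive_boundaries(quiet))       # [3]     cut found
print(global_boundaries(polluted, 0.5))       # [2, 4]  flash edges reported, cut missed
print(local_adaptive_boundaries(polluted))    # []      flash inflates the local statistics
```

In this toy data the fixed threshold suits neither sequence, while the adaptive rule finds the cut in the quiet sequence but misses it once the flash raises the mean and spread of the surrounding window.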
Thus, a method and apparatus for shot boundary detection is presented, unencumbered by the inherent limitations commonly associated with prior art systems.