Reliable detection and signaling of shot changes within a sequence of images, e.g., a video sequence, is a difficult but well-researched problem in the art. Shot change detection has found many applications in the field of video signal processing, including cadence detection, de-interlacing, format conversion, compression encoding, and video indexing and retrieval. Shot changes are easily identified by a human viewer. Such events include a changeover from an episodic television program to an advertising spot, or camera changes such as when a live news studio broadcast cuts from one camera angle to another on the same set.
Unfortunately, reliable shot change detection by machines has proven elusive due to the lack of a precise definition of what constitutes a machine-detectable “shot change” (temporal) or “shot boundary” (locality). As used herein, a machine-detectable “shot change” may be defined as a positive indication that a given “uninterrupted image sequence captured by a single camera” has changed to, or is changing to, another different “uninterrupted image sequence captured by a single camera.” Additionally, when there is a smooth transition from one shot to the next, such as during a fade transition, it is assumed that the indication occurs at the end of the shot change transition.
It is desirable to have an effective automated shot change detection method for successful operation of many video processing systems. For example, many motion-compensated video de-interlacing systems use some form of automated shot change detection to determine when to reset or zero-out their motion-compensation buffers. Automated shot change detection is also used as a preprocessing step, with human intervention correcting errors in the automated step, to provide video segmentation boundaries for later work that would otherwise be prohibitively time-consuming to produce without machine assistance. Shot change detection is easily, if slowly, provided by human analysis, but is, in most cases, extremely sensitive to context.
Shot changes take many forms, of which abrupt shot changes are of primary concern. Other forms of shot change include fade transitions that may span dozens of video frames, including fade-from-black, fade-to-black, and mixing fades from one shot to another. Another form of shot change is the “wipe”, wherein a new shot (i.e., a captured image or sequence of images in a video) is introduced as a superimposed overlay (or partial mix) of the new shot and an old shot, with the transition taking place over a number of frames. An example is a right-to-left wipe in which the new shot is introduced incrementally from right to left over the course of half of a second, e.g., over 15 video frames.
The performance of a shot change detection method is determined primarily by the requirements of the system and application(s) employing the system. These requirements are most succinctly expressed by a combination of precision and recall. As used herein, “precision” refers to an indication of a false positive ratio, i.e., the fraction of reported shot changes that are genuine (values approaching 100 percent refer to fewer erroneously positive shot change indications), while “recall” refers to an indication of a false negative ratio, i.e., the fraction of genuine shot changes that are reported (values approaching 100 percent refer to fewer missed shot changes).
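By way of illustration only, the two measures may be computed from a list of detected shot change positions and a list of reference (ground-truth) positions as in the following minimal sketch. The function name `precision_recall`, the frame-index representation, and the `tolerance` parameter are assumptions made for this example and are not taken from any particular prior art system.

```python
def precision_recall(detected, truth, tolerance=0):
    """Return (precision, recall) for detected shot-change frame indices.

    Hypothetical helper: `detected` and `truth` are iterables of integer
    frame indices; a detection counts as correct when it lies within
    `tolerance` frames of a not-yet-matched reference boundary.
    """
    detected = sorted(set(detected))
    unmatched = set(truth)
    true_pos = 0
    for d in detected:
        # Match each detection to at most one unmatched reference boundary.
        match = next((t for t in sorted(unmatched) if abs(t - d) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            true_pos += 1
    false_pos = len(detected) - true_pos   # erroneous positive indications
    false_neg = len(unmatched)             # missed shot changes
    # By convention here, an empty denominator yields a perfect score.
    precision = true_pos / (true_pos + false_pos) if detected else 1.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 1.0
    return precision, recall


# Four detections against five reference boundaries:
# three correct, one false positive (75), two misses (90 and 200).
p, r = precision_recall([10, 50, 75, 120], [10, 50, 90, 120, 200])
print(f"precision={p:.2f} recall={r:.2f}")
```

In this example, a detector tuned to be more sensitive would tend to raise recall at the cost of precision, which is the trade-off discussed in the paragraphs that follow.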
No prior art shot change detection method or system has approached 100% precision and recall, including state-of-the-art academic computer-vision research systems requiring vast computational resources for offline evaluation. However, some systems utilizing shot change detection may be more sensitive to false positives than to false negatives, or vice versa. An example is a motion-compensated de-interlacing system. An overabundance of false positives may result in the de-interlacing system reverting to spatial methods instead of temporal methods, which may temporarily reduce the effective resolution of the output for the duration of a shot change. An overabundance of false negatives may lead to bad motion compensation, resulting in highly visible artifacts and discontinuities that most viewers would find objectionable. In such circumstances, the best trade-off would be for the shot change detection system to be highly sensitive and prone to false positives, rather than failing to report marginal indications of shot changes, which may result in visible artifacts.
An example of a condition in which false positives are undesirable is a system that utilizes shot change detection to modulate the motion estimation contribution to a single-camera depth map extraction statistical model. In such circumstances, the motion estimation may contribute only a small part of the overall model, but an overabundance of false-positive shot change indications introduces severe model instabilities, resulting in rapid temporal changes to the depth model. This, in turn, results in visible modulations of depth over time as viewed by a human observer.
Identifying shot changes via an automated system is problematic—methods known in the art suffer from two primary defects: (1) false-positive shot change detections (i.e., low precision), as when a flashbulb temporarily lights a stadium for a single frame or field of video, and (2) false-negatives or “missed” shot change detections (i.e., low recall), as when a director cuts from a first camera to another camera pointing to the same scene and object as the first camera, but from a slightly different angle. Other prior art systems have been proven reliable but are complex in that they require very expensive and extensive computation and concomitant hardware. As a result, such systems are not economically feasible for real-time applications.
Several prior art methods for detecting shot changes and their limitations have been well-documented and evaluated in an article by Smeaton, A. F., Over, P., and Doherty, A. R., titled “Video shot boundary detection: Seven years of TRECVid activity,” Comput. Vis. Image Underst. 114, 4 (April 2010), 411-418 (hereinafter “Smeaton et al.”). The nomenclature and taxonomy of Smeaton et al. are incorporated herein by reference in their entirety and shall be used in the remainder of the description.
Existing shot change detection methods include those employing a continuity measure, such as changes in overall color values from image to image. Continuity measures may include statistical measurements, such as the mean luminosity of images, or multichannel histogram-based methods. Used alone, such continuity measures typically result in false shot change detections, both positive and negative. Other methods build upon the concept of continuity by incorporating motion vector data from compressed video data. However, such methods rely on the accuracy and robustness of the encoded motion vectors, which are not calculated and encoded for the purpose of shot change detection. Motion vectors are typically used to encode a residual prediction signal, where robustness and accuracy are not of primary concern. Consequently, methods that employ motion vectors as a primary feature for detecting a shot change also encounter frequent false positives and false negatives.
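For illustration, a histogram-based continuity measure of the general kind described above may be sketched as follows. This is a simplified example under stated assumptions, not the method of any cited system: each frame is assumed to be a non-empty flat list of 8-bit luma samples, and the function names, the bin count, and the threshold value are hypothetical choices made for this sketch.

```python
def luma_histogram(frame, bins=16):
    """Normalized histogram of 8-bit luma samples (frame: flat list of 0..255)."""
    hist = [0] * bins
    for y in frame:
        hist[y * bins // 256] += 1
    n = len(frame)
    return [h / n for h in hist]


def histogram_distance(h1, h2):
    """Half the L1 distance: 0.0 for identical histograms, 1.0 for disjoint ones."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))


def detect_cuts(frames, threshold=0.5):
    """Return each index i at which frame i differs sharply from frame i - 1."""
    hists = [luma_histogram(f) for f in frames]
    return [
        i
        for i in range(1, len(hists))
        if histogram_distance(hists[i - 1], hists[i]) > threshold
    ]


dark = [20] * 64     # synthetic dark frame
bright = [200] * 64  # synthetic bright frame

# An abrupt cut from a dark shot to a bright shot is reported at frame 2.
print(detect_cuts([dark, dark, bright, bright]))

# A single bright frame (e.g., a flash) is reported as two shot changes,
# illustrating the false positives a fixed threshold can produce.
print(detect_cuts([dark, bright, dark]))
```

The second call shows why such a measure is unreliable when used alone: a one-frame brightness change is indistinguishable from a cut, and two different shots with similar luma distributions would produce no indication at all.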
Other methods employ Bayesian statistical formulations to estimate a likelihood value for detecting shot changes; however, Bayesian statistical formulations are complex and are not applicable to real-time or near-real-time applications.
Yet other methods utilize some variation of a feature-vector approach calculated on all, or certain parts of, incoming images; however, such approaches require some heuristics that work well for certain classes of video (e.g., tennis matches), but not others (e.g., episodic TV dramas), and as a result, require some human intervention in order to ensure accuracy and, thus, are not suitable for automated shot change detection.
Further, many existing shot change detection methods cannot positively identify shot changes when the shot boundary spans two scenes with similar luminance, color, or scene content characteristics, or in other pathological cases. Additionally, most of the methods of the prior art rely on simple heuristic thresholds and are therefore unreliable when confronted with unusual video sequences or boundary cases. Further, these methods are all “hard-wired” in the sense that the robustness of shot change detection with respect to recall and precision is neither readily interchangeable nor selectable.
Accordingly, what would be desirable, but has not yet been provided, is a system and method for real-time or near-real-time automatic, unattended detection of shot changes within an image sequence, with a minimum of false positives and false negatives. Such a system and method would allow an external system to reliably choose to what degree recall, precision, or both, are of primary concern in the external system requiring shot change detection.
Tunable precision and/or recall is of importance to real-time, or near-real-time combined systems that may need to perform motion-compensated de-interlacing as an early step, and motion-compensated frame rate conversion as a later step, without requiring independent, expensive shot change detection systems, each tuned for particular precision and recall requirements. While such a system is illustrative, it is by no means limiting. By way of example, a system performing the tasks of film pull-down cadence detection, and then motion-compensated de-noising, may have different requirements for precision and recall for each of these steps, but would be equally well served by certain embodiments of the present invention.