1. Technical Field
The present invention relates generally to digital video processing and analysis and, more specifically, to a system and method for detecting scene changes utilizing a combination of a plurality of difference metrics applied to successive video frames and adaptive threshold selection methods for the metrics so as to identify abrupt and gradual scene changes.
2. Description of Related Art
The use of digital video in multimedia systems is becoming quite popular. Moreover, videos are playing an increasingly important role in both education and commerce. Indeed, video is utilized in connection with a myriad of applications such as multimedia systems, defense/civilian satellites, scientific experiments, biomedical imaging, industrial inspections, home entertainment systems, etc. As a result, there has been a rapid increase in the amount of video data generated. The process of searching through a video so as to annotate and/or obtain a quick overview of the video content can be tedious and time consuming using conventional digital video applications such as fast forward or rewind. Therefore, in order to search for desired video content in an effective and efficient manner, the videos should be appropriately indexed into a database. In this manner, a user can readily retrieve desired videos or sections of video without having to browse the entire video database.
The primary task in this indexing process is segmenting the video into meaningful continuous units or “shots,” which affords an efficient method for video browsing and content based retrieval. A “shot” or “take” in video parlance refers to a contiguous recording of one or more video frames depicting a continuous action in time and space. Typically, transitions between shots (also referred to as “scene changes” or “cuts”) are created intentionally by film directors. A statistical characterization of a video can also be performed in terms of the different attributes of shots, such as their length and type (close, medium, long, camera movement, etc.). At a global level, such a characterization can be used for video clustering.
During a shot, the camera might remain fixed or it might undergo one of the characteristic motions, i.e., panning, zooming, tilting or tracking. In general, the process of segmenting a video into a sequence of shots is non-trivial, complicated by the large variety of transitions between shots made possible by modern editing equipment. Shot transitions consisting primarily of visually abrupt changes or camera breaks (also called “straight cuts”) are readily identifiable by examining frame-to-frame intensity changes at the pixel level.
In many cases, however, a transition between two shots is made in a gradual manner using special editing machines to achieve a visually pleasing effect. These types of gradual changes are also called “optical cuts.” There are several types of optical cuts, such as “fade in”, “fade out”, “dissolve”, “wipe”, “flips”, “superimpose”, “blow-ups”, “move-ins”, etc. (see, e.g., P. Aigrain, et al., “The automatic real time analysis of film editing and transition effects and its applications,” Computer Graphics, 18:93-103, 1994). In general, the process of detecting optical cuts is challenging and requires a more detailed analysis of the video at a global scale rather than merely analyzing the interframe difference, as is the case with conventional methods.
Conventional methods related to scene change detection began with the study of frame difference signals, where it was demonstrated that the gamma distribution provides a good fit to the probability density function of frame difference signals (see, A. Seyler, “Probability distributions of television frame differences,” Proceedings of the IEEE, 5:355-366, 1965). Subsequent studies of frame difference signals were performed in connection with four different types of video: a sports video, an action video, a talk show and a cartoon animation. The images were subdivided into 8×8 blocks, with each block represented by its average value, and the difference metric was computed as the average value of the magnitude of the interframe difference (see D. Coll, et al., “Image activity characteristics in broadcast television,” IEEE Transactions on Communications, 26:1201-1206, 1976). It was shown that this measure was capable of detecting shot boundaries with reasonable accuracy.
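The block-based measure described above can be sketched as follows. This is a minimal illustration; the function names and the list-of-rows frame representation are assumptions for exposition, not taken from the cited work:

```python
def block_averages(frame, block=8):
    """Represent a grayscale frame (a list of pixel rows) by the mean of
    each block x block tile, as in the block-based difference measure."""
    h, w = len(frame), len(frame[0])
    means = []
    for by in range(0, h, block):
        for bx in range(0, w, block):
            tile = [frame[y][x]
                    for y in range(by, min(by + block, h))
                    for x in range(bx, min(bx + block, w))]
            means.append(sum(tile) / len(tile))
    return means

def block_difference(prev, curr, block=8):
    """Average magnitude of the interframe difference of the block averages."""
    a = block_averages(prev, block)
    b = block_averages(curr, block)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)
```

A large value of block_difference between consecutive frames then suggests a shot boundary.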
Other conventional methods for cut detection have been proposed more recently that are characterized either as frame difference methods or as histogram based methods. For instance, with the frame difference method, assuming that fij(t) is the representative pixel value at location (i,j) and at time t, the metric is given by either:

dt = Σij (fij(t) − fij(t−1))²   or   dt = Σij U((fij(t) − fij(t−1)) ≥ Td)

where U(x)=1 if the condition x is satisfied, and zero otherwise, and where Td is an appropriate threshold. In the first case, the square of the difference is accumulated (the magnitude of the difference is sometimes used in place of the square). In the second case, the number of pixels undergoing change is counted. A cut is declared if dt exceeds an appropriate threshold.
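Both interframe metrics can be sketched as below; frames are modeled as flat lists of pixel values, and the names are illustrative rather than from any cited reference:

```python
def sum_squared_difference(prev, curr):
    """First metric: dt = sum over pixels of (f(t) - f(t-1))^2."""
    return sum((c - p) ** 2 for p, c in zip(prev, curr))

def changed_pixel_count(prev, curr, t_d):
    """Second metric: dt = number of pixels whose change meets the threshold Td.
    The absolute difference is used here so changes in either direction count."""
    return sum(1 for p, c in zip(prev, curr) if abs(c - p) >= t_d)
```

In either case a cut would be declared when the returned value exceeds a chosen threshold.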
One problem associated with using these interframe approaches for detecting scene changes, however, is that they cannot adequately cope with conditions such as noise, minor illumination changes and camera motion. When such conditions exist, the interframe difference will, in virtually all instances, be large, thereby resulting in a false alarm. To obviate this problem, several techniques have been proposed, including the three and five frame scheme (see K. Otsuji, et al., “Video browsing using brightness data,” Proc. SPIE: Visual Communications and Image Processing, pages 980-989, 1991) and a twin-comparison approach (see H. Zhang, et al., “Automatic partitioning of full motion video,” Multimedia Systems, 1:10-28, 1993), where a simple motion analysis is used to check whether an actual transition has occurred. This process, however, is typically computationally intensive. In addition, despite the corrections, these methods are not able to detect gradual changes.
Because of the extreme sensitivity of the interframe difference methods to object motion, illumination changes and camera motion, intensity histogram comparison techniques have been proposed (see, e.g., Y. Tonomura, “Video handling based on structured information for hypermedia systems,” Proc. ACM: Multimedia Information Systems, Singapore, pp. 333-344, 1991; and Y. Z. Hsu, et al., “New likelihood test methods for change detection in image sequences,” Computer Vision, Graphics and Image Processing, 26:73-106, 1984). In general, with such methods, cuts are detected using the absolute sum of the difference of the intensity histograms, i.e., if:

Σi |Hi(t) − Hi(t−1)| ≥ Th

where Hi(t) is the count in the ith bin at time t and where Th is an appropriate threshold. This method assumes that the brightness distribution is tied to the image content and changes only when the content changes. The histogram is likely to be minimally affected by small object motion and even less sensitive to noise, thereby resulting in a reduction of the number of false alarms.
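The histogram comparison above can be sketched as follows (a minimal illustration with assumed function names; frames are flat lists of integer intensities):

```python
def intensity_histogram(pixels, bins=256):
    """Count of pixels falling in each intensity bin, i.e. Hi(t)."""
    h = [0] * bins
    for p in pixels:
        h[p] += 1
    return h

def histogram_difference(prev, curr, bins=256):
    """Sum of absolute bin-count differences between consecutive frames."""
    hp = intensity_histogram(prev, bins)
    hc = intensity_histogram(curr, bins)
    return sum(abs(a - b) for a, b in zip(hp, hc))
```

Note that object motion which merely rearranges pixels leaves the histogram unchanged: histogram_difference([1, 2, 3], [3, 1, 2]) is zero, which is precisely why this measure is less sensitive to motion than pixel-wise differencing.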
In some instances, rather than using the above mentioned metric, either one of the following correlation metrics may be used:

dt = Σi (Hi(t) − Hi(t−1))² / Hi(t−1)   or   dt = Σi ((Hi(t) − Hi(t−1)) / (Hi(t) + Hi(t−1)))²

whereby a scene change will be detected if the above metrics exceed a predetermined threshold. Instead of analyzing the intensity, other conventional techniques have utilized the color space for video segmentation (see, e.g., U. Gargi et al., “Evaluation of video sequence indexing and hierarchical video indexing,” Proc. SPIE: Storage and Retrieval for Image and Video Databases, San Jose, pp. 144-151, 1995; and S. Devadiga, et al., “Semiautomatic video database system,” Proc. SPIE: Storage and Retrieval for Image and Video Databases, San Jose, pp. 262-267, 1995).
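Both correlation metrics can be sketched as below. The guards against empty bins (to avoid division by zero) are an implementation detail not spelled out in the original formulas, and the function names are assumptions:

```python
def chi_square_metric(hp, hc):
    """dt = sum over bins of (Hi(t) - Hi(t-1))^2 / Hi(t-1).
    Bins that are empty in the reference frame are skipped."""
    return sum((b - a) ** 2 / a for a, b in zip(hp, hc) if a > 0)

def normalized_difference_metric(hp, hc):
    """dt = sum over bins of ((Hi(t) - Hi(t-1)) / (Hi(t) + Hi(t-1)))^2.
    Bin pairs that are empty in both frames are skipped."""
    return sum(((b - a) / (a + b)) ** 2 for a, b in zip(hp, hc) if a + b > 0)
```

Both take the per-frame histograms (lists of bin counts) as input and grow with the disagreement between them.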
Another conventional metric that is utilized is known as the Kolmogorov-Smirnov test metric. This metric is based on the cumulative histogram sum (CHP) of the previous and the current frames:

dt = MAXj |CHPj(t) − CHPj(t−1)|

where CHPj(t) is the cumulative histogram sum of the jth intensity value in frame t (see N. V. Patel, et al., “Compressed video processing for cut detection,” IEEE Proc. Vis. Image Signal Proc., 143:315-323, 1996).
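This metric can be sketched as follows, again with illustrative names; it is the maximum vertical distance between the two cumulative histogram curves:

```python
def ks_metric(hp, hc):
    """dt = MAX over j of |CHPj(t) - CHPj(t-1)|, where CHPj is the
    cumulative histogram sum up to intensity value j."""
    cum_prev = cum_curr = best = 0
    for a, b in zip(hp, hc):
        cum_prev += a
        cum_curr += b
        best = max(best, abs(cum_prev - cum_curr))
    return best
```

Identical histograms give a metric of zero; a large value indicates a substantial shift in the intensity distribution between the two frames.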
In addition to the above mentioned methods, model based methods have also been suggested to improve the cut detection performance. These methods attempt to model the distribution of the pixel differences for the various types of cuts that are possible, which is then used for scene change detection.
Virtually all the above conventional cut detection methods utilize uncompressed video for scene change detection. While several methods purportedly operate in the compressed domain, in reality, the methods select a subset of the DCT coefficients for each one of the MJPEG blocks and continue with the processing after the blocks are decompressed (see, e.g., F. Arman, et al., “Image processing on compressed data for large video databases,” Proc. ACM Multimedia, 1993; and E. Deadorff, et al., “Video scene decomposition with motion picture parser,” Proc. SPIE: Electronic Imaging, Science and Technology, 1994). Also, when MPEG video is used, only the I frames are considered, thereby reducing the problem to an equivalent MJPEG problem (see J. Meng, et al., “Scene change detection in MPEG compressed video sequence,” Proc. SPIE: Digital Video Compression Algorithms and Techniques, 2419:14-25, 1995). Motion vectors directly from the P and B frames of an MPEG video are rarely considered primarily because they are very unreliable for any real computation and, in any event, can be estimated from the I frames.
Shots and the associated cuts can be classified into several types as a convention for directors and camera operators, such as:

Static shots: These are shots that are taken without moving the camera. This category includes close, medium, full and long shots. They all result in different types of interframe differences. For instance, close shots produce more changes than long shots, and so on.

Camera moves: These shots include the classical camera movements, i.e., zoom, tilt, pan, etc. The interframe difference is obviously a function of the camera speed.

Tracking shots: In these shots, the camera tries to follow the target object. Again, the interframe difference depends on the relative motion between the camera and the object.
Because different types of videos with different motion types result in different interframe changes, the criteria to detect cuts should be different when different types of shots are processed. Therefore, a method that utilizes soft thresholds that are adaptively selected based on the types of shots would be advantageous.
Another issue is that since films and videos have different frame rates, when film is converted to video, there is a need to compensate for this discrepancy. Typically, 3:2 pulldown is effected, whereby every other film frame is played slightly longer. This results in duplicate frames in the resulting video, which may result in false alarms if the cut method is based solely on interframe difference. This problem is particularly true for animations.
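The duplicate-frame effect of 3:2 pulldown can be illustrated with the following crude, frame-level sketch. Real pulldown alternates three and two fields per film frame; here frames are reduced to single numbers for brevity, and both function names are hypothetical:

```python
def three_two_pulldown(film_frames):
    """Simulate pulldown at the frame level: every other film frame is
    held for an extra video frame (field-level details are ignored)."""
    video = []
    for i, frame in enumerate(film_frames):
        video.append(frame)
        if i % 2 == 0:  # every other frame is played slightly longer
            video.append(frame)
    return video

def duplicate_frame_flags(video, eps=1e-6):
    """Flag frames whose difference from the previous frame is (near) zero.
    Such frames register as 'no change' and can mislead a cut detector
    that relies solely on the interframe difference."""
    return [abs(video[i] - video[i - 1]) < eps for i in range(1, len(video))]
```

The periodic True flags in the output show why a purely difference-based method can be confused by converted film material.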
As for the different types of transitions, some of the most frequent ones are as follows:

Fade in: The incoming scene gradually appears starting from a blank screen;

Fade out: The complete reverse of the above;

Dissolve: The transition is characterized by a linear interpolation between the last frame of the outgoing scene and the first frame of the incoming scene;

Wipe: The incoming scene gradually replaces the outgoing scene from one end of the screen to the other; and

Flip: The outgoing scene is squeezed out from one corner to the other, revealing the incoming scene.
In addition to the above transition types, other types such as freeze-frame, blow up, move in, montage, etc. are also used. Beyond the transition type, another issue is its duration. While in theory the duration of a transition could be arbitrary, in practice it is limited to a reasonable length relative to the duration of the surrounding shots.
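A dissolve as described above (a linear interpolation between the last outgoing frame and the first incoming frame) can be sketched as follows, with frames modeled as flat lists of pixel values and the function name assumed:

```python
def dissolve(outgoing, incoming, n):
    """Generate n transition frames by linearly interpolating between the last
    frame of the outgoing scene and the first frame of the incoming scene."""
    frames = []
    for k in range(1, n + 1):
        alpha = k / (n + 1)  # blending weight ramps toward the incoming scene
        frames.append([(1 - alpha) * o + alpha * i
                       for o, i in zip(outgoing, incoming)])
    return frames
```

Because each transition frame differs only slightly from its neighbors, no single interframe difference is large, which is why such gradual cuts evade simple threshold-based detection.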
The majority of the above-mentioned conventional cut detection methods do not employ a global strategy for identifying cuts in video as they typically employ only one metric for identifying scene changes. In addition, such methods typically employ hard thresholds resulting in the successful identification of scene changes for only certain sections of a video or, at best, for only certain types of video. Accordingly, an improved system and method for accurately detecting scene changes, both abrupt and gradual, is highly desirable.