1. Field of the Invention
The present invention relates to indexing video sequences and, more specifically, to systems and methods for indexing audio/video sequences into episodes and highlights.
2. Background Information
With motion pictures and home movies, there is a desire to break long sequences of video into segments to, for example, catalog their content and index their location within the video sequence. This cataloging and indexing allows specific scenes and events within the video sequence to be quickly located. The issue of long sequences of video arises more frequently in home videos than in professionally-produced videos because the latter are often created in smaller, edited sequences.
Video sequences can be segmented into shorter video segments, known as “shots.” The start and end of each shot are delineated by camera breaks: the turning on of the camera signifies the start of a shot, and the turning off of the camera signifies the end of a shot. These issues are discussed in more depth in Gulrukh Ahanger, et al., “A Survey of Technologies for Parsing and Indexing Digital Video,” Journal of Visual Communication and Image Representation, March 1996, at 28-43, the contents of which are incorporated herein by reference. Various methods for audio-based classification of a video sequence are disclosed in Kenichi Minami, et al., “Video Handling with Music and Speech Detection,” IEEE Multimedia, July-September 1998, at 17-25 and Tong Zhang and C.-C. Jay Kuo, Content-Based Audio Classification and Retrieval for Audiovisual Data Parsing, 69-81 (Kluwer Academic Publishers 2001), the contents of which are incorporated herein by reference.
One technique by which the camera breaks within a video sequence can be detected for dividing the video sequence into shots is through the use of video frame histograms. Each frame making up a video sequence can be reduced to a pixel-level histogram. That is, each pixel in the frame is added to a particular column of the histogram based on the color of the pixel as matched against a color palette where each color is associated with a number in a range from 0 to 255. The respective histograms for successive pairs of frames in the sequence are compared, and if the difference between two successive histograms exceeds a particular threshold, a scene or event change is presumed to have occurred, and a new shot is denoted. This technique is discussed in more detail in HongJiang Zhang, et al., “Developing Power Tools for Video Indexing and Retrieval,” 2185 SPIE 140-149 (8/94) and HongJiang Zhang, et al., “Automatic Partitioning of Full-Motion Video,” Institute of Systems Science, National University of Singapore, 10-28 (1993), the contents of which are incorporated herein by reference.
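By way of illustration only, the histogram comparison described above can be sketched as follows. The sketch assumes that each frame has already been reduced to a flat sequence of palette indices in the range 0 to 255, and it assumes a sum of absolute bin differences as the difference measure; the cited references may use other measures. The function names and the threshold value are hypothetical and are chosen only for this example.

```python
from typing import List, Sequence

NUM_BINS = 256  # one histogram column per palette index, 0 to 255


def frame_histogram(frame: Sequence[int]) -> List[int]:
    """Count how many pixels of the frame map to each palette index."""
    hist = [0] * NUM_BINS
    for pixel in frame:
        hist[pixel] += 1
    return hist


def histogram_difference(h1: Sequence[int], h2: Sequence[int]) -> int:
    """Assumed difference measure: sum of absolute bin-wise differences."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def detect_shot_starts(frames: Sequence[Sequence[int]], threshold: int) -> List[int]:
    """Return the indices of frames at which a new shot is presumed to start.

    Frame 0 always starts the first shot. A later frame starts a new shot
    when its histogram differs from the previous frame's histogram by more
    than the (hypothetical) threshold.
    """
    if not frames:
        return []
    starts = [0]
    previous = frame_histogram(frames[0])
    for index in range(1, len(frames)):
        current = frame_histogram(frames[index])
        if histogram_difference(previous, current) > threshold:
            starts.append(index)
        previous = current
    return starts


# Four tiny 100-pixel frames: two of one color, then two of another.
example_frames = [[10] * 100, [10] * 100, [200] * 100, [200] * 100]
shot_starts = detect_shot_starts(example_frames, threshold=50)
```

In this example the only large histogram difference occurs between the second and third frames, so a new shot is denoted at the third frame. When successive frames share a similar background, the difference stays below the threshold and no new shot is denoted.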
Shots defined using existing techniques tend to be very long, especially when these techniques are applied to home videos. Home videos are frequently taken of a single event, such as children playing in the back yard or a wedding. In such videos, the camera can be left running for an extended period of time (for example, to record a child playing or a sports event). In addition, the background is often the same, such that histograms of successive frames are often similar and few shot boundaries are detected. As a result, long sequences of video must be retained and viewed in their entirety to locate a desired scene or event.