The explosive growth in the use of digital images and video in computer systems and networks has dramatically increased the need for multimedia database systems that effectively index, store, and retrieve image and video data based on their contents. Different techniques have been developed to better support these needs. Techniques for content-based image retrieval enable users to retrieve images based on their visual similarities, while techniques for video shot boundary detection aim to segment a continuous video sequence into visually consistent units, so that the sequence can be efficiently indexed and retrieved.
Video programs are generally formed from a compilation of different video segments. These segments are typically classified as either shots or scenes. A scene is a place or setting where action takes place. A scene can be made up of one shot or many shots that depict a continuous action or event. A shot is a view continuously filmed by one camera without interruption. Each shot is a take. When additional shots of the same action are filmed from the same setup, the resulting shots are “retakes”. Each shot therefore consists of a sequence of consecutive frames, i.e., images, generated during a continuous and uninterrupted operating interval of a single camera. For example, in motion pictures, a shot is a continuous series of frames recorded on film by a single camera from the time it begins recording until it stops. In live television broadcasts, a shot constitutes those images seen on the screen from the time a single camera is broadcast over the air until it is replaced by another camera. Shots can be joined together either in an abrupt mode (e.g., butt-edit or switch), in which the boundary between two consecutive shots (known as a “cut”) is well-defined, or through one of many other editing modes, such as fade or dissolve, which result in a gradual transition from one shot to the next. The particular transition mode employed is generally chosen by the director to provide clues about changes in time and place, which help the viewer follow the progress of events.
There are known automatic video indexing methods which detect abrupt transitions between different shots. For example, U.S. Pat. No. 6,055,025 describes such a method. A “scene” is commonly considered to be a sequence of frames with closely related contents conveying substantially similar information. There are cases where the camera is fixed, thereby producing “still shots”. However, in general, video programs are composed not only of still shots, but also of “moving shots” (i.e., shots in which the camera undergoes operations such as pan, tilt and zoom). Consequently, because of camera motion, the contents of a series of frames over an individual shot may change considerably, resulting in the existence of more than one shot in a given scene. Therefore, while boundaries between different shots are scene boundaries, such boundaries may be a subset of all the scene boundaries that occur in a video program, since camera motion may produce inter-shot scene changes. Therefore, scenes can be defined as collections of shots, where a shot is a continuous set of frames without editing effects. The prior art also defines a scene as having multiple shots with the same theme, such as, for example, a dialog scene showing interchangeably the two actors involved in the dialog. However, there are exceptions to this definition, as there are movies having an opening with one long shot containing several scenes. For the purposes of the present invention, a scene typically consists of shots.
While prior art discloses different methods for finding the shot boundaries in a video program, none of the methods are accurate enough for video indexing purposes. One reason for this is that shots are often spurious, that is, they are generated by artifacts, such as camera flashes, that do not indicate a real change in scene information. Segmenting video into scenes comes closer to the real information that is captured by the video.
For the purposes of the present invention, the term “uniform video segments” describes a collection of successive video frames for which a given visual property is uniform or approximately constant over a period of time. In particular, the present invention deals with color-based uniformity.
Color information is a very useful visual cue for video indexing. Typically, uniform color video segments are collections of successive frames that have a “constant” overall color distribution. For example, in video of outdoor sports, such as soccer or golf, the “green” and “blue” colors predominate due to the presence of grass and sky. If a color histogram is computed for these outdoor scenes, the “green” and “blue” bins will be prominent, that is, they will have the highest number of votes per bin.
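By way of illustration only, the histogram computation described above can be sketched as follows. The quantization into four bins per channel, the function name color_histogram, and the synthetic frame of grass-like and sky-like pixels are assumptions made for this example and are not taken from the invention.

```python
from collections import Counter

def color_histogram(pixels, bins_per_channel=4):
    """Count pixels per quantized (r, g, b) bin; each pixel casts one vote.

    The bin layout (uniform quantization of 8-bit RGB) is an assumption
    chosen for illustration.
    """
    step = 256 // bins_per_channel
    return Counter((r // step, g // step, b // step) for r, g, b in pixels)

# Hypothetical outdoor frame: 60 grass pixels, 30 sky pixels, 10 others.
frame = [(30, 180, 40)] * 60 + [(70, 130, 230)] * 30 + [(200, 200, 200)] * 10

hist = color_histogram(frame)
# The two bins with the most votes; here they are the grass and sky bins.
top_two = [bin_index for bin_index, _ in hist.most_common(2)]
print(top_two)
```

In this synthetic frame the two most populated bins correspond to the grass-like and sky-like colors, which is the sense in which the “green” and “blue” bins are prominent.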
A color superhistogram is generated by sequentially updating color histograms. One way of generating them is as follows. In videos, such as MPEG-1/2/4 videos, the successive frames are organized as I-, P-, and B-frames; these frames come in repeating groups, such as IBBBPBBBPBBBPBBB. Color histograms are generated either for all frames or for selected frames. To improve processing speed, the frames can be subsampled in the temporal domain: either I-frames or B-frames are taken at a given sampling rate. The color superhistogram is generated by combining the information of successive color histograms. This makes color superhistograms an important tool and feature for detecting uniform color segments, because they are a robust and stable color representation of a video. Generally, color information is fundamental for video indexing into uniform color segments. Hence, superhistograms are typically used to characterize video. Superhistograms can be used to identify genre, find program boundaries, and generate visual summaries. For example, using program boundary detection, one can distinguish an episode of Seinfeld from a news program. Uniform color segment boundary detection of the present invention, however, enables the segmentation of a news program into individual story segments.
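As a minimal sketch of the sequential updating described above, the following example folds the histograms of temporally subsampled frames into a running average. The text does not specify how successive histograms are combined, so the cumulative averaging used here is an assumption, as are the function names, the fixed sampling step (which stands in for selecting I-frames or B-frames), and the synthetic frames.

```python
from collections import Counter

def normalized_histogram(pixels, bins_per_channel=4):
    """Quantized RGB histogram scaled so that the bin values sum to 1."""
    step = 256 // bins_per_channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {bin_index: count / total for bin_index, count in counts.items()}

def update_superhistogram(superhist, n_merged, hist):
    """Fold one more frame histogram into a running average (assumed rule)."""
    bins = set(superhist) | set(hist)
    merged = {
        b: (superhist.get(b, 0.0) * n_merged + hist.get(b, 0.0)) / (n_merged + 1)
        for b in bins
    }
    return merged, n_merged + 1

def build_superhistogram(frames, sample_every=4):
    """Subsample frames in time and merge their histograms sequentially."""
    superhist, n_merged = {}, 0
    for frame in frames[::sample_every]:
        hist = normalized_histogram(frame)
        superhist, n_merged = update_superhistogram(superhist, n_merged, hist)
    return superhist

# Hypothetical sequence: four all-grass frames, then four grass-and-sky frames.
grass = [(30, 180, 40)] * 4
mixed = [(30, 180, 40)] * 2 + [(70, 130, 230)] * 2
frames = [grass] * 4 + [mixed] * 4

sh = build_superhistogram(frames, sample_every=4)
print(sh)
```

With a sampling step of four, only the first and fifth frames contribute, so the resulting superhistogram is the average of one all-grass histogram and one half-grass, half-sky histogram.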
Therefore, there is a need for an efficient and accurate method and system for detecting uniform color segment boundaries in a video program.