Sorting through a video clip to locate certain contents or events is a tedious and time-consuming process. One must carefully watch the entire video clip, even though many of its frames may contain no scenes of interest. The problem is more acute in video surveillance, where the monitored scene is captured non-stop over a long period of time. Furthermore, commercial and public security monitoring often involves a network of hundreds of surveillance video cameras, each capturing a continuous, effectively endless video stream. There are billions of surveillance video cameras mounted all over the world. In Shenzhen alone, a city in southern China, it is estimated that more than one million video cameras are deployed.
Therefore, there is a need for a way to summarize or condense a video clip so that it shows only those portions of the video that might contain the desired contents. Some traditional video summarization techniques condense moving objects temporally and display the results as conventional two-dimensional motion pictures. But such condensed two-dimensional motion pictures can clutter the moving objects and make the visual content too dense for a human viewer to digest. Other traditional video summarization techniques simply remove silent frames from the source video clip, which does not achieve an optimal summarization effect.