Videos are available from a large number of sources, such as entertainment, sports, news, web, personal, home, surveillance, and scientific cameras. Analyzing and summarizing such a wide variety of video content to locate interesting objects, events, and patterns is difficult.
Automatic methods are known for extracting meta-data from videos. For example, video processing methods can detect, recognize, and track objects, such as people and vehicles. Most of those methods are designed for a particular genre of videos, for example, sports videos or newscasts. However, the number of objects in a typical video can be large, and not all objects are necessarily associated with interesting events. Because of the complexity and variety of videos, it is often still necessary for users to further view videos and extracted meta-data to gain a better understanding of the context, i.e., the spatial and temporal characteristics of the videos.
Typically, the videos or meta-data are manually processed either in space or alternatively in time. For example, a user might select a surveillance video acquired from a particular location, and then examine that video in detail only for a specific time interval to locate interesting events. Obviously, this requires some prior knowledge of the content, as well as the temporal dynamics of the scene in the video. It would be better if temporal and spatial characteristics of videos could be examined concurrently and automatically.
Therefore, there is a need for a method to analyze and manipulate, temporally and spatially, videos and the meta-data extracted from those videos by automatic techniques. In particular, it is desired to provide improved visualization techniques that can reduce the amount of effort required to process videos and meta-data manually.
Video Summarization
Rendering video in space and time requires large processing, storage, and network resources. Doing this in real time is extremely difficult. Therefore, summaries and abstracts of the videos are frequently used. Video summarization is typically the first step in processing videos. In this step, features in the video are extracted and indexed. In general, summarization includes two processes: temporal segmentation and key-frame abstraction. Temporal segmentation detects boundaries between consecutive camera shots. Key-frame abstraction reduces a shot to one or a small number of indexed key-frames representative of the content. The key-frames can subsequently be retrieved by queries, using the index, and displayed for browsing purposes.
Commonly, the segmentation and abstraction processes are integrated. When a new shot is detected, the key-frame abstraction process is invoked, using features extracted during the segmentation process. The challenge is to generate a video summary automatically based on the context.
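As an illustration only, the two summarization processes described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a method taken from any of the cited work: frames are assumed to be flat sequences of 8-bit gray levels, a shot boundary is assumed wherever the gray-level histograms of consecutive frames differ by more than a hypothetical `threshold`, and the middle frame of each shot is assumed to serve as its key-frame. The names `histogram`, `segment_shots`, and `key_frames` are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumption: a frame is a flat sequence of 8-bit gray levels (0..255).
Frame = Sequence[int]


def histogram(frame: Frame, bins: int = 8) -> List[float]:
    """Normalized gray-level histogram of one frame."""
    counts = [0] * bins
    for p in frame:
        counts[min(p * bins // 256, bins - 1)] += 1
    n = len(frame) or 1
    return [c / n for c in counts]


def hist_distance(a: List[float], b: List[float]) -> float:
    """Half the L1 distance between two normalized histograms (range 0..1)."""
    return 0.5 * sum(abs(x - y) for x, y in zip(a, b))


def segment_shots(frames: Sequence[Frame], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Temporal segmentation: return (first, last) frame indices of each shot."""
    if not frames:
        return []
    shots = []
    start = 0
    prev = histogram(frames[0])
    for i in range(1, len(frames)):
        cur = histogram(frames[i])
        if hist_distance(prev, cur) > threshold:  # assumed boundary test
            shots.append((start, i - 1))
            start = i
        prev = cur  # features are reused for the next comparison
    shots.append((start, len(frames) - 1))
    return shots


def key_frames(shots: List[Tuple[int, int]]) -> List[int]:
    """Key-frame abstraction: pick the middle frame of each shot (an assumption)."""
    return [(first + last) // 2 for first, last in shots]


# Usage: three dark frames followed by four bright frames form two shots.
frames = [[10] * 16] * 3 + [[200] * 16] * 4
shots = segment_shots(frames)
print(shots, key_frames(shots))
```

The histogram computed for boundary detection is carried forward from one comparison to the next, which loosely mirrors the integrated approach in which abstraction reuses features extracted during segmentation.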
Summarization methods can either rely on low-level features such as color, e.g., brightness and dominant color, and motion activity, or more advanced semantic analysis such as object and pattern detection. While those methods are useful and powerful, they are inadequate for real-time analysis.
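For concreteness, two of the low-level color features named above can be computed as in the following hypothetical Python sketch. It assumes a frame is given as a list of `(r, g, b)` tuples with 8-bit channels, uses the Rec. 601 luma weights as one common definition of brightness, and takes the dominant color to be the center of the most populated bin of a coarsely quantized color space. The function names and the `levels` parameter are illustrative assumptions.

```python
from collections import Counter
from typing import List, Tuple

# Assumption: a pixel is an (r, g, b) tuple of 8-bit channel values.
Pixel = Tuple[int, int, int]


def mean_brightness(pixels: List[Pixel]) -> float:
    """Average luma of the frame, using Rec. 601 weights."""
    return sum(0.299 * r + 0.587 * g + 0.114 * b for r, g, b in pixels) / len(pixels)


def dominant_color(pixels: List[Pixel], levels: int = 4) -> Pixel:
    """Center of the most populated bin after quantizing each channel to `levels` steps."""
    step = 256 // levels
    bins = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    (qr, qg, qb), _ = bins.most_common(1)[0]
    half = step // 2
    return (qr * step + half, qg * step + half, qb * step + half)


# Usage: a frame that is mostly red.
frame = [(255, 0, 0)] * 3 + [(0, 0, 255)]
print(mean_brightness(frame), dominant_color(frame))
```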
Video Visualization
Visualization methods commonly combine video frames and extracted information, e.g., meta-data. The meta-data are usually associated with individual frames, and can include bounding boxes for objects and motion vectors.
One method arranges a spatial view in the center of a display and temporal summarized views on the sides to simulate a 3D visualization that gives a sense of the past and future, Mackay, W. E., and Beaudouin-Lafon, M., “DIVA: Exploratory Data Analysis with Multimedia Streams,” Proceedings CHI'98, pp. 416-423, 1998. That method requires a large display screen and meta-data, and can only visualize data from one temporal direction, e.g., either the past or the future.
In another method, the user selects key-frames, and the video is presented temporally as a ‘time tunnel’ to quickly locate interesting events, Wittenburg, K., Forlines, C., Lanning, T., Esenther, A., Harada, S., Miyachi, T., “Rapid Serial Visual Presentation Techniques for Consumer Digital Video Devices”, Proceedings of UIST 2003, pp. 115-124, and [CS-3124]. While that method provides convenient presentation and navigation of videos, it does not support queries.
Another method summarizes a video using opacity and color transfer functions in volume visualization, Daniel, G., and Chen, M. “Video visualization,” Proceedings IEEE Visualization 2003, pp. 409-416, 2003. The color transfer functions indicate different magnitudes of changes, or can be used to remove parts of objects. That method also requires meta-data.
Another method presents spatial relationships over time in a single image, Freeman, W., and Zhang, H. “Shape-Time Photography,” Proceedings of CVPR 2003, pp. 2003. That method requires a stationary stereo camera.
Another method presents the user with a detailed view using successive timelines, Mills, M., Cohen, J. and Wong, Y. “A Magnifier Tool for Video Data,” SIGCHI '92: Proceedings of Human Factors in Computing Systems, Monterey, CA, pp. 93-98, 1992. The method uses a timeline to represent the total duration of a video, and the user can select a portion of the timeline and expand the selected portion into a second timeline. The timelines provide an explicit spatial hierarchical structure for the video.
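The hierarchical timeline idea can be illustrated with a short Python sketch. It is a hypothetical model, not the cited tool: a timeline is assumed to be a time interval in seconds, and a selection is assumed to be given as fractions of the timeline's length. The `Timeline` class and its `expand` method are names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timeline:
    """A timeline covering the interval [start, end] of a video, in seconds."""
    start: float
    end: float

    def expand(self, lo: float, hi: float) -> "Timeline":
        """Expand the selected fraction [lo, hi] of this timeline into a new timeline."""
        if not 0.0 <= lo < hi <= 1.0:
            raise ValueError("selection must satisfy 0 <= lo < hi <= 1")
        span = self.end - self.start
        return Timeline(self.start + lo * span, self.start + hi * span)


# Usage: a 10-minute video, magnified twice.
whole = Timeline(0.0, 600.0)
second = whole.expand(0.25, 0.5)   # second timeline covers 150 s to 300 s
third = second.expand(0.5, 1.0)    # third timeline covers 225 s to 300 s
print(second, third)
```

Each call yields a finer timeline nested inside its parent, which is the hierarchy the successive timelines make explicit on screen.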
Still Image Visualizations
Many techniques are used to visualize and understand still images. One method provides a ‘see-through’ interface called a ‘magic lens’, Bier, E. A., Fishkin, K., Pier, K., and Stone, M. C. “Toolglass and magic lenses: the see-through interface,” Proceedings of SIGGRAPH'93, pp. 73-80, 1993. A magic lens is applied to a screen region to semantically transform the underlying content as expressed by the pixels of the still image. The user moves the lens to control which region is affected. In practice, the magic lens is a composable visual filter that can be used for interactive visualization. The lens can act as a magnifying glass, zooming in on the underlying content. The lens can also function as an ‘x-ray’ tool to reveal otherwise hidden information. Multiple lenses can be stacked on top of one another to provide a composition of the individual lens functions. Different lens orderings can generate different results. The magic lens conserves screen space and provides the ability to view the entire context and the detail of the still image concurrently. The lens enhances interesting information while suppressing distracting information.
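The composition of stacked lenses, and the effect of their ordering, can be shown with a small Python sketch. It is a simplified model under stated assumptions rather than the cited interface: an image is assumed to be a 2D list of 8-bit gray levels, and a lens is assumed to be a rectangular region paired with a per-pixel transform. The names `apply_lens`, `stack`, `invert`, and `brighten` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# Assumptions: an image is a 2D list of 8-bit gray levels; a lens is a
# rectangular region (top, left, bottom, right; bottom/right exclusive)
# paired with a per-pixel transform.
Image = List[List[int]]
Region = Tuple[int, int, int, int]
Lens = Tuple[Region, Callable[[int], int]]


def apply_lens(image: Image, region: Region, transform: Callable[[int], int]) -> Image:
    """Transform only the pixels under the lens; the rest of the image is unchanged."""
    top, left, bottom, right = region
    return [
        [transform(p) if top <= r < bottom and left <= c < right else p
         for c, p in enumerate(row)]
        for r, row in enumerate(image)
    ]


def stack(image: Image, lenses: Sequence[Lens]) -> Image:
    """Apply lenses in order, from the bottom of the stack to the top."""
    for region, transform in lenses:
        image = apply_lens(image, region, transform)
    return image


def invert(p: int) -> int:
    return 255 - p


def brighten(p: int) -> int:
    return min(255, p + 100)


# Usage: two lenses over the same pixel, stacked in both orders.
image = [[50, 50], [50, 50]]
region = (0, 0, 1, 1)
a = stack(image, [(region, invert), (region, brighten)])
b = stack(image, [(region, brighten), (region, invert)])
print(a, b)
```

Inverting and then brightening the covered pixel saturates it at 255, whereas brightening and then inverting yields 105, so the two stacking orders produce different results while pixels outside the lens keep their original values.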
The magic lens has been used in many applications, Bederson, B. B., and Hollan, J. “Pad++: a zooming graphical interface for exploring alternate interface physics,” Proceedings of UIST '94, pp. 17-26, 1994, Hudson, S., Rodenstein, R. and Smith, I. “Debugging Lenses: A New Class of Transparent Tools for User Interface Debugging,” Proceedings of UIST'97, pp. 179-187, and Robertson, G. G., and Mackinlay, J. D. “The document lens,” UIST'93, pp. 101-108, 1993.
However, the prior art magic lenses operate only in the spatial domain. It is desired to provide a magic lens that can operate concurrently in both the spatial and the temporal domain.