While spatial image warping is extensively used in image and video editing applications to create a wide variety of interesting special effects, only very primitive tools exist for manipulating the temporal flow of a video. For example, tools are available for temporally speeding up (or slowing down) a video, comparable to image zoom, or for “in-out” video selection, comparable to image crop and shift. But there are no tools that implement the spatio-temporal analogues of more general image warps, such as the various image distortion effects found in common image editing applications.
Imagine a person standing in the middle of a crowded square, looking around. When asked to describe his dynamic surroundings, he will usually describe ongoing actions, for example: “some people are talking in the southern corner, others are eating in the north”, and so on. This kind of description ignores the chronological time at which each activity was observed. Owing to the limited field of view of the human eye, people cannot take in an entire panoramic scene at a single instant. Instead, the scene is examined over time as the eyes scan it. Nevertheless, this does not prevent us from obtaining a realistic impression of our dynamic surroundings and describing it.
The space-time volume, in which the 2D frames of a video sequence are stacked along the time axis, was introduced as the epipolar volume by Bolles et al. [2, 4], who analyzed slices perpendicular to the image plane (epipolar plane images) to track features in image sequences.
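The epipolar-plane-image construction can be illustrated with a small sketch (hypothetical NumPy code, not taken from the cited work): frames are stacked into a volume of shape (time, y, x), and fixing a scanline y yields a 2D time-versus-x slice in which a feature translating at constant velocity traces a straight line.

```python
import numpy as np

# Synthetic video: a bright dot moving one pixel right per frame.
T, H, W = 8, 16, 16
volume = np.zeros((T, H, W))          # space-time volume: (time, y, x)
for t in range(T):
    volume[t, 8, 4 + t] = 1.0         # dot travels along scanline y = 8

# Epipolar plane image: slice the volume at a fixed scanline.
epi = volume[:, 8, :]                 # shape (T, W): rows = time, cols = x

# Constant-velocity motion appears as a straight diagonal line in the EPI.
ts, xs = np.nonzero(epi)
assert np.all(xs - ts == 4)           # slope 1 pixel/frame, intercept x = 4
```

Tracking a feature then amounts to fitting such lines in the slice, which is the essence of the epipolar-plane-image analysis.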
Light fields are also related to the space-time volume: they correspond to 4D subsets of the general 7D plenoptic function [17], which describes the intensity of light rays at any location, direction, wavelength, and time. Light field rendering algorithms [18] operate on 4D subsets of the plenoptic function, extracting 2D slices corresponding to desired views. The space-time volume is a 3D subset of the plenoptic function, where two dimensions correspond to ray directions, while the third dimension defines the time or the camera position.
Multiple centers of projection images [19] and multiperspective panoramas [30] may also be considered as two-dimensional slices through a space-time volume spanned by a moving camera.
Klein et al. [8, 9] also utilize the space-time volume representation of a video sequence, and explore the use of arbitrarily shaped slices through this volume. This was done in the context of developing new non-photorealistic rendering tools for video, inspired by the Cubist and Futurist art movements. They define the concept of a rendering solid, which is a sub-volume carved out of the space-time volume, and generate a non-photorealistic video by compositing planar slices that advance through these solids.
Cohen et al. [6] describe how a non-planar slice through a stack of images (which is essentially a space-time volume) could be used to combine different parts from images captured at different times to form a single still image. This idea was further explored by Agarwala et al. [1]. Their “digital photomontage” system presents the user with a stack of images as a single, three-dimensional entity. The goal of their system is to produce a single composite still image, and they have not discussed the possibilities of generating dynamic movies from such 3D image stacks. For example, they discuss the creation of a stroboscopic visualization of a moving subject from a video sequence, but not the manipulation of the video segment to produce a novel video.
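The non-planar slice idea can be made concrete with a toy sketch (hypothetical code, not from [6] or [1]): a per-pixel time map t(y, x) defines a surface through the image stack, and the composite still image takes each output pixel from the frame that the surface selects at that pixel.

```python
import numpy as np

# Stack of frames (time, y, x); each frame is filled with its own index
# so the provenance of every composite pixel is visible.
T, H, W = 4, 6, 6
stack = np.stack([np.full((H, W), t, dtype=float) for t in range(T)])

# Non-planar time surface: left half from frame 0, right half from frame 3.
time_map = np.zeros((H, W), dtype=int)
time_map[:, W // 2:] = 3

# Composite still image: pick stack[time_map[y, x], y, x] for every pixel.
ys, xs = np.indices((H, W))
composite = stack[time_map, ys, xs]

assert composite[0, 0] == 0.0 and composite[0, W - 1] == 3.0
```

Systems such as digital photomontage additionally optimize where the surface transitions fall so that the seams are invisible; the sketch above only shows the selection mechanism itself.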
Video textures [Schödl et al. [15]] and graphcut textures [Kwatra et al. [10]] are also related to this work, as they describe techniques for video-based rendering. Schödl et al. generate new videos from existing ones by finding good transition points in the video sequence, while Kwatra et al. show how the quality of such transitions may be improved by using more general cuts through the space-time volume.
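The transition-finding step can be sketched as follows (a hypothetical simplification, not the cited implementation): score every pair of frames by their pixel-wise distance and jump between the closest pair, so that the video can loop through the transition without a visible seam.

```python
import numpy as np

# Toy sequence of T frames; frame 5 is made a repeat of frame 1,
# so the pair (1, 5) is an ideal transition point.
T, H, W = 6, 4, 4
rng = np.random.default_rng(0)
frames = rng.random((T, H, W))
frames[5] = frames[1]

# Pairwise L2 distance between all frames.
flat = frames.reshape(T, -1)
dist = np.linalg.norm(flat[:, None] - flat[None, :], axis=2)
np.fill_diagonal(dist, np.inf)        # ignore trivial self-matches

# Best transition: the most similar pair of distinct frames.
i, j = np.unravel_index(np.argmin(dist), dist.shape)
assert {i, j} == {1, 5}
```

The actual video-textures system additionally filters these costs over small temporal windows to preserve motion continuity across the jump; graphcut textures generalize the straight cut to an arbitrary seam through the space-time volume.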
None of the above-mentioned publications, however, addresses meaningful ways in which a user may specify and control various spatio-temporal warps of dynamic video sequences so as to produce a variety of interesting and useful effects.
While it is known to process a sequence of video image frames by using video content from different frames and merging such content so as to create a new frame, known approaches have mostly focused on producing still images using photo-montage techniques or have required translation of the camera relative to the scene.
1. Related Work
The most popular approach for the mosaicing of dynamic scenes is to compress all of the scene information into a single static mosaic image. The description of scene dynamics in a static mosaic varies. Early approaches eliminated all dynamic information from the scene, as dynamic changes between images were undesired [16]. More recent methods encapsulate the dynamics of the scene by overlaying several appearances of the moving objects into the static mosaic, resulting in a “stroboscopic” effect [1].
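One simple way to realize such a stroboscopic overlay (a hypothetical sketch under the assumption of a dark, static background and a bright moving object, not the registration-based method of [1]) is a per-pixel maximum over the aligned frames:

```python
import numpy as np

# Toy aligned sequence: a bright object moves 4 pixels right per frame
# over a black background.
T, H, W = 5, 8, 24
frames = np.zeros((T, H, W))
for t in range(T):
    frames[t, 4, 2 + 4 * t] = 1.0

# Stroboscopic mosaic: every appearance of the object in one static image.
strobe = frames.max(axis=0)

assert int(strobe.sum()) == T          # T separate copies of the object
```

Real systems must first register the frames and segment the moving object before compositing; the maximum here stands in for that compositing step only.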
An attempt to combine a panoramic view with scene dynamics was proposed in [20]. The original video frames were played on top of the panoramic static mosaic, registered to their locations in the mosaic. The resulting video is mostly stationary, and motion is visible only at the location of the current frame.
The present invention addresses the problem of generating the impression of a realistic panoramic video, in which all activities take place simultaneously. The most common way to obtain such panoramic videos is to equip a video camera with a panoramic lens [21]. Indeed, if all cameras were equipped with panoramic lenses, life would be easier for computer vision. Unfortunately, such lenses are not convenient to use, and they suffer from quality problems such as low resolution and distortion. Alternatively, panoramic videos can be created by stitching together regular videos from several cameras with overlapping fields of view [22]. In either case, these solutions require equipment that is not available to the common video user.
In many cases, a preliminary task before mosaicing is motion analysis for the alignment of the input video frames. Many motion analysis methods exist; some offer robust motion computation that overcomes the presence of moving objects in the scene [3, 16]. The method proposed in [13] allows image motion to be computed even in the presence of dynamic texture, and in [7] motion is computed for dynamic scenes.