Digital editing of captured video footage has become a common step in movie post-production, mainly due to advances in the fields of computer graphics and computer vision. Video editing tasks range from basic operations such as trimming, cutting, splitting and resizing video segments to more elaborate ones such as effect filters, editing of object textures, and removing or adding objects in a video segment, among others.
A significant difference between video and still image editing is the requirement that the result has to be temporally consistent. Temporal consistency refers to a smooth transition between successive frames, coherent with the motion of the objects in the sequence. Due to this constraint, the editing of a video cannot be reduced to a series of independent image editing problems. The temporal interdependence imposed by the motion has to be taken into account.
Many approaches to video editing estimate motion trajectories from the video and compute the edited video as the minimizer of an energy functional. In this context the video, or a region of interest (ROI) in it, is represented as a vector in $\mathbb{R}^N$, where the number of variables N corresponds to the number of pixels in the ROI. For example, for a rectangular ROI of width W, height H and T frames, with color encoded in a 3-channel color space such as RGB, we have N = 3WHT. The edited video is then computed by minimizing an energy functional $E:\mathbb{R}^N \to \mathbb{R}$ with a suitable optimization tool. The energy functional is designed so that its minimizers have the “desired properties”. These properties are dictated by the specific editing task and by certain general requirements, such as temporal and spatial consistency.
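To give a sense of scale, consider a purely illustrative ROI (the dimensions below are hypothetical, not taken from any specific method) covering a full-HD frame, W = 1920 and H = 1080, over T = 100 frames in RGB:

$$N = 3WHT = 3 \cdot 1920 \cdot 1080 \cdot 100 = 622{,}080{,}000$$

Even a clip of a few seconds thus leads to an optimization problem with hundreds of millions of unknowns.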
In particular, we focus on video energy functionals having the following structure:
$$E(u) = \sum_{t=0}^{T} E_t^{e}(u_t) + \sum_{t=0}^{T-1} E_{t,t+1}^{tc}(u_t, u_{t+1}) \qquad \text{(Eq. 1)}$$
Here $u \in \mathbb{R}^N$ denotes the vectorized unknown video, t = 0, ..., T is the frame index, and $u_t$ represents the t-th frame of u (also as a vector). Equation (Eq. 1) states that the energy E can be decomposed as a sum of two types of terms.
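The decomposition in Eq. 1 can be illustrated with a minimal Python sketch. The function `total_energy` and the two toy terms `attach_to_zero` and `static_scene_coupling` are hypothetical names introduced only for illustration; in particular, the coupling term assumes zero motion between frames, which is a simplification and not one of the temporal consistency models discussed in this text.

```python
from typing import Callable, Sequence

Frame = Sequence[float]  # one vectorized frame u_t


def total_energy(
    frames: Sequence[Frame],
    frame_energy: Callable[[int, Frame], float],
    pair_energy: Callable[[int, Frame, Frame], float],
) -> float:
    """Evaluate the two summations of Eq. 1 for a list of frames."""
    # First summation: one single-frame term per frame.
    single = sum(frame_energy(t, u_t) for t, u_t in enumerate(frames))
    # Second summation: one coupling term per pair of contiguous frames.
    coupled = sum(
        pair_energy(t, frames[t], frames[t + 1])
        for t in range(len(frames) - 1)
    )
    return single + coupled


# Hypothetical toy terms, chosen only to make the sketch runnable.
def attach_to_zero(t: int, u_t: Frame) -> float:
    # Quadratic attachment of the frame to an all-zero frame.
    return sum(value * value for value in u_t)


def static_scene_coupling(t: int, u_t: Frame, u_next: Frame) -> float:
    # Assumes zero motion: penalizes any change between contiguous frames.
    return sum((b - a) ** 2 for a, b in zip(u_t, u_next))


video = [[0.0, 1.0], [0.0, 2.0], [1.0, 2.0]]  # three frames of two pixels
energy = total_energy(video, attach_to_zero, static_scene_coupling)
```

The single-frame terms can be evaluated independently of each other, whereas each coupling term needs two contiguous frames at once.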
The terms in the first summation are single-frame energies $E_t^{e}(u_t)$. Their specific form depends on the editing task. For example, single-frame editing energies like the following have often been used in the literature:
$$E_t^{e}(u_t) = \sum_{x \in \Omega} \big(u(x,t) - f(x,t)\big)^2 + \frac{\lambda}{p} \sum_{x \in \Omega} \big\| \nabla u(x,t) - g(x,t) \big\|^{p}. \qquad \text{(Eq. 1.1)}$$
Here Ω denotes the frame domain (typically a rectangle) and x ∈ Ω is a pixel location, i.e., u(x,t) is the grey or color level of the pixel located at x in frame t of video u. ∇ is a discrete spatial gradient operator (for example, computed with finite differences), λ, p ∈ ℝ are parameters of the energy, f is a video and g is a vector field (for example, the spatial gradient of a video); f and g are given, typically as the result of a previous processing step. The first summation is a quadratic attachment to the given video f, and the second summation is an attachment, in the p-norm, of the discrete gradient of u to g. As an example, a smoothing filter can be designed by setting f to the original video and g = 0. If p = 2, the resulting smoothing is comparable to a Gaussian blur of the original video. If p = 1, the smoothing preserves edges. As another example, the energy can also be used to generate a “cartoon filter” by defining g as a simplified version of the gradient of the original video, keeping only large gradients (associated with significant edges) and removing smaller gradients (associated with texture, details, etc.). These examples are given here only to fix ideas. The specific form of the single-frame energy term $E_t^{e}$ depends on the desired editing and may not have the structure given in Eq. 1.1, except for the fact that it depends only on frame t.
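As a rough illustration, an energy with the structure of Eq. 1.1 could be evaluated for one grey-level frame as sketched below. Several choices are assumptions made only for this sketch: the gradient uses forward differences with a zero difference across the last row and column, the norm is the Euclidean norm of the two-component gradient residual, and the hypothetical parameter `weight` stands for whatever scalar multiplies the second summation.

```python
import math
from typing import List, Tuple

Image = List[List[float]]                # grey-level frame, image[row][col]
Field = List[List[Tuple[float, float]]]  # target gradient g, (gx, gy) per pixel


def forward_gradient(u: Image, r: int, c: int) -> Tuple[float, float]:
    """Forward differences; zero across the last row/column (assumed rule)."""
    rows, cols = len(u), len(u[0])
    dx = u[r][c + 1] - u[r][c] if c + 1 < cols else 0.0
    dy = u[r + 1][c] - u[r][c] if r + 1 < rows else 0.0
    return dx, dy


def single_frame_energy(
    u: Image, f: Image, g: Field, weight: float, p: float
) -> float:
    """Quadratic attachment to f plus weighted p-norm attachment to g."""
    data_term = 0.0
    gradient_term = 0.0
    for r in range(len(u)):
        for c in range(len(u[0])):
            data_term += (u[r][c] - f[r][c]) ** 2
            dx, dy = forward_gradient(u, r, c)
            gx, gy = g[r][c]
            # Euclidean norm of the gradient residual, raised to the power p.
            gradient_term += math.hypot(dx - gx, dy - gy) ** p
    return data_term + weight * gradient_term


# Toy 2x2 frame with g = 0, as in the smoothing-filter example.
u = [[0.0, 1.0], [2.0, 4.0]]
f = [[0.0, 0.0], [0.0, 0.0]]
g = [[(0.0, 0.0), (0.0, 0.0)], [(0.0, 0.0), (0.0, 0.0)]]
energy = single_frame_energy(u, f, g, weight=1.0, p=2.0)
```

Only the exponent `p` changes between the quadratic and the edge-preserving variants; everything else in the evaluation stays the same.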
The terms in the second summation, $E_{t,t+1}^{tc}(u_t, u_{t+1})$, couple pairs of contiguous frames. Their aim is to enforce temporal consistency by penalizing some measure of the variation of the video along a motion trajectory. The specific form of the temporal consistency terms $E_{t,t+1}^{tc}(u_t, u_{t+1})$ depends on the choice of the temporal consistency criterion. Examples will be given below.
Without the temporal coupling enforced by the temporal consistency terms $E_{t,t+1}^{tc}(u_t, u_{t+1})$, the minimization of the resulting energy reduces to the minimization of the single-frame terms $E_t^{e}(u_t)$. Each of these can be minimized independently for each frame. While this is attractive from a computational point of view, since it allows for parallelization, there is no guarantee that the result will be temporally consistent. In contrast, the temporal consistency terms couple all pairs of adjacent frames, which implies that the energy has to be minimized simultaneously over the whole video volume. This prevents parallelization across frames. Furthermore, it is often the case that the computational cost of minimizing an energy jointly over T frames is much higher than T times the cost of minimizing the energy over a single frame (the minimization algorithm scales superlinearly or exponentially with the number of variables).
Review of Temporally Consistent Video Editing
Temporally consistent video editing methods can be classified according to the motion model used. The vast majority of professional video editing software is based on parametric closed-form motion models. Parametric models work under assumptions made on the geometry of the scene. The most common case is to assume that the scene is piece-wise planar [25,14]. In professional movie post-production, there are several commercial software programs that allow a visual effects artist to select a planar region, which is then tracked automatically by the software. Examples are mocha [22] and Nuke's planar tracker [11]. This model permits the computation of a simple mapping between any pair of frames, which can then be used to propagate information from one frame to another. When an object in the editing domain is not planar, the artist needs to segment it into pieces that can be approximated by a plane and attach a planar tracker to each of them. This process takes time, and the result often needs retouching to remove any seams between the different trackers.
On the other hand, non-parametric models do not make assumptions on the geometry of the scene. These models usually estimate the motion in the sequence by means of the optical flow. There has been considerable progress in optical flow computation in recent years. For example, state-of-the-art optical flow algorithms are able to deal with some large displacements and allow for sharp discontinuities in the movement. This is the case for [21,8,6,2], to name a few. These methods still suffer from the “aperture” problem: the component of the motion vector tangent to the image level line cannot be estimated. In practice, a smoothness term is incorporated to alleviate this problem. The smoothness term causes a filling-in effect leading to dense flow fields, even if the aperture problem is present.
In the following, the state of the art in temporally consistent video editing based on optical flow is reviewed. Although several optical-flow-based effects have been used in professional movie post-production [19], the use of optical flow for temporally consistent video editing is still marginal compared to the widespread use of planar trackers.
Examples of Energy Terms with Temporal Consistency
In this section, some models for temporal consistency that have been used in the literature are presented.
Throughout the text, boldface symbols will be used to indicate vector-valued quantities and matrices. Non-boldface symbols will indicate scalar-valued quantities. Note that no such distinction will be made when discussing examples of 1D videos; in these cases non-boldface symbols will be used.
A continuous spatio-temporal domain Ω×[0,T] is considered, where $\Omega \subset \mathbb{R}^2$ is a rectangular domain and T > 0, together with an editing domain $O \subset \Omega \times [0,T]$ with a smooth boundary. In some places in the text, to avoid cluttered equations, ΩT will be used as a notational shorthand for the video domain Ω×[0,T]. Temporal “slices” of $O$ are denoted by $O_t = \{x \in \Omega : (x,t) \in O\}$. Similarly, temporal slices of Ω×[0,T] are denoted by $\Omega_t$, t ∈ [0,T], representing the frames of the continuous video. An illustration of these domains can be seen in FIG. 1.
Let $u: \Omega \times [0,T] \to \mathbb{R}$ be a given scalar video and let $v: \Omega \times [0,T-1] \to \mathbb{R}^2$ be the corresponding motion field. The value of the motion field at (x,t) ∈ Ω×[0,T−1], v(x,t), represents the velocity of the projection of a particle in the 3D scene onto the image plane [12]. The trajectory of the particle can be obtained by solving the following ordinary differential equation (ODE):
$$\frac{dx}{dt}(t) = v(x(t), t) \qquad (2)$$
where t ∈ [0,T]. For simplicity it is assumed in this chapter that the functions can be differentiated as many times as needed.
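For illustration, a trajectory satisfying the ODE above can be approximated numerically. The sketch below uses the explicit Euler scheme, which is an assumption chosen for brevity rather than a method prescribed by the text; `integrate_trajectory` and `constant_flow` are hypothetical names, and the motion field is passed as a plain Python function instead of an estimated optical flow.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]


def integrate_trajectory(
    v: Callable[[Point, float], Point],
    x0: Point,
    t0: float,
    t1: float,
    steps: int,
) -> List[Point]:
    """Approximate dx/dt = v(x(t), t) on [t0, t1] with explicit Euler steps."""
    dt = (t1 - t0) / steps
    x, t = x0, t0
    path = [x]
    for _ in range(steps):
        vx, vy = v(x, t)                    # velocity at the current position
        x = (x[0] + dt * vx, x[1] + dt * vy)
        t += dt
        path.append(x)
    return path


# Hypothetical motion field: uniform translation, independent of x and t.
def constant_flow(x: Point, t: float) -> Point:
    return (1.0, 0.5)


path = integrate_trajectory(constant_flow, (0.0, 0.0), 0.0, 4.0, 8)
```

For a constant field the Euler steps follow the straight-line trajectory exactly; for a spatially varying field the step count controls the approximation error.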