Automated video analysis systems are often required to process streaming video which is transmitted or received in an incremental fashion, with a new frame being transmitted or received at some constant rate. An important preprocessing step in many such systems is video background modeling. Background modeling is a technique for extracting moving objects in video frames. More specifically, video background modeling consists of segmenting the moving objects or “foreground” from the static ones or “background”.
Streaming live video, such as in the case of sports, traffic, surveillance, etc., may utilize a popular method for video background modeling known as Principal Component Pursuit (PCP). The PCP optimization problem is defined by:
$$\arg\min_{L,S} \; \|L\|_* + \lambda \|S\|_1 \quad \text{s.t.} \quad D = L + S \qquad (1)$$

where $D \in \mathbb{R}^{m \times n}$ is the observed video of n frames, each of size $m = N_r \times N_c \times N_d$ (rows, columns, and depth or channels respectively), $L \in \mathbb{R}^{m \times n}$ is a low-rank matrix representing the background, $S \in \mathbb{R}^{m \times n}$ is a sparse matrix representing the foreground, $\|L\|_*$ is the nuclear norm of matrix L (i.e., $\sum_k |\sigma_k(L)|$, the sum of the singular values of L), and $\|S\|_1$ is the $\ell_1$ norm of S (seen as a long vector).
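As a minimal sketch of how the objective in problem (1) is evaluated, the toy example below builds a rank-1 background L (every frame shows the same image), adds a single foreground pixel in S, and computes the nuclear norm plus the weighted sparsity term. The names `nuclear_norm_rank1`, `l1_norm`, `background`, and `frame_weights` are illustrative assumptions, as is the weight λ = 1/√max(m, n), a common choice in the PCP literature that is not specified in the text above. The nuclear norm shortcut used here holds only for rank-1 matrices.

```python
import math

def nuclear_norm_rank1(u, v):
    """Nuclear norm of the rank-1 matrix u v^T, equal to ||u||_2 * ||v||_2."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return norm_u * norm_v

def l1_norm(matrix):
    """l1 norm of a matrix seen as one long vector."""
    return sum(abs(x) for row in matrix for x in row)

# Toy video: m = 4 pixels per frame, n = 3 frames (hypothetical values).
background = [3.0, 4.0, 0.0, 0.0]   # static background image, one value per pixel
frame_weights = [1.0, 1.0, 1.0]     # every frame shows the same background

# Low-rank component L = background * frame_weights^T (rank 1).
L = [[b * w for w in frame_weights] for b in background]

# Sparse component S: one moving-object pixel in the second frame.
S = [[0.0] * len(frame_weights) for _ in background]
S[2][1] = 5.0

# Observed video D = L + S, matching the constraint in (1).
D = [[l + s for l, s in zip(l_row, s_row)] for l_row, s_row in zip(L, S)]

m, n = len(D), len(D[0])
lam = 1.0 / math.sqrt(max(m, n))    # assumed weight, not taken from the text

objective = nuclear_norm_rank1(background, frame_weights) + lam * l1_norm(S)
print(objective)
```

A PCP solver searches over all pairs (L, S) with D = L + S for the pair that minimizes this value; the sketch only evaluates it for one feasible pair.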
Typically, current methods used to solve the PCP optimization problem (1) are based on splitting methods, such as the Augmented Lagrange Multiplier (ALM) method or its variants, in which the PCP problem (1) is solved via the problem:
$$\arg\min_{L,S,Y} \; \|L\|_* + \lambda \|S\|_1 + \langle Y, D - L - S \rangle + 0.5\,\mu \|D - L - S\|_F^2$$

whose solution requires a full or partial Singular Value Decomposition (SVD), depending on the ALM variant.
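To make the splitting concrete, the sketch below shows only the sparse-component step of an ALM-style iteration. With L and Y held fixed, and assuming the usual squared Frobenius penalty, minimizing the augmented Lagrangian over S has the closed-form solution of elementwise soft thresholding of D − L + Y/μ with threshold λ/μ. The function names `soft_threshold` and `update_sparse` and the toy values are assumptions for illustration; the corresponding L step, which is where the SVD enters, is not shown.

```python
def soft_threshold(x, t):
    """Shrink x toward zero by t: sign(x) * max(|x| - t, 0)."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def update_sparse(D, L, Y, lam, mu):
    """S step with L and Y fixed: shrink (D - L + Y/mu) by lam/mu, elementwise."""
    t = lam / mu
    return [
        [soft_threshold(d - l + y / mu, t) for d, l, y in zip(d_row, l_row, y_row)]
        for d_row, l_row, y_row in zip(D, L, Y)
    ]

# Toy data (hypothetical): 2 pixels x 2 frames, flat background of 1.0.
D = [[1.0, 1.2],
     [1.0, 6.0]]
L = [[1.0, 1.0],
     [1.0, 1.0]]
Y = [[0.0, 0.0],
     [0.0, 0.0]]

S_new = update_sparse(D, L, Y, lam=0.5, mu=1.0)
print(S_new)
```

Small residuals such as the 0.2 deviation are treated as noise and set to zero, while the large deviation survives in S, reduced by the threshold.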
Certain current PCP-type methods include automated Recursive Projected CS (ReProCS), Grassmannian Robust Adaptive Subspace Tracking Algorithm (GRASTA), a smoothed p-norm Robust Online Subspace Tracking method (pROST), and Grassmannian Online Subspace Updates with Structured-sparsity (GOSUS). However, some of these include a batch initialization. Specifically, ReProCS is not a real-time algorithm, nor can it process real videos where multiple moving objects enter and leave the field of view of the camera. Moreover, ReProCS assumes a known model for the motion of the video's moving objects and uses a batch PCP method in its initialization step, which can be computationally costly.
GRASTA is presented as an “online” algorithm for low-rank subspace tracking: it uses a reduced number of frames, compared to the PCP problem (1), to estimate an initial low-rank sub-space representation of the background, and then processes one frame (which can be spatially sub-sampled) at a time. It must be emphasized that this procedure is not fully incremental, since it uses a temporally sub-sampled version of all the available frames for initialization. Although GRASTA can estimate and track non-stationary backgrounds, its initialization step can have a relatively high complexity.
pROST is very similar to the GRASTA algorithm, but instead of using an $\ell_1$ norm of the singular values to estimate the low-rank sub-space representation of the background, it uses an $\ell_p$ norm (p<1). It has been shown that pROST can outperform GRASTA in the case of dynamic backgrounds.
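The difference between the two penalties can be illustrated numerically. The sketch below compares the sum of |x|^p for p = 1 and p = 0.5 on two vectors with the same total magnitude, one spread out and one concentrated in a single entry. The function name `lp_penalty` and the test vectors are assumptions for illustration, and this is the plain, unsmoothed penalty rather than the smoothed version pROST is named after.

```python
def lp_penalty(values, p):
    """Sum of |x|^p over the entries; for p = 1 this is the l1 norm."""
    return sum(abs(x) ** p for x in values)

spread = [1.0, 1.0, 1.0, 1.0]        # magnitude spread over all entries
concentrated = [4.0, 0.0, 0.0, 0.0]  # same total magnitude in one entry

l1_spread = lp_penalty(spread, 1.0)
l1_concentrated = lp_penalty(concentrated, 1.0)
half_spread = lp_penalty(spread, 0.5)
half_concentrated = lp_penalty(concentrated, 0.5)

print(l1_spread, l1_concentrated)      # the l1 penalty cannot tell them apart
print(half_spread, half_concentrated)  # p < 1 favors the concentrated vector
```

With p = 1 both vectors cost the same, whereas with p < 1 the concentrated vector is cheaper, so the penalty pushes more strongly toward solutions with few non-zero entries.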
Similarly, GOSUS is closely related to GRASTA. However, GOSUS enforces structured/group sparsity on the sparse component and uses a small number of frames from the initial part of the video to be analyzed for its batch initialization stage, and then proceeds to update the background. Although GOSUS is known to have better tracking properties than GRASTA, its computational cost is higher. Furthermore, computational results suggest that its complexity does not depend linearly on the number of pixels in the analyzed video frame, but is influenced by the number of moving objects.
While PCP is currently considered to be a superior method for video background modeling, it suffers from a number of limitations, including high computational cost, batch processing, and sensitivity to camera jitter.
The high computational cost is dominated by a partial Singular Value Decomposition (SVD) computation at each major outer loop, with a cost of O(m·n·r) where r=rank(L).
Batch processing requires a large number of frames before any processing can begin, and incurs a significant overhead of memory transfers due to the typical size of matrix D. For example, in the case of a 400 frame (13.3 seconds at 30 fps) 640×480 color video, the size of D is 921600×400, equivalent to 2.95 gigabytes (GB) in double floating-point representation. Likewise, in the case of a 900 frame (36 seconds at 25 fps) 1920×1088 (HD) color video, the size of D is 6266880×900, equivalent to 45.12 GB in double floating-point representation.
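The figures above follow from simple arithmetic, reproduced in the short sketch below. The function name `matrix_size_gb` is an assumption for illustration; it assumes three color channels, 8 bytes per double-precision value, and decimal gigabytes (10^9 bytes).

```python
def matrix_size_gb(rows, cols, channels, frames, bytes_per_value=8):
    """Return (m, size in GB) for a matrix D with one vectorized frame per column."""
    m = rows * cols * channels               # pixels per frame, all channels
    total_bytes = m * frames * bytes_per_value
    return m, total_bytes / 1e9

vga_m, vga_gb = matrix_size_gb(rows=480, cols=640, channels=3, frames=400)
hd_m, hd_gb = matrix_size_gb(rows=1088, cols=1920, channels=3, frames=900)

print(vga_m, round(vga_gb, 2))
print(hd_m, round(hd_gb, 2))
```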
Furthermore, most PCP algorithms, either batch or online, have a high sensitivity to camera jitter, which can affect airborne and space-based sensors as well as fixed ground-based cameras subject to wind. In this context, Robust Alignment by Sparse and Low-rank decomposition (RASL) and Transformed Grassmannian robust adaptive subspace tracking algorithm (t-GRASTA) are known to be robust to camera jitter. However, RASL is a batch method, whereas t-GRASTA is not fully incremental, needing a batch and computationally expensive initialization.
Moreover, RASL was introduced as a batch PCP method able to handle misaligned video frames by solving
$$\arg\min_{L,S,\tau} \; \|L\|_* + \lambda \|S\|_1 \quad \text{s.t.} \quad \tau(D) = L + S$$

where $\tau(\cdot) = \{\tau_k(\cdot)\}$ is a set of independent transformations (one per frame), each having a parametric representation, such that $\tau(D)$ aligns all the observed video frames. RASL handles the non-linearity of the previous equation via
$$\arg\min_{L,S,\tau} \; \|L\|_* + \lambda \|S\|_1 \quad \text{s.t.} \quad \tau(D) + \sum_{k=1}^{n} J_k \, \Delta\tau_k \, \varepsilon_k = L + S$$

where $J_k$ is the Jacobian of frame k with respect to transformation k and $\varepsilon_k$ denotes the standard basis for real numbers. RASL's computational results mainly focus on rigid transformations; it is also known that as long as the initial misalignment is not too large, RASL effectively recovers the correct transformations.
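The linearized constraint can be read as a first-order Taylor expansion applied frame by frame. The following is a sketch under assumed notation: $d_k$, denoting the k-th frame (the k-th column of D), is a symbol introduced here for illustration and does not appear in the text above.

$$
(\tau_k + \Delta\tau_k)(d_k) \;\approx\; \tau_k(d_k) + J_k \, \Delta\tau_k,
\qquad
J_k = \frac{\partial \, \tau_k(d_k)}{\partial \, \tau_k}
$$

Collecting the n per-frame corrections, with $\varepsilon_k$ serving to place the k-th correction in column k, yields a constraint that is linear in the increments $\Delta\tau_k$. Because the expansion is accurate only for small $\Delta\tau_k$, it is consistent with the observation that the correct transformations are recovered when the initial misalignment is not too large.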
Thus, there is a need for an improved, more efficient method for video background modeling than existing methods. The invention satisfies this need.