1. Technical Field
This invention relates to methods for video abstraction and segmentation, and more particularly to selecting parts from a video stream that represent significant portions of the overall stream.
2. Related Art
The general objective of video abstraction is to extract compact and meaningful aspects of content from a longer video program (e.g. a news broadcast, a feature movie, or a home video). The extracted content could be static: for instance a series of still images that could be combined into a storyboard, or dynamic: for instance a series of shorter video segments that could be combined into an abridged video clip. The aim is to preserve in the extracted content the inherent flavours and main messages of the original program. Video segmentation involves splitting a longer video program into a series of sub-programs which together include all the content of the longer program. The aim is for each sub-program (segment) to consist of a thematic unit of the longer program: for example each segment could consist of a single scene, shot or semantic event of the longer program.
Abstraction and segmentation involve similar technical problems. Once a video has been segmented into its constituent units, it is often then readily possible to decide how to extract representative frames for each constituent unit.
Automatic video abstraction is one of the most significant current topics in multimedia information management. As the volume of stored video data increases, and improvements in internet technology make increasing amounts of video more readily available, it is becoming increasingly important to be able to browse video data and quickly access specific content within a long video stream. An effective method of video abstraction would enable easy and fast content access and browsing of a large video database, as well as facilitating content-based video indexing and retrieval. It would allow end users to interact remotely with selected video elements (e.g., to select and view a particular shot, event, or scene) as depicted in the structured storyboard, and to view the whole video stream in a non-linear fashion. This would be useful for accessing on-line video data, for managing home videos and for sharing video data effectively among friends. It would be particularly appealing to users of devices such as mobile phones and PDAs (personal digital assistants) which, due to their limited communication bandwidth and small display size, make it difficult for a user to download and view a long video program.
There has been great interest in video abstraction over the last decade. Reference [12] is a review of general issues in the field. Most recently, efforts have been directed to improving automatic abstraction by designing algorithms that take account of generalized characteristics of specific types of video media. (See, for example, reference [17]). One example of such a type of video is the unstructured home video, which has been studied in references [4] and [15]. Another approach has been to study video summarization together with video segmentation, which can be cast as a data analysis problem to be solved within a data clustering framework. (See, for example, reference [9]). Lo and Wang (reference [13]) used a histogram-based method to cluster video sequences. A fuzzy clustering method has been used for video segmentation by Joshi et al.; see reference [10]. Recently-developed spectral clustering techniques have also been used for video segmentation; see reference [19].
Data clustering in a very high dimensional space encounters the problem known as the ‘curse of dimensionality’. Most video analysis occupies a high dimensional space because of the sheer number of pixels involved in a typical image. This is so even after some conventional feature extraction process. For example, when an 8×8×8 HSV color histogram (which is nowadays considered to be very coarse) is used to represent an image, the feature space will still have 512 dimensions. Currently there are two types of solution to the dimensionality problem. One is the use of approximate nearest neighbor search methods; see reference [8].
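To make the dimensionality figure concrete, the following is a minimal sketch (not taken from any cited reference) of how an 8×8×8 HSV colour histogram yields a 512-dimensional feature vector per frame; the function name and the synthetic frame are illustrative assumptions.

```python
import numpy as np

def hsv_histogram(hsv_image, bins=8):
    """Quantise an HSV image into a bins x bins x bins colour histogram.

    hsv_image: (H, W, 3) float array with each channel scaled to [0, 1).
    Returns a flattened, normalised histogram of length bins**3
    (512 dimensions for the 8x8x8 case discussed in the text).
    """
    hist, _ = np.histogramdd(
        hsv_image.reshape(-1, 3),
        bins=(bins, bins, bins),
        range=((0, 1), (0, 1), (0, 1)),
    )
    hist = hist.ravel()
    return hist / hist.sum()  # normalise so the entries sum to 1

# Example: a random 64x48 'frame' already converted to HSV in [0, 1)
rng = np.random.default_rng(0)
frame = rng.random((48, 64, 3))
h = hsv_histogram(frame)
print(h.shape)  # (512,) -- one feature dimension per colour bin
```

Even this very coarse quantisation places every frame in a 512-dimensional feature space, which is the setting in which the 'curse of dimensionality' arises.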
The other is the use of dimensionality reduction techniques; see references [2] and [16]. Among the latter, the traditional principal component analysis (PCA), linear discriminant analysis (LDA) and multidimensional scaling (MDS) methods have been well studied. Recently, the method known as locality preserving projection (LPP) has been proposed (see references [6] and [7]). This is a linear projection method. Research has also focused on the so-called spectral clustering method (see reference [14]), which falls into the class of graph-based methods. This class also includes the ISOMAP technique, locally linear embedding (LLE), Laplacian eigenmaps and so on, though these methods yield maps that are defined only on the training data points and it is unclear how to evaluate the maps on novel test data points.
In the field of image and video analysis, the one-dimensional projection methods mentioned above may be expected not to work efficiently, because image and video analysis usually requires the computation of a very high dimensional covariance matrix and the related eigen-decomposition. Recently, Yang and Zhang (see reference [18]) proposed a method known as two-dimensional principal component analysis (2D-PCA).
The algorithms used for two-dimensional principal component analysis (2D-PCA) will now be discussed.
Let w be an n-dimensional unitary column vector, and Ai be an (m×n)-dimensional random image matrix. Projecting Ai onto w by the following linear projection, we obtain an m-dimensional projected feature vector:

xi = Aiw, i=1,2, . . . ,N  (1)
The 2D-PCA process seeks a projection direction w which maximizes the total scatter of the resulting projected feature vectors {xi}. To measure this, Yang et al. (reference [18]) chose the following criterion, called the generalized total scatter criterion:

J(w) = tr(Sw) = wTGtw,  (2)

where Sw is the covariance matrix of the projected feature vectors {xi} of the training samples, tr(Sw) is the trace of Sw, and Gt is the image covariance (scatter) matrix:
Gt = (1/N) Σj=1..N (Aj − Ā)T(Aj − Ā)  (3)
Then the set of optimal projection vectors, w1, w2, . . . , wd, of 2D-PCA are the orthonormal eigenvectors of Gt corresponding to the first d largest eigenvalues. Note that each principal component of 2D-PCA is a vector, whereas each principal component of PCA is a scalar.
The principal component vectors obtained are used to form an m×d matrix Xi=[xi(1), xi(2), . . . , xi(d)], which is called the feature matrix or feature image of the image sample Ai.
This 2D-PCA method reduces computational complexity greatly compared with that of traditional PCA and also improves image/face recognition rates in the comparative studies carried out.
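The 2D-PCA procedure of equations (1)–(3) can be sketched numerically as follows; this is a minimal illustration under assumed synthetic data, not the implementation of any cited reference.

```python
import numpy as np

def two_d_pca(images, d):
    """Two-dimensional PCA following equations (1)-(3): each image A_i
    is an (m x n) matrix; the projection axes are the eigenvectors of
    the (n x n) image covariance matrix Gt for the d largest
    eigenvalues, and each image is represented by the (m x d) feature
    matrix X_i = A_i W."""
    A = np.asarray(images, dtype=float)      # shape (N, m, n)
    centred = A - A.mean(axis=0)             # subtract the mean image
    # Image covariance (scatter) matrix, equation (3):
    # Gt = (1/N) sum_j (A_j - Abar)^T (A_j - Abar), size n x n
    G = np.einsum('kji,kjl->il', centred, centred) / len(A)
    eigvals, eigvecs = np.linalg.eigh(G)     # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :d]              # top-d orthonormal axes (n x d)
    return W, A @ W                          # feature matrices, each (m x d)

# Example with synthetic 'frames' (hypothetical data, for illustration only)
rng = np.random.default_rng(0)
frames = rng.random((20, 32, 24))            # N=20 images of size 32 x 24
W, features = two_d_pca(frames, d=5)
print(W.shape, features.shape)               # (24, 5) (20, 32, 5)
```

Note that the eigen-decomposition here is of an n×n matrix (24×24 in the example) rather than of an mn×mn matrix as in conventional PCA, which is the source of the computational saving described above.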
2D-PCA is relatively efficient for projecting images since the size of the covariance matrix is reduced to the square of the number of image columns. Yang and Zhang demonstrated that the 2D-PCA method is computationally more efficient than that of conventional PCA and achieves better performance for face recognition. Following the same thread, Li and Yuan (see reference [11]) proposed a method known as two-dimensional linear discriminant analysis (2D-LDA). Li and Yuan compared 2D-LDA with 2D-PCA and other one-dimensional projection methods and found that 2D-LDA gives better pattern discriminant performance.
Furthermore, a one-dimensional LPP (locality preserving projection) method has been proposed. He et al. proposed in reference [7] the following objective function, which seeks to preserve the intrinsic geometry and local structure of the data:
min Σij (yi − yj)2 Sij,  (4)

where yi=wTxi is the one-dimensional representation of the original data vector xi (which can be an mn-dimensional vector representation of an m×n image) and S is a similarity matrix, which can be a Gaussian-weighted or uniformly weighted Euclidean distance using k-nearest neighbours or an ε-neighbourhood, i.e.,
Sij = exp(−‖xi − xj‖2/t) if ‖xi − xj‖2 < ε, and 0 otherwise,  (5)

or

Sij = exp(−‖xi − xj‖2/t) if xi is among the k nearest neighbours of xj or xj is among the k nearest neighbours of xi, and 0 otherwise.  (6)
By imposing the constraint yTDy = 1, i.e. wTXDXTw = 1, where X=[x1, x2, . . . , xN] is an mn×N matrix and D is a diagonal matrix whose entries are the column sums of S, Dii=ΣjSji, and letting L=D−S, the minimisation problem becomes:
arg min wTXLXTw subject to wTXDXTw = 1,  (7)

such that the optimal projection axis w is given by the minimum-eigenvalue solution to the generalised eigenvalue problem:

XLXTw = λXDXTw  (8)
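The LPP procedure of equations (4)–(8) can be sketched numerically as follows. This is a minimal illustration, not the method of any cited reference: it assumes the ε-neighbourhood weighting of equation (5) with the conventional negative exponent of the heat kernel, and assumes XDXT is nonsingular (which holds here because the number of samples exceeds the data dimension, unlike the face-recognition setting discussed below).

```python
import numpy as np

def lpp(X, t=1.0, eps=1.0, d=1):
    """Locality preserving projection, equations (4)-(8).

    X: (p, N) data matrix, one column per sample (p = mn for images).
    Returns the (p x d) projection W whose columns solve the generalised
    eigenvalue problem X L X^T w = lambda X D X^T w for the d smallest
    eigenvalues."""
    diff = X[:, :, None] - X[:, None, :]            # pairwise differences
    sq = (diff ** 2).sum(axis=0)                    # ||x_i - x_j||^2
    S = np.where(sq < eps, np.exp(-sq / t), 0.0)    # equation (5)
    np.fill_diagonal(S, 0.0)                        # no self-loops
    D = np.diag(S.sum(axis=0))                      # D_ii = sum_j S_ji
    L = D - S                                       # graph Laplacian
    A = X @ L @ X.T
    B = X @ D @ X.T
    # Reduce the generalised problem (8) to a standard symmetric one
    # via B^{-1/2}; this assumes B = XDX^T is nonsingular.
    vals_B, vecs_B = np.linalg.eigh(B)
    B_inv_sqrt = vecs_B @ np.diag(1.0 / np.sqrt(vals_B)) @ vecs_B.T
    vals, vecs = np.linalg.eigh(B_inv_sqrt @ A @ B_inv_sqrt)
    return B_inv_sqrt @ vecs[:, :d]                 # d smallest eigenvalues

# Example with hypothetical low-dimensional data (60 samples in 5-D),
# chosen so that XDX^T is safely nonsingular
rng = np.random.default_rng(1)
X = rng.random((5, 60))
W = lpp(X, t=1.0, eps=10.0, d=2)
print(W.shape)  # (5, 2)
```

The returned columns are normalised so that wTXDXTw = 1, matching the constraint imposed before equation (7).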
He et al. pointed out the relationship between LPP and both PCA and LDA. If the similarity matrix S has uniform weight, Sij=1/N2, ∀i,j, ε (or k) is taken to infinity, and the eigenvectors of XLXT associated with the largest eigenvalues are chosen, then the data are projected along the directions of maximal variance, as in PCA. If S has within-class uniform weight (Sij=1/Nk if xi and xj both belong to the kth class, and 0 otherwise) and the data are centralised, then LPP reduces to an LDA algorithm. In implementations of the present invention those conditions are preferably not satisfied.
As with LDA, the application of the LPP algorithm to face recognition is confronted with the problem of singularity of XDXT. He et al. (reference [7]) chose to perform PCA prior to LPP, calling this method Laplacianfaces by analogy with Fisherfaces (see reference [1]). However, this suffers from the problem that, after the PCA step, it is unknown whether LPP will preserve the local information of the original data. The inventors of the present invention have found that the first several eigenvectors, associated with the largest eigenvalues, mainly contain lighting variation and facial expression information. Experiments have shown that discarding the first three eigenvectors, associated with the three largest eigenvalues, actually tends to improve the recognition rate. Like LDA, LPP also faces a problem of high computational complexity.
More recently Chen et al. derived the two-dimensional locality preserving projection (2D-LPP) method (see reference [3]) and applied it to face recognition tasks with notably favourable results.
There is a need for a method of abstracting video data that is more computationally efficient and/or more representative of the perceptive qualities of the underlying data than prior methods.