This invention relates to computation of kernel descriptors, and in particular to fast matching of image patches using fast computation of kernel descriptors.
The widespread availability of cheap hand-held cameras and video sharing websites has resulted in massive amounts of video content online. The ability to rapidly analyze and summarize content from such videos enables a wide range of applications, and significant effort has been made in recent literature to develop such techniques. However, the sheer volume of such content, together with the inherent difficulty of analyzing videos, poses significant scalability challenges, for instance in applying the successful “bag-of-words” approaches used in image retrieval.
Features such as STIP and HoG3D that extend image level features to the spatio-temporal domain have shown promise in recognizing actions from unstructured videos. These features discretize the gradient or optical flow orientations into a d-dimensional indicator vector δ(z)=[δ1(z), . . . , δd(z)] with
\[
\delta_i(z) =
\begin{cases}
1 & \text{if } \left\lfloor \dfrac{d\,\theta(z)}{2\pi} \right\rfloor = i-1\\
0 & \text{otherwise}
\end{cases}
\]
where θ(z) is the gradient (or optical flow) orientation at pixel z.
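The hard-quantization rule above can be sketched as follows. The function name `orientation_indicator` and its arguments are illustrative, not from the source:

```python
import numpy as np

def orientation_indicator(theta, d):
    """Hard-quantize an orientation theta in [0, 2*pi) into a
    d-dimensional indicator vector: the single bin containing
    theta gets 1, every other entry stays 0."""
    delta = np.zeros(d)
    bin_index = int(np.floor(d * theta / (2.0 * np.pi)))  # equals i - 1 for bin i
    delta[bin_index] = 1.0
    return delta
```

With d = 8, an orientation of π falls into the fifth bin (index 4), since ⌊8·π/(2π)⌋ = 4.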
Despite their success, these features are generally hand-designed and do not fully exploit the information available when measuring patch similarity. In recent work, several efforts have been made to develop principled approaches to design and learn such low-level features. For example, a convolutional GRBM method has been proposed to extract spatio-temporal features using a multi-stage architecture. Also, a convolutional independent subspace analysis (ISA) network has been proposed to extract patch level features from pixel attributes.
These deep learning approaches are in effect mapping pixel attributes into patch-level features using a hierarchical architecture. A two-layer hierarchical sparse coding scheme has been used for learning image representations at the pixel level. An orientation histogram in effect uses a pre-defined d-dimensional codebook that divides the θ space into uniform bins, and uses hard quantization for projecting pixel gradients. Another scheme allows data-driven learning of pixel-level dictionaries; the pixel features are projected onto the learned dictionary using sparse coding to give a vector W(z)=(w1(z), . . . , wd(z)). After pooling such pixel-level projections within local regions, the first-layer codes are passed to the second layer for jointly encoding signals in the region. The orientation histograms and hierarchical sparse coding in effect define the following kernel for measuring the similarity between two patches P and Q:
\[
K(P, Q) = F_h(P)^{T} F_h(Q) = \sum_{z \in P} \sum_{z' \in Q} \tilde{m}(z)\, \tilde{m}(z')\, \Phi(z)^{T} \Phi(z')
\]
where
Fh(P)=Σz∈P m̃(z)Φ(z) is the patch sum,
m̃(z)=m(z)/√(Σz∈P m(z)²+εg) is the normalized gradient magnitude, with εg a small constant, and
Φ(z)=δ(z) for HoG and Φ(z)=W(z) for hierarchical sparse coding.
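This patch kernel can be sketched directly from the definitions above; all function and variable names here are illustrative. Because K(P, Q) factors as an inner product of patch sums, it can be evaluated without the explicit double sum over pixel pairs:

```python
import numpy as np

def normalized_magnitudes(m, eps_g=1e-8):
    """m~(z) = m(z) / sqrt(sum_{z in P} m(z)^2 + eps_g), per patch."""
    return m / np.sqrt(np.sum(m ** 2) + eps_g)

def patch_feature(m, Phi):
    """Patch sum F_h(P) = sum_{z in P} m~(z) Phi(z).
    m: (n_pixels,) gradient magnitudes; Phi: (n_pixels, d) pixel features."""
    return (normalized_magnitudes(m)[:, None] * Phi).sum(axis=0)

def patch_kernel(m_p, Phi_p, m_q, Phi_q):
    """K(P, Q) = F_h(P)^T F_h(Q)."""
    return patch_feature(m_p, Phi_p) @ patch_feature(m_q, Phi_q)
```

The factored form on the last line agrees, term by term, with the explicit double sum Σz∈P Σz′∈Q m̃(z) m̃(z′) Φ(z)ᵀΦ(z′).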
Kernel descriptors have been proposed to generalize these approaches by replacing the product Φ(z)ᵀΦ(z′) above with a match kernel k(z, z′), which allows one to induce arbitrary feature spaces Φ(z) (including infinite-dimensional ones) from pixel-level attributes. This provides a powerful framework for designing rich low-level features and has shown state-of-the-art results for image and object recognition.
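A minimal sketch of this generalization follows, assuming for illustration a Gaussian kernel over per-pixel attribute vectors a(z); the specific kernel and all names here are assumptions for the example, not the source's method. Evaluated naively, the match kernel requires one kernel evaluation per pixel pair:

```python
import numpy as np

def match_kernel_similarity(m_p, a_p, m_q, a_q, gamma=5.0, eps_g=1e-8):
    """K(P, Q) = sum_{z in P} sum_{z' in Q} m~(z) m~(z') k(z, z'),
    with k(z, z') = exp(-gamma * ||a(z) - a(z')||^2), a Gaussian
    match kernel on pixel attributes a(z) (illustrative choice).
    m_p: (|P|,) magnitudes; a_p: (|P|, dim) attributes; likewise for Q.
    Naive cost: O(|P| * |Q|) kernel evaluations."""
    mt_p = m_p / np.sqrt(np.sum(m_p ** 2) + eps_g)
    mt_q = m_q / np.sqrt(np.sum(m_q ** 2) + eps_g)
    # pairwise squared attribute distances, shape (|P|, |Q|)
    sq_dists = ((a_p[:, None, :] - a_q[None, :, :]) ** 2).sum(axis=-1)
    k = np.exp(-gamma * sq_dists)  # table of k(z, z') values
    return mt_p @ k @ mt_q
```

Unlike the finite-dimensional product Φ(z)ᵀΦ(z′), this kernel does not factor into per-patch sums of explicit features, so the quadratic number of pixel-pair evaluations cannot be avoided without further approximation.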
A significant limitation of kernel descriptors is that kernel computations are generally costly, making the descriptors slow to extract from densely sampled video patches.