The present application relates to systems and methods for automatic human action recognition.
Recognizing human actions in real-world environments has applications in a variety of domains, including intelligent video surveillance, customer attributes, and shopping behavior analysis. However, accurate recognition of actions is highly challenging because of cluttered backgrounds, occlusions, and viewpoint variations. Most existing approaches therefore make certain assumptions (e.g., small scale and viewpoint changes) about the circumstances under which the video was taken, and such assumptions seldom hold in real-world environments. In addition, most of these approaches follow the conventional two-step paradigm of pattern recognition: the first step computes complex handcrafted features from raw video frames, and the second step learns classifiers based on the obtained features. In real-world scenarios, it is rarely known which features are important for the task at hand, since the choice of features is highly problem-dependent. For human action recognition in particular, different action classes may differ dramatically in their appearance and motion patterns.
Deep learning models are a class of machines that can learn a hierarchy of features by building high-level features from low-level ones, thereby automating the process of feature construction. Such learning machines can be trained using either supervised or unsupervised approaches, and the resulting systems have been shown to yield competitive performance in visual object recognition, natural language processing, and audio classification tasks. Convolutional neural networks (CNNs) are a type of deep model in which trainable filters and local neighborhood pooling operations are applied alternately to the raw input images, resulting in a hierarchy of increasingly complex features. It has been shown that, when trained with appropriate regularization, CNNs can achieve superior performance on visual object recognition tasks without relying on handcrafted features. In addition, CNNs have been shown to be relatively insensitive to certain variations in the inputs.
In 2D CNNs, 2D convolution is performed at the convolutional layers to extract features from a local neighborhood on the feature maps in the previous layer. An additive bias is then applied, and the result is passed through a sigmoid function (here, the hyperbolic tangent). Formally, the value of the unit at position (x,y) in the j-th feature map in the i-th layer, denoted as $v_{ij}^{xy}$, is given by
$$v_{ij}^{xy} = \tanh\left( b_{ij} + \sum_{m} \sum_{p=0}^{P_i-1} \sum_{q=0}^{Q_i-1} w_{ijm}^{pq} \, v_{(i-1)m}^{(x+p)(y+q)} \right), \tag{1}$$

where $\tanh(\cdot)$ is the hyperbolic tangent function, $b_{ij}$ is the bias for this feature map, $m$ indexes over the set of feature maps in the (i−1)-th layer connected to the current feature map, $w_{ijm}^{pq}$ is the value at position (p,q) of the kernel connected to the m-th feature map, and $P_i$ and $Q_i$ are the height and width of the kernel, respectively. In the subsampling layers, the resolution of the feature maps is reduced by pooling over a local neighborhood on the feature maps in the previous layer, thereby increasing invariance to distortions in the inputs. A CNN architecture can be constructed by stacking multiple layers of convolution and subsampling in an alternating fashion. The parameters of the CNN, such as the bias $b_{ij}$ and the kernel weights $w_{ijm}^{pq}$, are usually trained using either supervised or unsupervised approaches.
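By way of illustration only, the convolution of Eq. (1) and a subsequent subsampling step can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the implementation of any particular embodiment: the function names `conv2d_layer` and `subsample` are hypothetical, every output feature map is assumed to be connected to every feature map of the previous layer, only positions where the kernel fits entirely inside the input are computed, and average pooling over non-overlapping neighborhoods is assumed because the text does not fix a particular pooling operation.

```python
import math


def conv2d_layer(prev_maps, kernels, biases):
    """Evaluate Eq. (1) for every output feature map j of layer i.

    prev_maps: feature maps of layer i-1, each a list of rows (x = row, y = column).
    kernels:   kernels[j][m] is the P_i x Q_i kernel linking input map m to output map j.
    biases:    biases[j] is the additive bias b_ij of output map j.
    Assumption: each output map is connected to all maps of the previous layer.
    """
    height, width = len(prev_maps[0]), len(prev_maps[0][0])
    out_maps = []
    for j, bias in enumerate(biases):
        P, Q = len(kernels[j][0]), len(kernels[j][0][0])
        out = []
        # Assumption: only positions where the kernel fits inside the input.
        for x in range(height - P + 1):
            row = []
            for y in range(width - Q + 1):
                total = bias
                for m, prev in enumerate(prev_maps):
                    for p in range(P):
                        for q in range(Q):
                            total += kernels[j][m][p][q] * prev[x + p][y + q]
                row.append(math.tanh(total))
            out.append(row)
        out_maps.append(out)
    return out_maps


def subsample(feature_map, size=2):
    """Reduce resolution by averaging non-overlapping size x size neighborhoods.

    Assumption: average pooling; the text does not specify the pooling operation.
    """
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            sum(
                feature_map[r * size + dr][c * size + dc]
                for dr in range(size)
                for dc in range(size)
            ) / (size * size)
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Usage: one 5x5 input map, one 2x2 kernel, one output map, then 2x2 pooling.
image = [[float(r + c) / 10.0 for c in range(5)] for r in range(5)]
kernel = [[0.5, -0.5], [0.25, 0.25]]
conv_maps = conv2d_layer([image], [[kernel]], [0.1])  # one 4x4 map
pooled_maps = [subsample(fm, 2) for fm in conv_maps]  # one 2x2 map
```

Stacking further calls to `conv2d_layer` and `subsample` on `pooled_maps` corresponds to the alternating arrangement of convolution and subsampling layers described above. The kernel and bias values here are arbitrary placeholders; in practice they would be obtained by training.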