Content-based video classification is fundamental to intelligent video analytics (IVA) and includes automatic categorization, searching, indexing, segmentation, and retrieval of videos. It has been applied to a wide range of real-world applications, for instance, multimedia event detection, semantic indexing, and gesture control. However, recognizing unconstrained videos is a challenging task because (i) the appropriate video representation can be task dependent, e.g., for coarse (“swim” vs. “run”) or fine-grained (“walk” vs. “run”) categorization, (ii) there may be multiple streams of information that need to be taken into account, such as actions, objects, and scenes, and (iii) there are large intra-class variations, which arise from diverse viewpoints, occlusions, and backgrounds. As the core content of videos, visual cues provide the most significant information for video classification.
Recently, deep convolutional neural networks (CNNs) have proven to be effective for action recognition and video classification. Although significant progress has been achieved in recent years in feature learning by deep neural networks, the features extracted by the neural networks do not have the same discriminative capability over all classes. Therefore, conventional video classification techniques adaptively combine a set of complementary features. The conventional techniques focus on short-term information because the representations of the complementary features are learned over short time durations. The short-term information is insufficient for video classification because complex events are better described by leveraging the temporal evolution of short-term contents. Consequently, there is no single, unified solution for all classes of videos. There is a need for addressing these issues and/or other issues associated with the prior art.