The following relates to video camera-based systems and to the video classification, processing, and archiving arts, and related arts, and finds particular application in connection with a system and method for generating a representation of a video which can be used for classification.
Video classification is the task of identifying the content of a video by tagging it with one or more class labels that best describe its content. Action recognition can be seen as a particular case of video classification, where the videos of interest contain humans performing actions. The task is then to correctly label which actions, if any, are being performed in each video. Classifying human actions in videos has many applications, such as in multimedia, surveillance, and robotics (Vrigkas, et al., “A review of human activity recognition methods,” Frontiers in Robotics and AI 2, pp. 1-28 (2015), hereinafter, Vrigkas 2015). The complexity of the task arises from the variability of imaging conditions, motion, appearance, context, and interactions with persons, objects, or the environment over time and space.
Existing algorithms for action recognition are often based on statistical models learned from manually labeled videos. They use models relying on features that are hand-crafted for action recognition or on end-to-end deep architectures, such as neural networks. These approaches have complementary strengths and weaknesses. Models based on hand-crafted features are data efficient, as they can easily incorporate structured prior knowledge (e.g., the relevance of motion boundaries along dense trajectories (Wang, et al., “Action recognition by dense trajectories,” CVPR (2011), hereinafter, Wang 2011)). However, their lack of flexibility may impede their robustness or modeling capacity. Deep models make fewer assumptions and are learned end-to-end from data (e.g., using 3D-ConvNets (Tran, et al., “Learning spatiotemporal features with 3D convolutional networks,” CVPR (2014), hereinafter, Tran 2014)). However, they rely on hand-crafted architectures and on the acquisition of large manually labeled video datasets (Karpathy, et al., “Large-scale video classification with convolutional neural networks,” CVPR (2014)), a costly and error-prone process that poses optimization, engineering, and infrastructure challenges.
There remains a need for a system and method that provides improved results for video classification.