In computer vision and camera surveillance applications, a frequent problem is recognizing and detecting certain actions performed by objects such as people, machinery, vehicles, robots, etc. There has been a fair amount of work on the general problem of analyzing actions in videos, but most prior work has focused on action recognition rather than on action detection.
Action recognition refers to classifying, i.e., recognizing, which action is being performed in a video segment that has been temporally trimmed so that the segment starts at or near the beginning of an action and ends at or near the end of the action. We use the term temporally trimmed to refer to such video segments. Action detection refers to a temporal or spatio-temporal localization of every occurrence of each action from a known set of action classes occurring in a long, i.e., not temporally trimmed, video sequence.
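The difference in output between the two tasks can be made concrete with a minimal sketch. It assumes, purely for illustration, that a class label is already available for every frame of an untrimmed video; the function name `frames_to_segments`, the `"none"` background label, and the example labels are hypothetical and do not come from any particular method. Recognition would return one label for a trimmed segment, whereas detection returns every occurrence together with its temporal extent.

```python
from itertools import groupby


def frames_to_segments(frame_labels, background="none"):
    """Collapse per-frame labels into (label, start, end) detections.

    start is inclusive and end is exclusive, both as frame indices.
    Frames carrying the background label are not reported.
    """
    segments = []
    index = 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        if label != background:
            segments.append((label, index, index + length))
        index += length
    return segments


# Hypothetical per-frame labels for a short untrimmed video.
untrimmed = ["none", "chop", "chop", "chop", "none", "peel", "peel", "none", "chop"]
print(frames_to_segments(untrimmed))
# [('chop', 1, 4), ('peel', 5, 7), ('chop', 8, 9)]
```

Note that the same class ("chop") is reported twice, once per occurrence, which is what distinguishes detection from assigning a single label to the whole video.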
Early work on action detection includes methods that detect walking people by analyzing appearance and motion patterns. Several methods are known for detecting actions using spatio-temporal interest points, multiple instance learning, or part-based models.
Related to action recognition is the task of activity recognition. In an activity recognition task, a video segment that depicts an activity, such as a particular sport being played, is analyzed, and the goal is to determine which activity (e.g., which sport) is depicted in the video.
Fine-grained action detection refers to action detection in which the differences among the classes of actions to be detected are small. For instance, in a cooking scenario, detecting actions from a set that includes similar actions such as chopping, grating, and peeling is an example of fine-grained action detection.
Conventional methods for video analysis tasks, such as action recognition, event detection, and video retrieval, typically use hand-crafted features, such as Histogram of Oriented Gradients (HOG), Motion Boundary Histogram (MBH), and Histogram of Optical Flow (HOF). One method computes Improved Dense Trajectories (IDT) on each input video, then computes a Fisher vector for the video and performs classification using a support vector machine (SVM). In fact, shallow architectures using Fisher vectors yield good results for action and activity recognition.
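As a rough illustration of what a hand-crafted motion feature looks like, the sketch below computes a simplified histogram of optical flow: each flow vector votes into an orientation bin, weighted by its magnitude. This is a simplification under stated assumptions, not the HOF descriptor as used with dense trajectories, which is computed over spatio-temporal cells and is typically combined with other descriptors. The function name, the choice of 8 bins, and the normalization are illustrative.

```python
import math


def histogram_of_optical_flow(flow, num_bins=8):
    """Simplified orientation histogram of flow vectors.

    flow is an iterable of (dx, dy) displacement vectors. Each vector
    adds its magnitude to the bin covering its direction. The result is
    normalized to sum to 1 unless there is no motion at all.
    """
    hist = [0.0] * num_bins
    bin_width = 2 * math.pi / num_bins
    for dx, dy in flow:
        magnitude = math.hypot(dx, dy)
        if magnitude == 0.0:
            continue  # zero motion has no direction
        angle = math.atan2(dy, dx) % (2 * math.pi)
        # The trailing modulo guards against angle rounding up to 2*pi.
        hist[int(angle / bin_width) % num_bins] += magnitude
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist


# Two vectors pointing roughly right, one pointing roughly left.
example_flow = [(1.0, 0.1), (1.0, 0.1), (-1.0, 0.1)]
hist = histogram_of_optical_flow(example_flow)
```

In a pipeline of the kind described above, many such local descriptors would be aggregated into a single fixed-length vector per video (for example, a Fisher vector) before being passed to a classifier such as an SVM.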
The results can be improved when hand-crafted features such as the ones mentioned above are replaced by “deep” features that are determined by neural networks. The input to the neural networks can include images and stacked optical flow along trajectories. One method uses a two-stream network, in which images (a first stream) and stacked optical flow fields that are determined over a small number of images (a second stream) are input to a deep neural network for action recognition. A similar architecture can be used to incorporate spatial localization into the task of action recognition in temporally trimmed videos. However, these networks do not learn long-term sequence information from videos.
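The notion of "stacked optical flow" can be sketched as follows. Assuming that L consecutive flow fields are available, each storing a horizontal and a vertical displacement per pixel, the second stream's input can be formed by treating each component of each field as a separate channel, giving 2L channels. The function name and the interleaved channel ordering are assumptions made for illustration; a given two-stream implementation may order or preprocess the channels differently.

```python
def stack_optical_flow(flow_fields):
    """Stack L flow fields into 2*L single-component channels.

    flow_fields is a list of L fields, each an H x W grid (list of rows)
    of (dx, dy) tuples. The result is a list of 2*L channels, each an
    H x W grid of floats, ordered dx_1, dy_1, dx_2, dy_2, ...
    """
    channels = []
    for field in flow_fields:
        channels.append([[dx for dx, _ in row] for row in field])
        channels.append([[dy for _, dy in row] for row in field])
    return channels


# Two hypothetical 1 x 2 flow fields (L = 2), so 4 channels result.
fields = [
    [[(0.5, 0.0), (1.0, -1.0)]],
    [[(0.0, 2.0), (-0.5, 0.5)]],
]
stacked = stack_optical_flow(fields)
print(len(stacked))  # 4
```

Because L is small, such an input covers only a short temporal window, which is consistent with the observation that these networks do not capture long-term sequence information.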
Recurrent Neural Networks
Because recurrent neural networks (RNNs) can learn long-term sequence information in a data-driven manner, RNNs have been used for action recognition. A 3D convolutional neural network followed by a Long Short-Term Memory (LSTM) classifier can be used for action recognition. LSTMs can improve performance over a two-stream network for action recognition. Bi-directional LSTMs have been used to recognize actions from a sequence of three-dimensional human joint coordinates.
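To indicate how an LSTM carries information across time steps, the following minimal sketch implements one step of a standard LSTM cell with a single scalar unit and then runs it over a sequence of per-frame feature values. It is a sketch of the general LSTM mechanism only, not of any specific prior method; the scalar formulation, the parameter layout in `params`, and the function names are assumptions chosen for brevity.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM cell.

    params maps each of "input", "forget", "output", "candidate" to a
    tuple (w_x, w_h, b) of input weight, recurrent weight, and bias.
    Returns the new hidden state h and cell state c.
    """
    def pre_activation(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre_activation("input"))        # how much new content to write
    f = sigmoid(pre_activation("forget"))       # how much old memory to keep
    o = sigmoid(pre_activation("output"))       # how much memory to expose
    g = math.tanh(pre_activation("candidate"))  # proposed new content
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


def run_sequence(features, params):
    """Feed a sequence of scalar features through the cell, in order."""
    h, c = 0.0, 0.0
    for x in features:
        h, c = lstm_step(x, h, c, params)
    return h, c


# Hypothetical parameters: all weights and biases set to zero.
zero_params = {gate: (0.0, 0.0, 0.0)
               for gate in ("input", "forget", "output", "candidate")}
h, c = lstm_step(1.0, 0.0, 2.0, zero_params)
```

The cell state `c` is what allows information from early frames to influence the output many steps later; a bi-directional variant would run a second cell over the reversed sequence and combine the two hidden states.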
Methods that use deep neural networks and LSTMs for action recognition perform only slightly better than methods that use shallow Fisher vectors generated from hand-crafted features.
Although substantial progress has been made in action recognition, not as much work has been done on action detection, i.e., temporal or spatio-temporal localization of actions in longer videos that are not temporally trimmed. Tracking has been used to help with spatial localization of actions in sports videos. There, proposed trajectories are generated, and then hand-crafted features are determined over the trajectories.
Using annotations for the objects being interacted with, or enforcing a grammar over the high-level activities being performed, is generally helpful, although those techniques can require learning extra detectors for objects and having prior knowledge about the high-level activities.
For fine-grained action detection, extracting dense trajectories from spatio-temporal regions of interest or using trajectories of a person's hands can significantly improve performance.
One of the main deficiencies of prior-art methods for automatic analysis of actions in a video is a lack of focus on action detection. Instead, most prior methods focus on action recognition, which means that most methods cannot localize an action temporally or spatio-temporally. This may be because action recognition is an easier problem than action detection.
However, action recognition has much less practical value than action detection, because temporally trimming a video segment to include just a single action, which is a prerequisite for action recognition, requires that the action already be detected. Temporally untrimmed videos are much more common in real applications.
Another deficiency of prior-art methods for action detection is relatively low accuracy. That is, the performance of prior-art action detection methods is not good enough for most computer vision applications.