In computer vision and camera surveillance applications, a frequent problem is recognizing and detecting certain actions performed by objects such as people, machinery, vehicles, and robots. There has been a fair amount of work on the general problem of analyzing actions in videos, but most prior-art work has focused on action recognition rather than on action detection.
Action recognition refers to classifying, i.e., recognizing, which action is being performed in a video segment that has been temporally trimmed so that the segment starts at or near the beginning of an action and ends at or near the end of the action. We use the term temporally trimmed to refer to such video segments. Action detection refers to the temporal or spatio-temporal localization of every occurrence of each action, from a known set of action classes, in a long, i.e., not temporally trimmed, video sequence.
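To make the distinction concrete, the following is a minimal sketch of what a temporal detection output can look like. It assumes, purely for illustration, that some classifier has already assigned one label per frame of an untrimmed video; the function name `frames_to_segments`, the `"background"` label, and the example action names are hypothetical and are not taken from any particular prior-art method.

```python
def frames_to_segments(frame_labels, background="background"):
    """Group per-frame labels into (start_frame, end_frame, label) segments.

    Assumption: frame_labels holds one predicted label per frame of an
    untrimmed video, and `background` marks frames with no action.
    Frame indices are 0-based and end_frame is inclusive.
    """
    segments = []
    start = None
    current = background
    for i, label in enumerate(frame_labels):
        if label != current:
            # A run of identical labels just ended; keep it if it was an action.
            if current != background:
                segments.append((start, i - 1, current))
            start = i
            current = label
    # Close the final run if the video ends during an action.
    if current != background:
        segments.append((start, len(frame_labels) - 1, current))
    return segments


labels = ["background", "chop", "chop", "background", "peel", "peel", "peel"]
print(frames_to_segments(labels))
```

In these terms, action recognition would return a single label for a clip that has already been trimmed to one action, whereas action detection returns every labeled segment found in the untrimmed sequence.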
Related to action recognition is the task of activity recognition. In an activity recognition task, a video segment that depicts an activity, such as a particular sport being played, is analyzed, and the goal is to determine which activity (e.g., which sport) is depicted in the video.
Fine-grained action detection refers to action detection in which the differences among the classes of actions to be detected are small. For instance, in a cooking scenario, detecting actions from a set that includes similar actions such as chopping, grating, and peeling is an example of fine-grained action detection. At least one deficiency of prior-art methods for action detection is their relatively low accuracy. That is, the performance of prior-art action detection methods is not good enough for most computer vision applications, among other applications.
The standard pipeline for most video analysis tasks, such as action recognition, event detection, and video retrieval, has been to compute hand-crafted features, such as Histogram of Oriented Gradients (HOG), Motion Boundary Histogram (MBH), and Histogram of Optical Flow (HOF). Conventional approaches rely on computationally expensive input representations such as improved dense trajectories or dense optical flow, create a Fisher vector for each video clip, and then perform classification using support vector machines. At least one main drawback of these previous approaches to action detection and recognition is that they rely on input representations and intermediate representations that are very time-consuming to compute and require a huge amount of memory to store. This makes such conventional methods impractical for real-world action detection applications.
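As an illustration of what a hand-crafted feature of this kind involves, the sketch below computes a magnitude-weighted histogram of gradient orientations for a single image cell, which is the core idea behind HOG. It is a deliberately simplified sketch, not a full HOG implementation: it assumes a grayscale cell given as a list of rows, uses central differences on interior pixels only, and omits the block normalization and interpolation of standard HOG. The function name `orientation_histogram` and the choice of nine bins are assumptions made for this example.

```python
import math


def orientation_histogram(cell, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations.

    Assumption: `cell` is a 2-D list of grayscale intensities (rows of
    floats). Orientations are folded into [0, 180) degrees and the
    histogram is L2-normalized.
    """
    rows, cols = len(cell), len(cell[0])
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            # Central differences approximate the image gradient.
            gx = (cell[y][x + 1] - cell[y][x - 1]) / 2.0
            gy = (cell[y + 1][x] - cell[y - 1][x]) / 2.0
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            index = min(int(angle // bin_width), n_bins - 1)
            hist[index] += magnitude
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0 else hist


# Intensity increases from left to right, so all gradients are horizontal.
ramp = [[float(x) for x in range(8)] for _ in range(8)]
print(orientation_histogram(ramp))
```

HOG-style descriptors compute such histograms densely, over many cells in every frame, and HOF and MBH apply the same kind of binning to optical flow, which itself must first be estimated. This gives a sense of why the resulting representations are costly in both time and memory.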
Therefore, there is a need for action detection methods that can detect actions in a video efficiently, in terms of both time and memory requirements.