Field
Embodiments presented herein provide techniques for detecting action in recorded video and, more specifically, a weakly-supervised structured learning technique for recognizing and localizing actions in video.
Description of the Related Art
The problem of jointly classifying and localizing human actions in video has typically been treated in the same manner as object recognition and localization in images. As used herein, “localizing” may include finding an enclosing spatio-temporal volume or spatio-temporal extent. However, object recognition and localization in images cannot be easily extended to the temporal domain, as is required when classifying actions in video. Some challenges may include: (1) dealing with motion of the actor within the frame, resulting from the camera's motion or the actor's own motion in the world; (2) the complexity of the resulting spatio-temporal search, which requires a search over the space of temporal paths; (3) the need to model the coarse temporal progression of the action and the action context; and (4) learning in the absence of direct annotations for the position of the actor(s) within the frame.
Traditional techniques for detecting actions in video typically use holistic bag-of-words (BoW) models, or models that search for sub-volumes that are axis-aligned or purely discriminative. Holistic BoW techniques generally extract different features from a video, cluster them, and then attempt to find the frequency of the resulting “words” within a given video. Such models do not allow for spatial (and often temporal) localization of actions. Models that search for sub-volumes do allow for localization but largely assume a static subject and camera. Further, methods that allow localization typically require bounding-box annotations at training time.
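By way of illustration only, the following Python sketch shows one way a holistic BoW representation of a video might be computed, and why such a representation does not by itself support localization. The function name bow_histogram, the two-dimensional descriptors, and the two-entry codebook are hypothetical and chosen for brevity. The codebook is assumed to have been obtained beforehand by clustering training descriptors, and the sketch is not taken from any particular prior system.

```python
from collections import Counter
from math import dist


def bow_histogram(descriptors, codebook):
    """Return the normalized frequency of each codebook "word" in one video.

    descriptors: list of feature vectors extracted from the video (hypothetical).
    codebook: list of cluster centers, assumed learned beforehand by clustering.
    """
    if not descriptors:
        return [0.0] * len(codebook)
    counts = Counter()
    for d in descriptors:
        # Assign the descriptor to its nearest cluster center (its "word").
        nearest = min(range(len(codebook)), key=lambda k: dist(d, codebook[k]))
        counts[nearest] += 1
    total = len(descriptors)
    return [counts[k] / total for k in range(len(codebook))]


# Hypothetical toy data: a two-word codebook and three descriptors.
codebook = [(0.0, 0.0), (1.0, 1.0)]
video_a = [(0.1, 0.0), (0.9, 1.1), (1.0, 0.9)]
# Same descriptors in a different order, standing in for features that
# occur at different positions or times in the video.
video_b = list(reversed(video_a))

hist_a = bow_histogram(video_a, codebook)
hist_b = bow_histogram(video_b, codebook)
print(hist_a == hist_b)
```

In this sketch the two histograms are identical even though the descriptors are ordered differently, because only the word frequencies are retained. Information about where and when each feature occurred is discarded, which is consistent with the limitation noted above.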