1. Field of Invention
The present patent document is directed towards systems and methods for segmentation and recognition of actions, including action transitions.
2. Description of the Related Art
Vision-based action recognition has wide application. For example, vision-based action recognition may be used in driving safety, security, signage, home care, robot training, and other applications.
One important application of vision-based action recognition is Programming-by-Demonstration (PbD) for robot training. In Programming-by-Demonstration, the task to be trained is often decomposed into primitive action units. For example, a human demonstrates a task that a robot is to repeat. While the human demonstrates the task, the demonstration is captured by sensors, such as by one or more cameras recording video. These videos are segmented into individual unit actions, and the action type is recognized for each segment. The recognized actions are then translated into robotic operations for robot training.
To recognize unit actions from video segments, reliable image features are important. To be effective, the image features ideally should satisfy a number of criteria. First, they should be able to identify actions in different demonstration environments. Second, they should support continuous frame-by-frame action recognition. And third, they should have low computational costs.
To segment and recognize unit actions from such videos, a proper classification algorithm is needed. The classifier used for such applications should satisfy several conditions. First, it should be able to model the temporal image patterns produced by actions in controlled environments. Second, it should be capable of continuous action recognition. And third, it should be able to generalize from small training data sets. Existing techniques typically do not satisfy all of these criteria.
Continuous action recognition is a challenging computer vision task because multiple actions are performed sequentially without clear boundaries, which requires action segmentation and action classification to be performed simultaneously. In some simplified scenarios, action boundary segmentation and action type recognition may be approached separately, or the temporal recognition problem may be converted into identifying representative static templates, such as those that match a unique hand shape, hand orientation, or hand location for hand gesture or sign language recognition.
For temporal pattern classification, existing works may be categorized as template-based or model-based. A template may consist of features extracted from “key-frame” or exemplar sequences, and the classification/matching process usually requires task-specific similarity measures. Compared to a template-based classifier, the model-based approach allegedly provides more flexibility and generality. Among popular model-based methods that are capable of continuous action recognition, the Switching Linear Dynamic System (SLDS) has been offered to describe complex temporal patterns and has shown advantages over other methods, such as Hidden Markov Model, Conditional Random Field (CRF), Switched Autoregressive, and Gaussian process. In SLDS, a set of Linear Dynamic Systems (LDSs) is trained, one per action, to model the individual actions, and the transitions between these LDSs are inferred by higher-level statistics to recognize continuous actions. SLDS provides an intuitive framework for describing the continuous but non-linear dynamics of real-world motion, and has proven effective for texture analysis and synthesis and for recognizing bee dancing and human actions when an accurate motion representation is available, such as kinematic model parameters, joint trajectory, and joint angle trajectory.
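For reference, the SLDS structure described above may be written in a commonly used state-space form. The following is a sketch only; the symbols (discrete action label s_t, hidden state x_t, observation y_t, matrices A, C, Q, R, and transition table Π) are notation assumed here for illustration and do not appear in the cited works as quoted.

```latex
\begin{aligned}
  P(s_t = j \mid s_{t-1} = i) &= \Pi_{ij}
      && \text{(switching between action models)} \\
  x_t &= A_{s_t}\, x_{t-1} + w_t, \quad w_t \sim \mathcal{N}(0, Q_{s_t})
      && \text{(linear dynamics of the active action)} \\
  y_t &= C\, x_t + v_t, \quad v_t \sim \mathcal{N}(0, R)
      && \text{(observation of the hidden state)}
\end{aligned}
```

In this form, each value of s_t selects one LDS, and the table Π plays the role of the higher-level statistics that govern switching between the LDSs.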
However, SLDS applies the dynamics learned within individual actions to estimate the transitions between multiple actions, which leads to at least three limitations. First, the transition of the state sequence within an action follows different patterns from the transition between actions. Second, the action transition prior used by SLDS is based only on a simple co-occurrence count between actions; that is, it is only a scalar value and, once trained, is independent of the observations. And third, in SLDS, it is difficult to distinguish a state transition within an action primitive from a transition between two repeating primitives. While the third limitation may be negligible in some instances where an action can continue for any duration, sufficiently robust systems should be able to segment and classify repeated action primitives.
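The second limitation may be illustrated with a minimal sketch of a co-occurrence-based transition prior. The function name, the action labels, and the demonstration data below are hypothetical and chosen only for illustration; this is not code from any SLDS implementation. The sketch counts how often one action label directly follows another in labeled training sequences and normalizes the counts.

```python
from collections import Counter, defaultdict


def cooccurrence_transition_prior(sequences):
    """Estimate P(next action | current action) from consecutive-label counts.

    `sequences` is a list of action-label sequences, one label per segment.
    (Hypothetical helper for illustration only.)
    """
    counts = defaultdict(Counter)
    labels = set()
    for seq in sequences:
        labels.update(seq)
        # Count each adjacent (current, next) pair of action labels.
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1

    prior = {}
    for prev in sorted(labels):
        total = sum(counts[prev].values())
        # Labels never followed by anything get an all-zero row.
        prior[prev] = {
            nxt: (counts[prev][nxt] / total if total > 0 else 0.0)
            for nxt in sorted(labels)
        }
    return prior


# Hypothetical labeled demonstrations (assumed data, not from the source).
demos = [
    ["reach", "grasp", "move", "release"],
    ["reach", "grasp", "release"],
]
prior = cooccurrence_transition_prior(demos)
print(prior["grasp"])
```

Each entry of the resulting table is a single fixed number per pair of actions. Once computed from the training labels, it does not change with the image observations at recognition time, which is the limitation noted above.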
Accordingly, systems and methods are needed that provide more dynamic and improved temporal pattern classification.