1. Field of Invention
The present patent document is directed towards systems and methods for generating and using optical flow-based features.
2. Description of the Related Art
Vision-based action recognition has wide application. For example, vision-based action recognition may be used in driving safety, security, signage, home care, robot training, and other applications.
One important application of vision-based action recognition is Programming-by-Demonstration (PbD) for robot training. In Programming-by-Demonstration, a human demonstrates a task that a robot is to repeat. While the human demonstrates the task, the demonstration process is captured in one or more videos by camera sensors. These videos are segmented into individual unit actions, and the action type is recognized for each segment. The recognized actions are then translated into robotic operations for robot training.
To recognize unit actions from video segments, reliable image features are extremely important. To be effective, the image features ideally should satisfy a number of criteria. First, they should be able to identify actions in different demonstration environments. Second, they should support continuous frame-by-frame action recognition. Third, they should have low computational costs.
Prior attempts at feature matching include at least two types: temporal-template-based feature matching and local feature matching. Temporal-template-based feature matching includes such methods as moving object silhouettes, average flow frame, motion energy image, and motion history image. These methods typically work well for simple actions. However, they have some significant drawbacks. For example, they typically require object detection or background subtraction, and they require time-warping to handle variable action duration for recognition. Such methods are also difficult to apply for continuous action recognition.
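By way of illustration only, one temporal template, the motion history image, may be sketched as follows. This is a minimal sketch of the commonly described update rule, not an implementation taken from any particular prior system. The function names, the row-major list-of-lists frame layout, and the use of simple frame differencing in place of object detection or background subtraction are all assumptions made for this example.

```python
def frame_difference_mask(previous_frame, current_frame, threshold):
    """Mark pixels whose intensity changed by more than `threshold`.

    Assumption: frame differencing stands in for the object detection or
    background subtraction step that temporal templates typically require.
    """
    return [
        [abs(cur - prev) > threshold for prev, cur in zip(prev_row, cur_row)]
        for prev_row, cur_row in zip(previous_frame, current_frame)
    ]


def update_motion_history(history, motion_mask, duration):
    """Apply one motion history image update step.

    A pixel that moved in the current frame is set to `duration`;
    every other pixel decays by one per frame, but never below zero.
    """
    return [
        [duration if moving else max(0, previous - 1)
         for previous, moving in zip(history_row, mask_row)]
        for history_row, mask_row in zip(history, motion_mask)
    ]


# Hypothetical 1x3 grayscale frames: motion appears at pixel 0, then pixel 1.
frames = [[[0, 0, 0]], [[10, 0, 0]], [[10, 10, 0]]]
history = [[0, 0, 0]]
for previous_frame, current_frame in zip(frames, frames[1:]):
    mask = frame_difference_mask(previous_frame, current_frame, threshold=5)
    history = update_motion_history(history, mask, duration=3)
```

In this sketch the template depends on a fixed `duration` and on a motion mask, which is consistent with the drawbacks noted above: an action performed at a different speed yields a different template unless it is time-warped, and the mask must be obtained by a separate detection or subtraction step.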
Local feature matching includes such methods as histogram of oriented optical flow (HOOF) and spatial-temporal interest point (STIP). These methods tend to have the benefit of being fast and more robust on dynamic backgrounds. However, the features that these methods produce also tend to be extremely sparse for smooth actions. In fact, some actions do not produce distinctive features. Also, these methods tend to have large quantization error.
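By way of illustration only, a simplified histogram of oriented optical flow for a single frame may be sketched as follows. This is a minimal sketch under stated assumptions: the function name `hoof`, the parameters `num_bins` and `min_magnitude`, and the representation of the flow field as a list of (u, v) pairs are hypothetical, and the binning covers the full circle of directions rather than any mirror-symmetric binning that a published HOOF formulation may use.

```python
import math


def hoof(flow_vectors, num_bins=8, min_magnitude=1e-6):
    """Simplified histogram of oriented optical flow for one frame.

    Each (u, v) flow vector votes for the bin containing its direction,
    weighted by its magnitude. The histogram is normalized to sum to 1
    when any motion is present, and is all zeros otherwise.
    """
    histogram = [0.0] * num_bins
    bin_width = 2.0 * math.pi / num_bins
    for u, v in flow_vectors:
        magnitude = math.hypot(u, v)
        if magnitude < min_magnitude:
            continue  # Ignore pixels with no measurable motion.
        angle = math.atan2(v, u) % (2.0 * math.pi)  # Map to [0, 2*pi).
        # Clamp guards against rounding that lands exactly on 2*pi.
        index = min(int(angle / bin_width), num_bins - 1)
        histogram[index] += magnitude
    total = sum(histogram)
    return [h / total for h in histogram] if total > 0 else histogram


# Hypothetical flow field: one vector pointing right, one pointing down.
feature = hoof([(1.0, 0.1), (0.1, -2.0)], num_bins=4)
```

Because each vector contributes to exactly one bin in this sketch, two flow directions that differ only slightly but fall on opposite sides of a bin boundary produce different histograms, which is one way the quantization error noted above can arise.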
Accordingly, systems and methods are needed that provide improved image feature representation.