Identification and classification of observed actions is of interest in machine learning applications because it provides an efficient mechanism for training robots to perform actions. For example, a robot can observe a person's interaction with an object and identify and store the actions that make up the interaction. The stored data can then be used by the robot to later replicate the observed interaction. Some interactions, commonly described as composite actions, are combinations of simple actions performed sequentially. These composite actions are difficult to identify and classify for later performance by the robot. In particular, modeling and recognizing composite actions is cumbersome because it requires data for each of many possible composite actions to be collected and stored.
Commonly, a robot uses a database of information about observed interactions to later perform those interactions. For example, the database includes data about actions which can be performed on an object, such as “pick up,” “lift,” “throw,” “break,” “give,” etc. Including information about composite actions in such a database requires extensive training data and storage space because each composite action comprises multiple simple actions performed sequentially. However, data for simple actions, such as “push,” “pull,” “lift,” “drop,” and “put,” can be readily collected and stored because the number of simple actions is limited and less training data is needed to accurately describe each of them. Hence, it is more efficient to store data describing simple actions and to then determine the combination of stored simple actions corresponding to an observed interaction. This allows a composite action to be reproduced by sequentially performing the simple actions that it comprises.
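The storage argument above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each simple action is stored under its name and that a composite action is recorded as an ordered tuple of those names. The identifiers `SIMPLE_ACTION_DATA`, `count_composites`, and `replay`, and the placeholder strings standing in for stored action data, are hypothetical and do not come from any described system.

```python
# Hypothetical store of simple-action data. The values are placeholders for
# whatever trained model or parameters a real system would keep per action.
SIMPLE_ACTION_DATA = {
    "push": "model:push",
    "pull": "model:pull",
    "lift": "model:lift",
    "drop": "model:drop",
    "put": "model:put",
}


def count_composites(num_simple, max_length):
    """Count ordered sequences of 1..max_length simple actions."""
    return sum(num_simple ** k for k in range(1, max_length + 1))


def replay(composite):
    """Look up the stored data for each simple action in a composite, in order."""
    missing = [name for name in composite if name not in SIMPLE_ACTION_DATA]
    if missing:
        raise KeyError(f"no stored data for simple actions: {missing}")
    return [SIMPLE_ACTION_DATA[name] for name in composite]


# A composite action is described only by the order of its simple actions.
pick_and_place = ("lift", "put")
print(replay(pick_and_place))
print(count_composites(len(SIMPLE_ACTION_DATA), 3))
```

Under these assumptions, five stored entries are enough to describe all 155 ordered sequences of up to three simple actions, whereas a database of composite actions would need a separate entry, with its own training data, for each sequence of interest.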
One existing technique for action recognition, as described in Fern, A., et al., “Specific-to-general learning for temporal events with application to learning event definitions from video,” Journal of Artificial Intelligence Research, vol. 17, pp. 379-449, which is incorporated by reference herein in its entirety, hard-codes algorithms into a system and determines support, contact and attachment relations between objects by applying the coded algorithms to video sequences. However, this technique has limited applicability because hard-coded algorithms restrict the types of actions that can be recognized. Another technique, as described in Rao, C., “View-invariant representation and recognition of actions,” International Journal of Computer Vision, 50(2):203-226, which is incorporated by reference herein in its entirety, recognizes composite actions by representing the speed and direction of a hand trajectory; however, this technique cannot recognize some types of actions, such as those with minimal hand movement (e.g., “drop”), and cannot model interaction between an object and a hand. An alternative technique, as described in Philipose, M., et al., “Inferring activities from interactions with objects,” IEEE Pervasive Computing, pp. 50-57, which is incorporated by reference herein in its entirety, recognizes actions by tracking Radio-Frequency Identification (RFID) tags on a person's hand and on objects; however, RFID-based approaches are invasive and require mounting transmitters and sensors on the observed person and objects. These existing approaches can recognize only a limited number of composite actions and cannot be readily modified to identify additional composite actions.
What is needed is a system and method for identifying and classifying observed actions using stored data associated with simple actions.