Automated recognition of human actions in video clips has many useful applications, including surveillance, health care, human-computer interaction, computer games, and telepresence. In general, a trained action classifier (model) processes the video clips to determine whether a particular action takes place.
To learn an effective action classifier model, previous approaches rely on a significant amount of labeled training data. In general, a model trained this way works well on the dataset it was trained on, but not on another dataset. For example, the background, lighting, and so forth may differ across datasets.
As a result, to recognize actions in a different dataset, previous approaches have retrained the model using newly labeled training data. However, labeling video sequences is a very tedious and time-consuming task, especially when detailed spatial locations and time durations are needed. For example, when the background is cluttered and multiple people appear in the same frame, the labelers need to provide a bounding box for every subject, together with the starting and ending frames of each action instance. For a video that is several hours long, the labeling process may take on the order of weeks.