A temporal segment of a video is a contiguous set of frames from frame f1 to frame f2, where frame f1 does not come after frame f2; i.e., f1≤f2. Other terms such as temporal interval or time interval may also be used to refer to a temporal segment. The length of a temporal segment refers to the number of frames in that segment. Two temporal segments are called non-overlapping when no frame belongs to both segments. Two non-overlapping temporal segments may also be called disjoint segments.
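The definitions above can be illustrated with a minimal sketch. It assumes that a segment is represented as a pair (f1, f2) of inclusive frame indices; the function names segment_length and are_disjoint are hypothetical and are used here only for illustration.

```python
def segment_length(f1, f2):
    """Number of frames in the segment from f1 to f2, both inclusive."""
    if f1 > f2:
        raise ValueError("f1 must not come after f2")
    return f2 - f1 + 1


def are_disjoint(seg_a, seg_b):
    """True when no frame belongs to both segments (assumed inclusive pairs)."""
    a_start, a_end = seg_a
    b_start, b_end = seg_b
    return a_end < b_start or b_end < a_start


print(segment_length(1, 60))
print(are_disjoint((1, 60), (61, 120)))
print(are_disjoint((1, 60), (31, 90)))
```

Under this inclusive convention, a segment with f1 equal to f2 has a length of one frame.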
Fixed length segmentation is the act of segmenting the video sequence into temporal segments of a fixed non-zero length (e.g., 60 frames). Fixed length segmentation may be done with non-zero temporal overlap, in which case some frames may be part of two different segments. For example, when segmenting a video sequence into fixed length segments of 60 frames with 50% temporal overlap, the first temporal segment includes frames 1 to 60, the second temporal segment includes frames 31 to 90, and so on.
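The example above can be reproduced with the following sketch. It assumes 1-based inclusive frame indices, an overlap expressed as a fraction of the segment length, and that a trailing partial segment is simply dropped; none of these choices is specified in the text, and the function name fixed_length_segments is hypothetical.

```python
def fixed_length_segments(num_frames, length=60, overlap=0.5):
    """Return (start, end) pairs of fixed length segments, 1-based inclusive.

    Assumption: only full-length segments are returned, so frames left over
    at the end of the video are not covered by a shorter final segment.
    """
    if length <= 0:
        raise ValueError("length must be non-zero and positive")
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in the range [0, 1)")
    # Consecutive segments start this many frames apart.
    stride = max(1, length - int(round(length * overlap)))
    segments = []
    start = 1
    while start + length - 1 <= num_frames:
        segments.append((start, start + length - 1))
        start += stride
    return segments


print(fixed_length_segments(150, length=60, overlap=0.5))
```

For a 150-frame video, this yields the segments 1 to 60, 31 to 90, 61 to 120 and 91 to 150, matching the first two segments given in the example.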
The term action as used below refers to the act of doing something, such as ‘walking’, ‘kicking’, ‘cutting’, often in order to make something happen. The term action segment as used below refers to the temporal segment that contains an instance of an action of interest.
Temporal segmentation of an action, which may also be referred to as action localization, is the task of determining the temporal segment (i.e., action segment) that contains the action of interest. Thus, temporal segmentation of an action includes the two sub-tasks of finding the start and the end frames of the temporal segment and finding the action classification label associated with that segment.
A prior-art method for temporal segmentation of an action, called sliding window search, trains a classifier for the action of interest, using a given training set containing segmented instances of the action of interest. The trained classifier is then applied to a set of fixed length and often overlapping temporal segments of a new (unseen) video. The length of the segments (e.g., 100 frames) and the ratio of overlap (e.g., 25%) are predetermined. The segments containing the action of interest (if any) are then identified using non-max suppression, which greedily selects the segments with the highest scores. Non-max suppression is a local maxima search with a predetermined threshold. A disadvantage of using a sliding window search is that the precision of localization depends on the resolution of the search and therefore on the number of evaluated temporal segments. Also, as the final segmentation is done locally and using a greedy algorithm, the generated temporal segments are not jointly optimized.
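The greedy selection step may be sketched as follows. This is one common form of temporal non-max suppression and is not necessarily the exact procedure of any particular prior-art method. It assumes that the classifier scores are already available as (start, end, score) triples with inclusive frame indices, and that suppression is decided by temporal intersection-over-union against a second threshold; the names temporal_iou, non_max_suppression, score_threshold and iou_threshold are hypothetical.

```python
def temporal_iou(seg_a, seg_b):
    """Intersection-over-union of two inclusive (start, end) frame segments."""
    inter = min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]) + 1
    if inter <= 0:
        return 0.0
    len_a = seg_a[1] - seg_a[0] + 1
    len_b = seg_b[1] - seg_b[0] + 1
    return inter / (len_a + len_b - inter)


def non_max_suppression(scored, score_threshold=0.5, iou_threshold=0.0):
    """Greedily keep the highest-scoring segments.

    A segment is kept when its score reaches score_threshold and its IoU with
    every previously kept segment does not exceed iou_threshold (assumption).
    """
    kept = []
    for start, end, score in sorted(scored, key=lambda s: s[2], reverse=True):
        if score < score_threshold:
            break  # remaining segments score even lower
        if all(temporal_iou((start, end), (k[0], k[1])) <= iou_threshold
               for k in kept):
            kept.append((start, end, score))
    return sorted(kept)  # report in temporal order


# Hypothetical scores for 100-frame windows with 25% overlap.
windows = [(1, 100, 0.2), (76, 175, 0.9), (151, 250, 0.7),
           (226, 325, 0.6), (301, 400, 0.1)]
print(non_max_suppression(windows))
```

In this sketch each decision depends only on the segments already kept, which illustrates why the selected segments are not jointly optimized: the window from frame 151 to 250 is discarded because of its higher-scoring neighbour, even if a different combination of segments would describe the video better.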
Parsing videos of actions is the task of decomposing a video sequence into action segments, and is a very challenging task, since the number of constituent actions is not known a priori. Different instances of the same action may have very different durations, and different actions of interest may also have very different durations. For example, while repetitive actions like walking and running may last for a few seconds to many seconds, snap actions like kicking and falling may last only for a fraction of a second to a few seconds. In addition, human action recognition from videos is generally hard due to variation in size, scale and viewpoint, as well as object deformation and occlusion. Also, without knowing the temporal segmentation, a part of one action (e.g., a stride in a walking action) may look similar to a different action (e.g., a kicking action).