With the enormous amount of video content generated or shared by people using various electronic devices (e.g., smart phones, digital cameras, and digital camcorders), there has been a pressing need to automatically discover semantic information, such as certain actions, from untrimmed videos for applications such as video editing, video tagging, video searching, and video surveillance. In many cases, the untrimmed videos may be long videos (e.g., surveillance videos) that include multiple actions of interest (each of which can be relatively short) along with background scenes or activities. Manually localizing the actions of interest in long videos can be time-consuming and costly. Temporal action localization techniques have begun to be used to automatically determine whether a video stream includes specific actions (e.g., human activities) and to identify the temporal boundaries (e.g., start time and end time) of each action.
Due to the rapid development of artificial neural networks and machine learning in recent years, many temporal action localization techniques use models (e.g., neural networks) generated using machine learning techniques to recognize actions in videos and localize the start time and end time of each action. However, many machine learning techniques require large amounts of training data to train the models. For example, for some supervised learning systems to perform well, hundreds, thousands, or more labeled training samples are needed. In many circumstances, labeled training samples, in particular labeled video training samples, are very difficult, time-consuming, and/or expensive to obtain. Without sufficient training data, a model may not be as accurate or robust as desired. As such, due to the limited available training samples, models generated for temporal action localization may not perform as well as other models, such as models for object detection from still images (for which sufficient training data is generally available). It is thus a challenging task to train a temporal action localization model using limited labeled training video samples.