Advances in video capture technologies have allowed an average consumer to become a producer of media content using handheld devices, such as mobile phones and camcorders. This media content can be shared through a media sharing technology, for example, a social network, a media content sharing site, email, messaging, or other suitable technologies. Oftentimes, it is advantageous to classify a video, or portions of the video, according to the action(s) it contains, such as running, diving, throwing, kicking, falling, swinging, or any other suitable action. For example, a video of a child throwing a football at a little league football game can be classified as football or throwing. Conventionally, this classification is performed either manually or automatically using global metadata associated with the entire video, such as title, location, interest points, or any other suitable metadata. However, these approaches ignore localized spatial and temporal information associated with an action in the frames of a video.