An enormous amount of video and image data is generated and shared every day, thanks to the ubiquity of digital cameras and Internet applications such as social networks. Video and image data constitutes a large portion of Internet content. Video is one of the most complex media formats available in the digital age: a video generally includes at least one audio track, some metadata, and thousands of image frames. With such an overwhelmingly large amount of video data available, a need has arisen to understand this data automatically. For example, some videos may include content that is not appropriate for certain groups of people, such as content involving nudity, violence, extremism, firearms, alcohol, or tobacco, and thus may not be suitable for association with certain commercial content. By understanding the content of a video, one may determine whether the video is “brand safe” and therefore suitable for monetization, such as by incorporating commercial content associated with a brand.