There is a need to analyze the large amounts of video captured daily by surveillance systems. In the past, this analysis was carried out manually. Since the amount of video data can be large, it would be desirable to automate and speed up the analysis process. Most existing approaches to video summarization in the prior art are designed for entertainment content, where the video is generally scripted and edited to capture an audience's attention. In such circumstances, appearance and appearance changes that are easy to observe can be captured using simple tools, such as converting video content to histograms.
Surveillance video, however, especially aerial surveillance video, has fewer dramatic changes of appearance than entertainment video. Furthermore, surveillance video lacks pre-defined entities such as shots and scenes, and other structural elements such as dialogues and anchors. Automated systems in the prior art for image/video understanding associate keywords, especially nouns, with a video image. Unfortunately, systems that use noun-based annotations as keywords are inherently incapable of capturing spatial and temporal interactions among semantic objects in a video.
Accordingly, what would be desirable, but has not yet been provided, is a system and method for effectively and automatically converting video to text and/or speech to summarize the video.