Visual activity recognition, the automatic process of recognizing semantic spatio-temporal target patterns such as “person carrying” and “vehicle u-turn” from video data, has been an active research area in the computer vision community for many years. Recently, the focus in the community has shifted toward recognizing activities/actions over large time-scales, wide-area spatial resolutions, and multi-source multi-modal frequencies in real-world operating conditions. It is assumed here that a pattern is bounded by event changes, and that target movement between events is an “activity”. In such conditions, a major challenge arises from the large intra-class variations in activities/events, including variations in sensors (e.g., view-points, resolution, scale), targets (e.g., visual appearance, speed of motion), and environment (e.g., lighting condition, occlusion, and clutter). Recognizing activities in overhead imagery poses many more challenges than recognizing them from a fixed ground-level camera, largely because of the imagery's low resolution. Additionally, the need for video stabilization introduces noise and creates tracking and segmentation difficulties for activity recognition.
L. Xie, L. Kennedy, S.-F. Chang, A. Divakaran, H. Sun, and C.-Y. Lin, Discovering Meaningful Multimedia Patterns With Audio-Visual Concepts and Associated Text, in Image Processing, ICIP, 2004, the contents of which are incorporated herein by reference, proposed a method for discovering meaningful structures in video through unsupervised learning of temporal clusters and associating the structures with metadata. For a news-domain model, they presented a co-occurrence analysis among structures and observed that temporal models are indeed better at capturing the semantics than non-temporal clusters. Using data from digital TV news, P. Duygulu and H. D. Wactlar, Associating Video Frames with Text, in the 26th ACM SIGIR Conference, 2003, the contents of which are incorporated herein by reference, proposed a framework to determine the correspondences between the video frames and associated text in order to annotate the video frames with more reliable labels and descriptions. The semantic labeling of videos enables a textual query to return more relevant corresponding images, and enables an image-based query response to provide more meaningful descriptors (i.e., content-based image retrieval).
Streaming airborne Wide Area Motion Imagery (WAMI) and Full-Motion Video (FMV) sensor collections afford on-line analysis for various surveillance applications, such as monitoring crowded traffic scenes. In a Layered Sensing framework, a subset of these sensors may be used to simultaneously observe a region of interest to provide complementary capabilities, including multi-band measurements, perspective diversity, and/or improved resolution for improved target discrimination, identification, and tracking. Typically, forensic analysis, including pattern-of-life detection and activity/event recognition, is conducted off-line because of the huge volumes of imagery. This volume of data outpaces the time users have available to watch all of the videos in search of key activity patterns.
A need, therefore, exists for robust and efficient computer vision, pattern analysis, and data mining tools to aid users in detecting patterns in aerial imagery.