(1) Field of Invention
The present invention relates to a behavior recognition system and, more particularly, to a behavior recognition system that utilizes cognitive swarms and fuzzy graphs to identify spatial and temporal relationships between objects detected in video image sequences that are signatures of specific events.
(2) Related Art
Most existing event detection algorithms are very simplistic and domain-specific. Such algorithms were described by N. Oliver and A. Pentland in a publication entitled, “Graphical Models for Driver Behavior Recognition in a Smart Car,” Proc. of IV2000 (hereinafter referred to as “Oliver et al.”), and by S. Hongeng, R. Nevatia, and F. Bremond in a publication entitled, “Video-based event recognition: activity representation and probabilistic recognition methods,” CVIU 96(2004), 129-162 (hereinafter referred to as “Hongeng et al.”).
Current systems first detect moving objects using background subtraction methods, which typically suffer from shadows, occlusions, poor video quality, and the need to specify view-dependent foreground object rules. The scenarios detected are often simple, such as people walking or running, and usually involve only a single object. In addition, past work on event detection has mostly consisted of extracting object trajectories and then applying supervised learning with parameterized action models. For example, Hongeng et al. describes a generic scheme for event modeling that also simplifies the parameter learning task. Actors are detected using probabilistic analysis of the shape, motion, and trajectory features of moving objects. Single-agent events are then modeled using Bayesian networks and probabilistic finite-state machines. Multi-agent events, corresponding to coordinated activities, are modeled by propagating constraints and likelihoods of single-agent events in a temporal logic network.
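For illustration, the background subtraction step described above can be sketched as a simple running-average model. This is a generic, minimal sketch rather than any specific cited system; the frame format, parameter names, and thresholds are assumptions, and the limitations noted above (shadows, occlusions, noise) are deliberately not handled here.

```python
import numpy as np

def detect_foreground(frames, alpha=0.05, threshold=30):
    """Running-average background subtraction (illustrative sketch only).

    frames: iterable of grayscale frames as 2-D uint8 arrays.
    Yields one boolean foreground mask per frame. Real systems must
    additionally cope with shadows, occlusions, and poor video
    quality, which this sketch deliberately ignores.
    """
    background = None
    for frame in frames:
        f = frame.astype(np.float64)
        if background is None:
            background = f  # first frame seeds the background model
        # pixels that differ strongly from the background are foreground
        mask = np.abs(f - background) > threshold
        # slowly blend the current frame into the background model
        background = (1 - alpha) * background + alpha * f
        yield mask
```

A view-dependent rule (e.g., a minimum blob size per camera) would still be needed on top of the raw mask, which is precisely the drawback noted above.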
In Oliver et al., the authors presented a layered probabilistic representation for modeling human activity. The representation is then used to learn and infer user actions at multiple levels of temporal granularity.
Another publication, by A. Amir, S. Basu, G. Iyengar, C. Lin, M. Naphade, J. R. Smith, S. Srinivasan, and B. Tseng, entitled, “A multi-modal system for retrieval of semantic video events,” CVIU 96(2004), 216-236, describes a system for automatic and interactive content-based and model-based detection and retrieval of events and other concepts. Models of semantic concepts are built by training classifiers on training video sequences. These models are then used to classify video segments into concepts such as “water skiing,” “person speaking,” etc.
The work by K. Sato and J. K. Aggarwal, in “Temporal Spatio-velocity transform and its application to tracking and interaction,” CVIU 96(2004), 100-128, describes a novel transformation that elicits pixel velocities from binary image sequences. Basic object interactions, such as “MEET,” “FOLLOW,” “LEAVE,” etc., are then detected using motion-state transitions and shapes of object trajectories.
An approach to detecting when interactions between people occur, as well as classifying the type of interaction, such as “following another person,” is presented by N. Oliver, A. Garg, and E. Horvitz in “Layered representations for learning and inferring office activity from multiple sensory channels,” CVIU 96(2004), 163-180.
In a publication by G. Medioni, I. Cohen, F. Bremond, S. Hongeng, R. Nevatia, entitled, “Event detection and analysis from video streams,” IEEE PAMI 23(8), 2001, 873-889, the authors introduce an approach that takes video input from an airborne platform and produces an analysis of the behavior of moving objects in the scene. In their approach, graphs are used to model scene objects across the frames (i.e., nodes in the graph correspond to objects in contiguous frames).
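The graph construction used by Medioni et al., where nodes correspond to objects in contiguous frames, can be sketched as linking detections in adjacent frames by centroid proximity. This is a minimal illustrative sketch, not their implementation; the data layout and the distance threshold are assumptions.

```python
import math

def build_object_graph(detections_per_frame, max_dist=20.0):
    """Link detections in contiguous frames into a graph (illustrative).

    detections_per_frame: list over frames; each entry is a list of
    (x, y) object centroids. Nodes are (frame_index, detection_index)
    pairs; an edge joins detections in adjacent frames whose centroids
    lie within max_dist, approximating frame-to-frame object identity.
    """
    edges = []
    for t in range(len(detections_per_frame) - 1):
        for i, (x1, y1) in enumerate(detections_per_frame[t]):
            for j, (x2, y2) in enumerate(detections_per_frame[t + 1]):
                if math.hypot(x2 - x1, y2 - y1) <= max_dist:
                    edges.append(((t, i), (t + 1, j)))
    return edges
```

Paths through such a graph correspond to object trajectories, which downstream behavior analysis can then examine.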
While the prior art describes event detection, it does not separate the object detection task from the structural and temporal constraint detection tasks. Nor does the prior art employ swarm-based object recognition. Thus, a need exists for a behavior recognition system that uses swarm-based optimization methods to locate objects in a scene and then uses graph-matching methods to enforce structural and temporal constraints. Such a system has advantages in terms of generality, accuracy, and speed.
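The swarm-based object localization contemplated above can be illustrated with a generic particle swarm optimization (PSO) sketch that searches image coordinates for the peak of a classifier confidence map. This is a standard PSO outline under assumed parameter names, not the cognitive-swarm method of the present invention.

```python
import random

def pso_locate(score, bounds, n_particles=30, iters=50,
               w=0.7, c1=1.4, c2=1.4):
    """Minimal particle swarm optimization over a 2-D search space.

    score: function (x, y) -> float, higher is better (e.g., a
    classifier's confidence that an object is centered at (x, y)).
    bounds: ((xmin, xmax), (ymin, ymax)). Returns the best (x, y)
    found. Generic PSO sketch for illustration only.
    """
    (xmin, xmax), (ymin, ymax) = bounds
    pos = [[random.uniform(xmin, xmax), random.uniform(ymin, ymax)]
           for _ in range(n_particles)]
    vel = [[0.0, 0.0] for _ in range(n_particles)]
    pbest = [p[:] for p in pos]              # each particle's best position
    pbest_val = [score(*p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]   # swarm's best position
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(2):
                r1, r2 = random.random(), random.random()
                # inertia + pull toward personal best + pull toward swarm best
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            pos[i][0] = min(max(pos[i][0], xmin), xmax)  # clamp to bounds
            pos[i][1] = min(max(pos[i][1], ymin), ymax)
            v = score(*pos[i])
            if v > pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], v
                if v > gbest_val:
                    gbest, gbest_val = pos[i][:], v
    return tuple(gbest)
```

Once objects are located this way, structural and temporal constraints between them would be checked by a separate graph-matching stage, which is the separation of tasks the prior art lacks.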