1. Field of the Invention
Embodiments of the invention provide techniques for analyzing a sequence of video frames. More specifically, embodiments of the invention provide a video surveillance system configured to learn to recognize complex behaviors by analyzing pixel data using alternating layers of clustering and sequencing. Such a system may learn, over time, to identify anomalous behaviors at progressively more complex levels of abstraction.
2. Description of the Related Art
Some currently available video surveillance systems provide simple object recognition capabilities. For example, a video surveillance system may be configured to distinguish between scene foreground (active elements) and scene background (static elements) depicted in a video stream. A group of pixels (referred to as a “blob”) depicting scene foreground may be identified as an active agent in the scene. Once identified, a “blob” may be tracked from frame to frame, allowing the system to follow and observe the “blob” moving through the scene over time. For example, a set of pixels depicting a person walking across the field of vision of a video surveillance camera may be identified and tracked from frame to frame.
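By way of illustration only, the foreground/background separation and “blob” identification described above might be sketched as follows. This is a minimal sketch and not the method of any particular system: it assumes grayscale frames stored as nested lists of intensities, a fixed background image, and a simple per-pixel difference threshold. The names `find_blobs`, `centroid`, and `threshold` are hypothetical and chosen for this example.

```python
from collections import deque


def find_blobs(frame, background, threshold=30):
    """Return a list of blobs, each a list of (row, col) foreground pixels.

    Assumption: a pixel counts as foreground when its intensity differs
    from the background image by more than `threshold`.
    """
    rows, cols = len(frame), len(frame[0])
    fg = [[abs(frame[r][c] - background[r][c]) > threshold
           for c in range(cols)] for r in range(rows)]
    seen = [[False] * cols for _ in range(rows)]
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if not fg[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill over 4-connected foreground pixels.
            blob, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                blob.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and fg[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            blobs.append(blob)
    return blobs


def centroid(blob):
    """Mean (row, col) position of a blob, usable as its tracked location."""
    n = len(blob)
    return (sum(p[0] for p in blob) / n, sum(p[1] for p in blob) / n)
```

Computing the centroid of each blob in successive frames and associating nearby centroids is one simple way such a blob could be followed from frame to frame.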
Some such systems may also classify a blob as being a particular agent (e.g., a person or a vehicle) as well as determine when an object has engaged in certain predefined behaviors. For example, a system may be able to identify certain simple events, such as “vehicle stopped” or “person enters vehicle.” The analysis typically includes tracking an object, assigning an object type, and analyzing its position, direction, and velocity to recognize simple events such as stop, turn, or start. A limiting factor for these systems is that the objects and actions involved need to belong to a known, small set of types. The systems involved are usually trained on a set of examples and cannot recognize new behavior types when brought on-line. The actions or events are usually directly derived from the data of the tracked object. As a result, such systems have been unable to recognize higher orders of behavior from the observations of basic or simple actions.
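The following sketch illustrates how simple events of this kind might be derived directly from the data of a tracked object. It is an assumption-laden example rather than a description of any existing product: the track is taken to be a list of (x, y) positions, one per frame, and the function name `simple_events` and the thresholds `stop_speed` and `turn_degrees` are hypothetical.

```python
import math


def simple_events(track, stop_speed=0.5, turn_degrees=45.0):
    """Return (frame_index, event) pairs for a track of (x, y) positions.

    Assumptions: per-frame displacement stands in for velocity, and the
    event vocabulary is fixed in advance to "stop", "start", and "turn".
    """
    events = []
    prev_moving = None    # unknown before the first displacement
    prev_heading = None   # last heading observed while moving, in degrees
    for i in range(1, len(track)):
        dx = track[i][0] - track[i - 1][0]
        dy = track[i][1] - track[i - 1][1]
        moving = math.hypot(dx, dy) > stop_speed
        heading = math.degrees(math.atan2(dy, dx)) if moving else None
        if prev_moving is True and not moving:
            events.append((i, "stop"))
        elif prev_moving is False and moving:
            events.append((i, "start"))
        elif moving and prev_heading is not None:
            # Signed heading change wrapped into [-180, 180).
            change = (heading - prev_heading + 180.0) % 360.0 - 180.0
            if abs(change) >= turn_degrees:
                events.append((i, "turn"))
        prev_moving = moving
        if moving:
            prev_heading = heading
    return events
```

Because the set of events is hard-coded, a sketch of this kind cannot report any behavior outside the three predefined labels, which reflects the limitation noted above.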
Some currently available systems employ statistical models such as Markov systems or Bayesian networks to analyze a scene depicted in a sequence of video frames. However, these systems have proven to be too slow for real-time use and/or require extensive hand design and parameter tuning. Thus, such systems must be carefully calibrated for a given scene, and as the scene changes, or as new or different behaviors evolve, the system needs to be recalibrated. Further, given these limitations, current systems are unable to recognize unusual or unexpected behaviors; to work in a wide variety of real-life situations; or to adapt to a changing environment.
In other words, current video analysis systems rely on predefined objects and/or behaviors to evaluate a video sequence. Unless the underlying system includes a description for a particular object or behavior, the system is generally incapable of recognizing that object or behavior (via instances of the pattern describing the particular object or behavior). Thus, what is “normal” or “anomalous” is defined in advance, and additional knowledge engineering or additional software products are required to recognize additional objects or behaviors. This results in video surveillance systems with recognition capabilities that are labor intensive and prohibitively costly to maintain or adapt for different specialized applications.