Automatically monitoring and recognizing human activities is a long sought-goal in the computer vision community. Successful implementation of a vision system with the capabilities of automatically recognizing and describing human activities enables new applications such as automatic surveillance monitoring, intelligent transportation systems, manufacture automation, and robotics. Efficiently recognizing and representing complex human activities from a captured video scene are important for such recognition systems.
Human activities in a video sequence are often too complex to be accurately recognized and represented in the real world. For example, a human activity often consists of concurrent and/or partially ordered streams of actions over time. A typical complex activity may last tens of seconds to minutes, may include several sub-activities, and may involve interaction with several objects. Some approaches for detecting short-duration actions in video characterize specific actions using the statistical feature computed over the space-time domain defined by a video segment. A typical short-duration detection method uses a modest number of action classes and an action classifier learned by clustering statistical features computed from training video sequences. The challenges faced by these approaches include a lack of notion of semantic meanings which are used to develop an interpretive context for complex activities recognition.
Motivated by success in natural language processing, methods of using stochastic context free grammars for activity recognition include an interpretive context for more complex activities. A problem with this approach is a lack of a temporal model to efficiently describe a sequence of activities over time. Furthermore, this approach only addresses a single sequence of activities while the activities in the real world often happen in parallel over time.
Other activity recognition methods apply Hidden Markov Models (HMMs) to video streams. For example, multiple HMMs may be used for distinct actions in conjunction with an object detection system to exploit relationships between specific objects and actions. In general, HMM techniques suffer when an activity consists of concurrent or partially ordered streams of actions, which is often the case in real world activities. To handle concurrent and/or partially ordered streams of actions, an HMM must enumerate an exponentially large activity space.
Another class of approaches related to Hidden Markov Models use dynamic Bayesian networks (DBNs) to model activities. DBNs leverages rich event ordering constraints to deal with missing, spurious or fragmented tracks. Some problems with DBNs are the lack of efficient modeling of relationships between partially ordered actions, and lack of scalability to large numbers of activities and lack of appropriate models of action duration.
When describing activities that happen over time, temporal frequency and duration of each activity can be powerful contextual cues. Conventional activity recognition systems like DBNs either ignore temporal modeling or use very simple model such as Gaussians. However, when Gaussian models are used to model temporal constraints for activity recognition, first, Gaussian distributions must either be learned for each action in the model, or generalized for a set of disparate actions. Learning such distributions requires time consuming labeling. Second, Gaussian models still may not provide a meaningful temporal model for actions since the semantic description of an action is often independent of whether the action is performed for a long or short time or interrupted for some indefinite period. Finally, Gaussian models do not incorporate any information about temporal relationships between actions such as occurrence rate and idle time. In general, for many cases a single distribution cannot meaningfully capture the variation in how an action is performed. The action duration can vary greatly depending on the situation and it is unrealistic to expect to have a general duration model for many actions.