Science and technology have made great strides in providing society with tools that increase productivity and decrease human workloads. In this regard, computing technologies have played a huge part in allowing control of complex procedures and machinery which facilitate work. As is typical, once a new technology becomes a “norm” or standard, society demands something better. Quite often, despite the best implementations, a user of technology can become complacent and inattentive at crucial moments of activity. This can lead to personal injuries to the operator and/or serious damage to equipment. Because current technology is largely unaware of a user's “state” or activity, it cannot foresee even what a bystander might deem an “inevitable” outcome. For example, a bystander watching a motorist who has fallen asleep at the wheel and is approaching a busy intersection would probably predict that an accident is likely to occur. However, if the vehicle were “aware” that it was being operated by a sleeping driver headed for a busy intersection, it could take steps to avert the accident by shutting off the engine, applying the brakes, and/or waking the driver in time.
In a similar fashion, if a system could anticipate a user's needs and/or desires, processes could be deployed to increase that user's productivity. Location and identity have been the most common properties considered as comprising a user's situation in “context-aware” systems. Context can also include other aspects of a user's situation, such as the user's current and past activities and intentions.
Most of the prior work on leveraging perceptual information to recognize human activities has centered on the identification of a specific type of activity in a particular scenario. Many of these techniques are targeted at recognizing single, simple events, e.g., “waving the hand” or “sitting on a chair.” Less effort has been applied to research on methods for identifying more complex patterns of human behavior, extending over longer periods of time.
Another tool utilized in determining action based on awareness is decision theory. Decision theory studies mathematical techniques for deciding between alternative courses of action. The connection between decision theory and perceptual systems (e.g., computer vision applications) received some attention from researchers in the mid-70's, but then interest faded for nearly a decade. Decision theory was utilized to characterize the behavior of vision modules (see, R. C. Bolles; Verification Vision For Programmable Assembly; In Proc. IJCAI'77, pages 569-575; 1977), to score plans of perceptual actions (see, J. D. Garvey; Perceptual Strategies For Purposive Vision; Technical Report 117; SRI International; 1976), and to score plans involving physical manipulation with the option of performing simple visual tests (see, J. A. Feldman and R. F. Sproull; Decision Theory And Artificial Intelligence II: The Hungry Monkey; Cognitive Science, 1:158-192; 1977). This early work introduced decision-theoretic techniques to the perceptual computing community.
Following this early research, there was a second wave of interest in applying decision theory in perception applications in the early 90's, largely for computer vision systems (see, H. L. Wu and A. Cameron; A Bayesian Decision Theoretic Approach For Adaptive Goal-Directed Sensing; ICCV, 90:563-567; 1990) and in particular in the area of active vision search tasks (see, R. D. Rimey; Control Of Selective Perception Using Bayes Nets And Decision Theory; Technical Report TR468; 1993).
A significant portion of work in the arena of human activity recognition from sensory information has harnessed Hidden Markov Models (HMMs) (see, L. Rabiner and B. H. Juang; Fundamentals of Speech Recognition; 1993) and extensions. Starner and Pentland (see, T. Starner and A. Pentland; Real-Time American Sign Language Recognition From Video Using Hidden Markov Models; In Proceed. of SCV'95, pages 265-270; 1995) utilize HMMs for recognizing hand movements used to relay symbols in American Sign Language. More complex models, such as Parameterized-HMMs (see, A. Wilson and A. Bobick; Recognition And Interpretation Of Parametric Gesture; In Proc. of International Conference on Computer Vision, ICCV'98, pages 329-336; 1998), Entropic-HMMs (see, M. Brand and V. Kettnaker; Discovery And Segmentation Of Activities In Video; IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8); 2000), Variable-length HMMs (see, A. Galata, N. Johnson, and D. Hogg; Learning Variable Length Markov Models Of Behaviour; International Journal on Computer Vision, IJCV, pages 398-413; 2001), Coupled-HMMs (see, M. Brand, N. Oliver, and A. Pentland; Coupled Hidden Markov Models For Complex Action Recognition; In Proc. of CVPR97, pages 994-999; 1996), structured HMMs (see, F. Bremond, S. Hongeng, and R. Nevatia; Representation And Optimal Recognition Of Human Activities; In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR'00; 2000) and context-free grammars (see, Y. Ivanov and A. Bobick; Recognition Of Visual Activities And Interactions By Stochastic Parsing; IEEE Trans. on Pattern Analysis and Machine Intelligence, TPAMI, 22(8):852-872; 2000) have been utilized to recognize more complex activities such as the interaction between two people or cars on a freeway.
In recent years, more general dependency models represented as dynamic Bayesian networks have been adopted for the modeling and recognition of human activities [see, (E. Horvitz, J. Breese, D. Heckerman, D. Hovel, and K. Rommelse; The Lumière Project: Bayesian User Modeling For Inferring The Goals And Needs Of Software Users; In Proc. of Fourteenth Conf. in Artificial Intelligence, pages 256-265; 1998), (A. Madabhushi and J. Aggarwal; A Bayesian Approach To Human Activity Recognition; In Proc. of the 2nd International Workshop on Visual Surveillance, pages 25-30; 1999), (Jesse Hoey; Hierarchical Unsupervised Learning Of Event Categories; Unpublished Manuscript; 2001), (J. H. Fernyhough, A. G. Cohn, and D. C. Hogg; Building Qualitative Event Models Automatically From Visual Input; In ICCV'98, pages 350-355; 1998), (Hilary Buxton and Shaogang Gong; Advanced Visual Surveillance Using Bayesian Networks; In International Conference on Computer Vision, pages 111-123; Cambridge, Mass.; June 1995), (Stephen S. Intille and Aaron F. Bobick; A Framework For Recognizing Multi-Agent Action From Visual Evidence; In AAAI/IAAI'99, pages 518-525; 1999), and (J. Forbes, T. Huang, K. Kanazawa, and S. Russell; The Batmobile: Towards A Bayesian Automated Taxi; In Proc. Fourteenth International Joint Conference on Artificial Intelligence, IJCAI'95; 1995)].
Finally, beyond recognizing specific gestures or patterns, dynamic Bayesian network models have been used to make inferences about the overall context of the situation of people. Recent work on probabilistic models for reasoning about a user's location, intentions, and focus of attention has highlighted opportunities for building new kinds of applications and services (see, e.g., E. Horvitz, C. Kadie, T. Paek, and D. Hovel; Models of Attention in Computing and Communications: From Principles to Applications; Communications of the ACM 46(3):52-59; March 2003 and E. Horvitz, A. Jacobs, and D. Hovel; Attention-Sensitive Alerting; In Proc. of Conf on Uncertainty in Artificial Intelligence, UAI'99, pages 305-313; 1999).
Thus, technology researchers have long been interested in the promise of performing automatic recognition of human behavior from observations. Successful recognition of human behavior is critical in a number of compelling applications, including automated visual surveillance and multimodal human-computer interaction (HCI), which considers multiple streams of information about a user's behavior and the overall context of a situation to provide appropriate control and services. There has been progress on multiple fronts. However, a number of challenges remain for developing machinery that can provide rich, human-centric notions of context in a tractable manner, without the computational burden generally imposed by these systems.
Computation for visual and acoustical analyses has typically required a large portion, if not nearly all, of the total computational resources of personal computers that make use of such perceptual inferences. It is not surprising that there is little interest in invoking such perceptual services when they require a substantial portion of the available CPU time, significantly slowing down the more primary applications that are supported and/or extended by the perceptual apparatus. Thus, the pursuit of coherent strategies for automatically limiting the analytic load of perceptual systems has steadily moved to the forefront of the technological challenges facing this field.