Computer vision-based sensing of users enables a new class of public multi-user computer interfaces. An interface such as an automated information dispensing kiosk represents a computing paradigm that differs from the conventional desktop environment and correspondingly requires a user interface that is unlike the traditional Window, Icon, Mouse and Pointer (WIMP) interface. Consequently, as user interfaces evolve and migrate off the desktop, vision-based human sensing will play an increasingly important role in human-computer interaction.
Human sensing techniques that use computer vision can play a significant role in public user interfaces for kiosk-like computerized appliances. Computer vision using unobtrusive video cameras can provide a wealth of information about users, ranging from their three-dimensional location to their facial expressions, body posture, and movements. Although vision-based human sensing has received increasing attention, relatively little work has been done on integrating this technology into functioning user interfaces.
The dynamic, unconstrained nature of a public space, such as a shopping mall, poses a challenging user interface problem for a computerized kiosk. This user interface problem can be referred to as the public user interface problem, to differentiate it from interactions that take place in a structured, single-user desktop environment. A fully automated public kiosk interface must be capable of actively initiating and terminating interactions with users. The kiosk must also be capable of dividing its resources among multiple users in an equitable manner.
The prior art technique for sensing users as applied in the Alive system is described in "Pfinder: Real-time Tracking of the Human Body," Christopher Wren, Ali Azarbayejani, Trevor Darrell, and Alex Pentland, IEEE 1996. Another prior art system is described in "Real-time Self-calibrating Stereo Person Tracking Using 3-D Shape Estimation from Blob Features," Ali Azarbayejani and Alex Pentland, ICPR January 1996.
The Alive system senses only a single user and addresses only a constrained virtual world environment. Because the user is immersed in a virtual world, the context for the interaction is straightforward and simple, and vision and graphics techniques can be employed. Sensing multiple users in an unconstrained real-world environment, and providing behavior-driven output in the context of that environment, presents more complex vision and graphics problems. These problems stem from the requirement of real-world interaction and are not addressed in prior art systems.
The Alive system fits a specific geometric shape model, such as a Gaussian ellipse, to a description representing the human user. The human shape model is referred to as a "blob." This method of describing shapes is generally inflexible. The Alive system also uses a Gaussian color model, which restricts the description of each user to one dominant color. Such a restricted color model reduces the ability of the system to distinguish among multiple users.
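The limitation of a single Gaussian color model can be illustrated with a minimal sketch. It is not the Alive system's implementation: the diagonal covariance, the variance floor, the function names, and the pixel values are all assumptions chosen for illustration. The sketch fits one Gaussian to the pixels of a hypothetical user wearing two colors and then scores individual colors against that model.

```python
from statistics import fmean, pvariance

def fit_color_gaussian(pixels):
    """Fit one Gaussian with diagonal covariance to (r, g, b) pixels.

    Illustrative assumption: channels are treated as independent.
    """
    channels = list(zip(*pixels))
    mean = tuple(fmean(c) for c in channels)
    # A floor of 1.0 keeps the variance non-zero for uniform channels.
    var = tuple(max(pvariance(c), 1.0) for c in channels)
    return mean, var

def mahalanobis_sq(pixel, model):
    """Squared distance of a pixel from the model; smaller means a better fit."""
    mean, var = model
    return sum((p - m) ** 2 / v for p, m, v in zip(pixel, mean, var))

# Hypothetical user in a red shirt and blue trousers, equal pixel counts.
red, blue = (200, 20, 20), (20, 20, 200)
model = fit_color_gaussian([red] * 50 + [blue] * 50)

# Hypothetical color worn by a different user.
purple = (110, 20, 110)

red_score = mahalanobis_sq(red, model)
purple_score = mahalanobis_sq(purple, model)
```

In this sketch the fitted mean falls midway between red and blue, so `purple_score` is lower than `red_score`: a color the first user does not wear matches the model better than the user's own clothing. This is one way a single dominant-color description can confuse users with one another.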
The system by Azarbayejani uses a self-calibrating blob stereo approach based on a Gaussian color blob model. This system shares the inflexibility of the Gaussian model. The self-calibrating aspect of this system may be applicable to a desktop setting, where a single user can tolerate the delay associated with self-calibration. In a kiosk setting, it would be preferable to calibrate the system in advance so that it functions immediately for each new user.
The prior art systems use the placement of the user's feet on the ground plane to determine the position of the user within the interaction space. This is a reasonable approach in a constrained virtual-reality environment. However, this simplistic method is not acceptable in a real-world kiosk setting, where the user's feet may not be visible due to occlusion by nearer objects in the environment. Furthermore, the requirement to detect the ground plane may not be convenient in practice because it tends to put strong constraints on the environment.
It remains desirable to have an interface paradigm for a computerized kiosk in which computer vision techniques are used not only to sense users but also to interact with them.