1. Field of the Invention
The present invention generally relates to the field of human-computer interaction and user interface technology. More particularly, the present invention relates to a system and method that determines a user's intent or choice by comparing, for example, the user's eye motion response with the computer- or software-generated animation sequence stimulus that was presented to elicit it.
2. Background Information
Human-computer interaction (HCI) is concerned with ways that humans and computers can communicate to cooperatively solve problems. HCI technology has improved dramatically since the first digital computers equipped with teletype terminals. Users first communicated with early computers through command-based text entry, and primarily through a keyboard. The introduction of the graphical user interface, the so-called Windows-Icons-Menus-Pointing (WIMP) model, introduced a spatial dimension to user interfaces and a new style of interaction, which allowed use of a "mouse" pointing device to identify objects and windows that are intended as the target of the communication, and a small set of commands, e.g., mouse clicks, that could convey a particular meaning (e.g., open this window, start this program, or get the help information on this item). In each case, the meaning of the mouse click is interpreted in the context of what user interface object or objects the mouse is currently positioned over. Today's interfaces are called "graphical user interfaces" because they replace much of the prior text-based communication between the human and the computer, with graphical objects that are capable of conveying information and accepting input to accomplish similar results that were previously accomplished by text-based methods. Nevertheless, all these approaches are "command-based," with an explicit dialog occurring between the user and the computer wherein the user issues commands to the computer to do something. In contrast, non-command-based interfaces (see, for example, NIELSEN, Jakob, "Noncommand User Interfaces," Communications of the ACM, Vol. 36, No. 1, January, 1993, pp. 83-99) passively monitor the user, collecting data, such as the recent history of interactions, head position, eye gaze position, and others so that they may better characterize the user's problem solving context, and be in a position to offer automated support that is finely focused on the current problem context. For example, if a computer program has information on what user interface object the user is currently looking at, the program can offer more detailed information about the subject represented by that object (see, for example, TOGNAZZINI, U.S. Pat. No. 5,731,805).
Current computer systems provide users a limited range of input device choices. In addition to the keyboard, personal computers and workstations invariably come equipped from the manufacturer with some kind of pointing device such as a mouse device. Users may purchase other pointing devices from a variety of third party vendors. These other pointing devices include remote mice, trackballs, forceballs, light pens, touch screen panels, and more exotic hardware like head pointing and gesture devices (e.g., data gloves). All these devices can produce output that is suitable for controlling a cursor on the screen. Alternatively, users may obtain speech recognition software that reduces or virtually eliminates the need for any kind of pointing device or keyboard.
Interacting with a command-based graphical user interface usually involves a two-step process. First, a user interface object must obtain the "focus"; second, the user must provide "consent" so that the computer can perform the action associated with the user interface object. This process may be referred to as the "focus-consent" interaction style. For example, pushing a button user interface control (object) involves first positioning the mouse cursor over the button, and then pressing the left mouse button down while maintaining the mouse cursor over the button's spatial extent. In command line (text-based) interfaces, the focus and consent are provided by typing in the desired application program or operating system commands and pressing the enter key. Effectively, the user is specifying what he/she wants the computer to do, and reviewing it for correctness before pressing the enter key.
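The focus-consent style described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Button` class, the `focus_consent` function, and the rectangle-based hit test are hypothetical names and choices introduced here, not part of any particular windowing system.

```python
from dataclasses import dataclass


@dataclass
class Button:
    """A rectangular user interface object (hypothetical, for illustration)."""
    name: str
    x: float
    y: float
    width: float
    height: float

    def contains(self, px: float, py: float) -> bool:
        # True when the point lies within the button's spatial extent.
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def focus_consent(buttons, cursor, left_button_down):
    """Return the name of the activated button, or None.

    Focus: the button under the cursor, if any.
    Consent: the left mouse button is down while the cursor is over it.
    """
    px, py = cursor
    focused = next((b for b in buttons if b.contains(px, py)), None)
    if focused is not None and left_button_down:
        return focused.name
    return None
```

With two buttons laid out side by side, a press over the first activates it, while hovering without a press, or pressing in the gap between buttons, activates nothing.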
Eye point-of-gaze (POG) was proposed for use in controlling an interface as early as the 1970s (see, for example, BOLT, Richard A., The Human Interface: Where People and Computers Meet, Lifetime Learning, London, 1984). In its early conceptualization, a computer user would fixate a particular area on a computer display, and after holding his gaze there for some set period of time (i.e., a dwell period), the computer would respond appropriately. For example, if the user stares at the letter "A" for more than, say, 0.5 seconds, it is considered to be "selected" and the computer might respond by typing an "A" into a text entry box. An application of this method would be an eye-gaze activated keyboard, which would be particularly useful for people who cannot or do not wish to use a conventional keyboard. Eye point-of-gaze has also been proposed for use in aviation displays to reduce manual workload (see, for example, CALHOUN et al., Gloria L., "Eye-Controlled Switching for Crew Station Design," Proceedings of the Human Factors Society, 28th Annual Meeting, 1984; BORAH, Joshua, "Helmet Mounted Eye Tracking for Virtual Panoramic Display Systems, Volumes I & II: Eye Tracker Specification and Design Approach," Final Report, AAMRL-TR-89-019, AAMRL, WPAFB, OH, 1989), for control of head-mounted displays (see, for example, SMYTH, U.S. Pat. No. 5,689,619), and for general purpose computer use (see, for example, JACOB, Robert J. K., "The Use of Eye Movements in Human-Computer Interaction Techniques: What You Look At is What You Get," ACM Transactions on Information Systems, Vol. 9, No. 3, April, 1991, pp. 152-169; HATFIELD et al., Franz, "Eye/Voice Mission Planning Interface," published as Armstrong Laboratory Tech. Report AL/CF-TR-1995-0204, Synthetic Environments, Inc., McLean, Va., 1995).
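The dwell-period selection described above might be sketched as follows. The function name `dwell_select` and the input format (time-ordered pairs of a timestamp in seconds and the target under the gaze, or `None` when no target is being looked at) are assumptions made for illustration; the 0.5 second default simply mirrors the example figure given above.

```python
def dwell_select(samples, dwell_threshold=0.5):
    """Return the first target fixated continuously for longer than
    dwell_threshold seconds, or None if no target qualifies.

    samples: iterable of (timestamp_seconds, target_or_None), in time order.
    """
    current = None   # target the gaze is currently resting on
    start = None     # time at which the gaze arrived on that target
    for t, target in samples:
        if target != current:
            # Gaze moved to a different target (or off all targets):
            # restart the dwell clock.
            current, start = target, t
        elif current is not None and t - start > dwell_threshold:
            return current
    return None
```

A gaze that rests on "A" from 0.0 s through 0.6 s selects "A"; a gaze that alternates between "A" and "B" every 0.3 s never accumulates enough dwell on either and selects nothing.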
Performance Constraints Imposed by the Human Visual System
The human visual system produces several different types of eye movement which are summarized in BORAH. In scanning a visual scene, a subject's eye movement typically consists of a series of stops at visual targets, called fixations, and rapid jumps between targets, called saccades. Visual information is acquired primarily during fixations, which typically last at least 200 milliseconds. Saccades last from 30 to 120 milliseconds and reach velocities of 400 to 600 degrees per second. During fixations, the eyes exhibit several types of involuntary motion, including microsaccades (also called flicks), drifts and tremor. Microsaccades, the source of the greatest movement, serve the purpose of re-centering an image on the fovea. This involuntary motion is usually less than one degree of visual angle, but is important for the design of eye-tracking systems and user interfaces that make use of eye-tracking input. Because of this involuntary motion, if a user is attempting to fixate a visual target that is located close to another target, it may be difficult for the eye-tracking system to adequately discriminate which target is intended, simply because the observed phenomenon--involuntary eye motion--cannot be controlled. Given both the inherent behavior of the human visual system, and the accuracy of currently engineered eye-tracking systems, users may be required to stare at the intended target sufficiently long so that the eye-tracking system can disambiguate the intended target. This is usually accomplished by some kind of averaging process, so that the centroid of observations over some period of time gives a reasonable estimate of what the user is fixating. One way to alleviate the staring burden on the user is to make visual targets larger, but this has the negative effect of requiring additional display space. In aviation applications, as an example, display space is scarce and comes at a premium. 
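The averaging process mentioned above can be illustrated with a short Python sketch: the centroid of the gaze samples collected over some period is computed, and the target whose center lies nearest that centroid is taken as the intended one. The function names and the use of target center points with Euclidean distance are assumptions for illustration only, not a description of any specific eye-tracking system.

```python
import math


def centroid(points):
    """Mean (x, y) position of a non-empty sequence of gaze samples."""
    xs, ys = zip(*points)
    return sum(xs) / len(xs), sum(ys) / len(ys)


def nearest_target(gaze_samples, targets):
    """Name of the target whose center is closest to the gaze centroid.

    gaze_samples: sequence of (x, y) points of gaze, e.g. in degrees.
    targets: dict mapping a target name to its (x, y) center.
    """
    cx, cy = centroid(gaze_samples)
    return min(
        targets,
        key=lambda name: math.hypot(targets[name][0] - cx,
                                    targets[name][1] - cy),
    )
```

With two targets one degree apart, a single sample displaced by involuntary motion can fall nearer the wrong target, whereas the centroid of several samples falls nearer the intended one. The cost is the time needed to collect those samples, which is the staring burden noted above.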
The prospects for achieving better discrimination of stationary targets with better eye-tracking accuracy are not promising, since the accuracy levels of current art systems are already at about 0.5 degrees of visual angle. Since the involuntary motion of the eye may approach 1.0 degree as discussed above, current system accuracy levels are already working within the performance envelope of the human visual system.
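To give a sense of scale for these angles, the following sketch converts a visual angle into a physical extent on a display using the standard geometric relation size = 2 x distance x tan(angle / 2). The viewing distance of 60 cm used in the usage note is purely an assumed example value, and the function name is hypothetical.

```python
import math


def visual_angle_to_size(angle_degrees, viewing_distance):
    """Extent on the display subtended by a visual angle.

    The result is in the same unit as viewing_distance.
    """
    half_angle = math.radians(angle_degrees) / 2.0
    return 2.0 * viewing_distance * math.tan(half_angle)
```

At an assumed viewing distance of 60 cm, 1.0 degree of involuntary eye motion corresponds to roughly 1 cm on the display, and 0.5 degrees of tracker accuracy to roughly 0.5 cm, which suggests why closely spaced stationary targets are hard to tell apart.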
As further disclosed herein, the present invention entails setting a computer-generated graphical object into motion, a form of computer animation. Animation has been used in computer interfaces primarily to convey information to the user, rather than as a device for eliciting information from the user. Animation has been used to impart understanding and maintain user interest, as well as providing entertainment value. Animation has been used to draw interest to a particular item on a computer display, after which the user is expected to undertake some overt action with an input device. For example, video games use animation to stimulate a response from a player, e.g., shooting at a target. But the user input stimulated is not to express a choice or necessarily control what happens next in the game, but to give the user the opportunity to show and/or improve his/her skill and enable the user to derive enjoyment from the contest. Animation has not been directly used as a means for eliciting choice information from the user.
Problems in Using Eye Gaze in an Interface
There are several problems in conventional systems with using eye-gaze to control a computer interface. First, the current art relies on extended, unnatural dwell times in order to cause a user interface object to gain focus. Second, in the current art, if a user succeeds in causing a user interface object to gain the focus, some other form of input is required to provide consent. While the consent may be provided by eye blink, or by requiring an even greater dwell period over the focal user interface object, this is difficult and unnatural.
When a computer user moves the cursor over a graphical user interface object, the user has nominally four (4) different actions that could be accomplished with a two-button mouse, and six (6) different actions that could be accomplished with a three-button mouse, allowing both single and double clicks. The number goes higher if chorded keys are allowed, e.g., clicking the left mouse button while holding the control key down. However, when eye point-of-gaze is used alone, the range of expressions that are possible, compared to a mouse, is relatively limited. Simply by detecting that a user is fixating an object, it may not be clear what the user's intent or selection is. Presumably, one could map the length of dwell time into a small set of actions to obtain some discrimination. For example, if the user maintains point-of-gaze for up to 0.5 seconds, action 1 would be inferred; for 0.5 to 1.0 seconds, action 2 would be inferred; and for dwell times greater than 1.0 seconds, action 3 would be inferred. The main problem with this method is that it is difficult for humans to control their point-of-gaze with any accuracy for even these relatively short periods of time. In addition, in tasks requiring high performance and possibly parallel operations (e.g., speech and visual tasks), the cognitive effort required to maintain point-of-gaze reduces the time available to devote to other tasks, e.g., scanning other areas of the display, performing manual actions or formulating speech responses. As a result of these limitations, when eye gaze is used in an interface, it is usually desirable, if not absolutely necessary, to include some other input modality, e.g., eye blink, voice, switch, or key press input to complete the dialog with the computer (see, for example, HATFIELD et al.; CALHOUN et al.; JACOB; SMYTH).
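The dwell-time mapping hypothesized above can be written out as a small Python sketch. The handling of the boundaries is an assumption: the first bin is read here as "up to and including 0.5 seconds" and the second as "above 0.5 and up to 1.0 seconds", since the text does not say which action applies at exactly 0.5 or 1.0 seconds. The function name is hypothetical.

```python
def action_from_dwell(dwell_seconds):
    """Map a dwell time in seconds to action 1, 2, or 3.

    Assumed bins: (0, 0.5] -> 1, (0.5, 1.0] -> 2, above 1.0 -> 3.
    """
    if dwell_seconds <= 0:
        raise ValueError("dwell time must be positive")
    if dwell_seconds <= 0.5:
        return 1
    if dwell_seconds <= 1.0:
        return 2
    return 3
```

The sketch makes the weakness of the scheme visible: a dwell of 0.5 seconds and one of 0.51 seconds yield different actions, yet a user cannot reliably hold a point-of-gaze to that precision.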