Over the last decade, eyetracking systems have become increasingly sophisticated, and several are currently available as commercial products. These systems are typically based on one of three methods: video oculography (VOG), infrared oculography (IROG), and electro-oculography (EOG). The VOG method calculates eye position based on the relative position of the pupil and a glint reflected from the cornea. This method uses a video camera to capture images of the eye, an infrared emitter to produce a point reflection in the eye images, and image processing methods to extract geometrical information from the images. The IROG method calculates eye position based on the amount of reflected light. This method uses mounted infrared emitters to produce reflections off the eye, detectors that measure the reflected light, and algorithms that determine eye position from the distribution of light across the geometrically located detectors. The EOG method calculates eye position based on electrical signals picked up by electrodes placed around the eye. This method uses the signals from the geometrically positioned electrodes and algorithms that determine eye position from the distribution of signals across the electrodes. Each of these three methods typically locates eye position as an (x,y) point on a viewing plane or as an (x,y,z) point in 3-space, producing one sample every 1/30 second or faster.
Despite significant progress, widespread application of the technology has not become a reality. In part, this is due to the persistently high cost of eyetracker devices. An additional barrier, however, is the difficulty of mapping the sequence of fixation points produced by an eyetracker to the objects that are present in the individual's field of view, referred to here as the point-to-object problem. Because of the point-to-object problem, successful use of eyetracker technology often requires manual frame-by-frame identification of fixated objects [Jacob & Karn, 2003]. Although this is adequate for some research applications, it is too costly and too cumbersome for widespread use.
Automating the mapping of points to objects has proven difficult. Generally, attempts have fallen into two classes. One class uses image-processing techniques to segment objects from a captured video sequence that corresponds to the field of view. The other class uses features of the application to make the identification of objects easier. Although approaches from both classes have had some success, the point-to-object problem still lacks a general solution.
The development of algorithms that automatically segment objects from a captured rasterized image stream has received much attention in the research literature. Despite extensive work, general-purpose, high-performance segmentation algorithms have proven elusive. Current segmentation algorithms often require custom development and remain error prone. Many of the current methods and techniques for automatic segmentation are described in [Fu & Mui, 1981], [Haralick & Shapiro, 1985], and [Pal & Pal, 1993]. An example of the segmentation approach is found in U.S. Pat. No. 6,803,887, which discloses a method for supplying a mobile user with service information relating to real-world objects. This method relies on, and assumes, segmentation of real-world objects captured in raster format. It does not consider objects that are rendered on a computer display.
Applications that solve the point-to-object problem using problem features or problem simplifications have been reported in the research literature on gaze-responsive interfaces and gaze-responsive media. An approximate method that uses ray casting and a bounding sphere test was reported for an interactive fiction system in which links that provide additional information are activated by eye gaze [Starker and Bolt, 1990]. Methods for gaze-based icon selection were reviewed in [Jacob, 1993]. The description of EagleEyes, an eye control system for disabled users, includes a control method based on clickable areas and a game, EyeVenture, in which looking at objects causes a video to be played or a voice to give instructions or a clue [Gips and Olivieri, 1996]. An application that identifies chessboard squares as viewed objects in a chess game was discussed in [Spakov, 2005].
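The ray casting and bounding sphere test mentioned above can be illustrated with a minimal sketch. This is not the implementation reported in [Starker and Bolt, 1990]; it assumes that the gaze is available as a ray (an origin and a direction in 3-space) and that each object is approximated by a sphere. The names ray_hits_sphere and gazed_object are hypothetical.

```python
import math

def ray_hits_sphere(origin, direction, center, radius):
    """Return True if the ray from `origin` along `direction` meets the sphere."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be a non-zero vector")
    unit = [d / norm for d in direction]
    # Vector from the ray origin to the sphere center.
    to_center = [c - o for c, o in zip(center, origin)]
    dist_sq = sum(v * v for v in to_center)
    # Signed length of the projection of that vector onto the ray.
    along = sum(v * u for v, u in zip(to_center, unit))
    if along < 0:
        # Center lies behind the viewer: a hit only if the origin is inside.
        return dist_sq <= radius * radius
    # Squared perpendicular distance from the sphere center to the ray.
    perp_sq = dist_sq - along * along
    return perp_sq <= radius * radius

def gazed_object(origin, direction, spheres):
    """Return the name of the nearest bounding sphere hit by the gaze ray.

    `spheres` maps an object name to a (center, radius) pair (assumed layout).
    Nearness is approximated by the distance from the origin to the center.
    """
    best_name, best_dist = None, math.inf
    for name, (center, radius) in spheres.items():
        if ray_hits_sphere(origin, direction, center, radius):
            dist = math.dist(origin, center)
            if dist < best_dist:
                best_name, best_dist = name, dist
    return best_name
```

The test is approximate because a bounding sphere usually encloses more volume than the object itself, so a gaze ray that passes close to an object may be reported as a hit.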
Several methods for solving the point-to-object problem using problem features have been disclosed in the U.S. patent literature. U.S. Pat. No. 4,789,235 discloses a method and system for monitoring people watching television commercials. The disclosed method relies on the use of visual areas of interest with fixed locations on different television scenes. U.S. Pat. No. 5,898,423 discloses a method that detects an intersection of the gaze position with the image displayed on the display device. The disclosed method relies on the fact that, at any point in time, an image location is fixed and known, as is the case in the display of text images. U.S. Pat. No. 6,601,021 discloses a method for mapping fixation points to objects (elements-of-regard) in a display of dynamic hypermedia pages through a browser that is connected to the World Wide Web. The invention discloses a mapping tool that maps fixations onto restored web pages to identify fixated objects that are contained within the web pages. The mapping tool identifies elements of restored web pages by accessing the object model interface of the browser, which provides the content information for each of the rendered objects (elements). The objects are generally HTML elements that the browser renders and include text, images, hyperlinks, buttons, input boxes, etc. This method relies on the fact that objects in restored web pages have known locations.
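The methods above share one assumption: object locations are known at the time of the fixation. Under that assumption, mapping a fixation point to an object reduces to a hit test, sketched below. This is an illustration only and is not taken from any of the cited patents; the Element class, its fields, and the back-to-front ordering of the element list are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass(frozen=True)
class Element:
    """A rendered object with a known axis-aligned bounding box (in pixels)."""
    name: str
    left: float
    top: float
    width: float
    height: float

    def contains(self, x: float, y: float) -> bool:
        # Half-open bounds so that adjacent boxes do not both claim an edge.
        return (self.left <= x < self.left + self.width
                and self.top <= y < self.top + self.height)

def element_at(x: float, y: float,
               elements: Sequence[Element]) -> Optional[Element]:
    """Return the topmost element under the fixation point, or None.

    `elements` is assumed to be ordered back to front, so the last match
    in the sequence is the one drawn on top.
    """
    for element in reversed(elements):
        if element.contains(x, y):
            return element
    return None
```

For example, with a page-sized element listed first and a hyperlink listed after it, a fixation inside the hyperlink's box resolves to the hyperlink rather than the page. The sketch does not address objects whose locations change between frames.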
The more general problem of mapping fixation regions to objects (the region-to-object problem) has received limited attention in the literature. U.S. Pat. No. 6,803,887 describes a mobile device that locates objects in regions using predefined eye position values. U.S. Pat. No. 7,029,121 uses eyetracking data to determine whether an individual has looked at a particular region of a visual field, but does not identify objects within the region. The EPIC cognitive architecture includes a visual processor with three visual regions (fovea, typical radius 1°; parafovea, typical radius 10°; and periphery, typical radius 60°) that models the recognition of objects within these regions [Kieras et al., 1997].
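To make the region radii above concrete, the following sketch converts a visual angle into an on-screen radius and classifies an object by its eccentricity from the fixation point. It is not part of EPIC or of the cited patents. It assumes a flat display viewed head-on at the fixation point, and the viewing distance and pixel density used are hypothetical parameters, as are the function names.

```python
import math

# Typical region radii in degrees of visual angle, as quoted above.
REGIONS = (("fovea", 1.0), ("parafovea", 10.0), ("periphery", 60.0))

def region_radius_px(angle_deg, viewing_distance_cm, pixels_per_cm):
    """On-screen radius, in pixels, subtended by a visual angle."""
    radius_cm = viewing_distance_cm * math.tan(math.radians(angle_deg))
    return radius_cm * pixels_per_cm

def classify_region(fixation, obj_center, viewing_distance_cm, pixels_per_cm):
    """Name the visual region in which an object center falls.

    `fixation` and `obj_center` are (x, y) screen positions in pixels.
    Returns "outside" when the object lies beyond the largest region.
    """
    offset_px = math.hypot(obj_center[0] - fixation[0],
                           obj_center[1] - fixation[1])
    offset_cm = offset_px / pixels_per_cm
    # Eccentricity: the angle between the line of sight and the object.
    eccentricity = math.degrees(math.atan2(offset_cm, viewing_distance_cm))
    for name, limit_deg in REGIONS:
        if eccentricity <= limit_deg:
            return name
    return "outside"
```

At an assumed viewing distance of 60 cm and 40 pixels per centimeter, the 1° foveal radius corresponds to roughly 42 pixels, which indicates how few on-screen objects fall within the fovea during a single fixation.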
Many application areas require an understanding of when objects are fixated in order to explore and exploit the subtle differences found in the problem-solving methods of individuals. These areas could greatly benefit from a robust, real-time solution to the region-to-object problem in dynamic graphical environments that simulate the real world. Such a solution would enable the construction of eye events that capture a user's visual interaction with a display screen in computer applications that use a graphical user interface to simulate the real world and to communicate with the user. In advanced training systems, eye events would enable the measurement of how an individual maintains situation awareness in complex problem-solving tasks. In the area of cognitive science, eye events would enable the exploration of how an individual allocates critical cognitive resources while operating in a multi-tasking problem-solving environment. In vehicle systems, eye events would enable recording of operator visual interactions with the environment for later analysis directed at understanding the operator's cognitive activity during a critical operational time period. In interface evaluation, eye events would allow a more detailed understanding of how computer users interact with a computer application under study in time-critical or production settings. In computer games, eye events would allow the development of games in which the behaviors of objects within the game are derived from the movement of the game player's eyes. In cooperative interfaces, eye events would enable the development of interfaces that assist the human operator in solving complex problems by observing the operator's behavior and redirecting the operator to time-critical needs and/or invoking intelligent agents (or processes) that would synergistically solve other parts of the problem.