Eye-tracking has a well-established history of revealing valuable information about visual perception and more broadly about cognitive processes. Within this field of research, the objective is often to examine how an observer visually engages with the content or layout of an environment. When the observer's head is stationary (or accurately tracked) and the stimuli are static (or their motion over time is recorded), commercial systems exist that are capable of automatically extracting gaze behavior in scene coordinates. Outside the laboratory, where observers are free to move through dynamic environments, the lack of constraints precludes the use of most existing automatic methods.
A variety of solutions have been proposed and implemented to overcome this issue. One approach (“FixTag”) uses ray tracing to estimate fixation on three-dimensional (3D) volumes of interest. In this scheme, a calibrated scene camera is used to track features across frames, allowing the 3D camera motion to be recovered. From this, points in the two-dimensional (2D) image plane can be mapped, via the camera's intrinsic parameters, into the scene camera's 3D coordinate system. This allows for accurate ray tracing from a known origin relative to the scene camera. While this method has been shown to be accurate, it has limitations. Critically, it requires an accurate and complete a priori map of the environment in order to relate object identities with fixated volumes of interest. In addition, all data collection must be performed with a carefully calibrated scene camera, and the algorithm is computationally intensive.
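The back-projection step underlying this kind of approach (mapping a 2D gaze point into the scene camera's 3D coordinate frame) can be sketched with a standard pinhole camera model. The function name and example intrinsic values below are illustrative assumptions, not details of the FixTag implementation itself:

```python
import numpy as np

def pixel_to_ray(u, v, K):
    """Back-project a 2D image point (u, v) into a unit-length 3D ray
    direction in the camera's coordinate frame, given the intrinsic
    matrix K (focal lengths and principal point)."""
    p = np.array([u, v, 1.0])          # homogeneous pixel coordinates
    d = np.linalg.inv(K) @ p           # normalized camera coordinates
    return d / np.linalg.norm(d)       # unit ray direction

# Hypothetical intrinsics: 800 px focal length, principal point at
# the center of a 640x480 image.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

ray = pixel_to_ray(320.0, 240.0, K)    # gaze at the principal point
# → array([0., 0., 1.]) — the ray points straight along the optical axis
```

Given the camera's pose recovered from feature tracking, such a ray can then be expressed in world coordinates and intersected with volumes of interest, which is the essence of the ray-tracing step described above.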
Another proposed method is based on Simultaneous Localization and Mapping (SLAM) algorithms originally developed for mobile robotics applications. Like FixTag, current implementations of SLAM-based analyses require that the environment be mapped before analysis begins, and they are brittle to changes in scene layout, precluding their use in novel or dynamic environments.