There is a need for eye gaze tracking applications and gaze-based human-computer interaction on dynamic platforms such as driver monitoring systems and handheld devices. For an automobile driver, eye-based cues such as the level of gaze variation, the speed of eyelid movements and eye closure can be indicative of the driver's cognitive state. These cues can be useful inputs for intelligent vehicles to understand driver attentiveness levels, lane change intent, and vehicle control in the presence of obstacles, so as to avoid accidents. Handheld devices such as smartphones and tablets may also employ gaze tracking applications in which gaze is used as an input modality for device control, activating safety features and controlling user interfaces.
The most challenging aspect of such gaze applications is operation under dynamic user conditions and in unconstrained environments. Further requirements for implementing a consumer-grade gaze tracking system include real-time, high-accuracy operation, minimal or no calibration, and robustness to user head movements and varied lighting conditions.
Traditionally, gaze estimation has been done using architectures based on the reflection of screen light on the eye, where the resulting corneal reflections can be used to estimate the point of gaze.
Neural networks have also been applied to the problem. S. Baluja and D. Pomerleau, “Non-intrusive gaze tracking using artificial neural networks,” Pittsburgh, Pa., USA, Tech. Rep., 1994 discloses using a neural network to map low quality cropped eye images to gaze coordinates.
Kyle Krafka, Aditya Khosla, Petr Kellnhofer, Harini Kannan, Suchendra Bhandarkar, Wojciech Matusik, Antonio Torralba, “Eye Tracking for Everyone” discloses an appearance-based convolutional neural network (CNN) model that uses face landmarks to crop an image into left and right eye regions. The eye regions and the face are then passed to distinct neural networks whose outputs feed shared fully connected layers to provide a gaze prediction.
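By way of illustration only, the general flow of such a multi-branch approach may be sketched as follows. The sketch is not taken from Krafka; the function names `square_box`, `crop_branches` and `fuse`, the landmark dictionary layout, the margin value and the single dense layer standing in for the shared fully connected layers are all assumptions made for the example, and the per-branch networks that would turn each crop into a feature vector are omitted.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive
BRANCHES = ("left_eye", "right_eye", "face")


def square_box(points: Sequence[Point], margin: float, width: int, height: int) -> Box:
    """Square box centred on the landmarks, enlarged by `margin`, clipped to the image."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    cx = (min(xs) + max(xs)) / 2.0
    cy = (min(ys) + max(ys)) / 2.0
    half = max(max(xs) - min(xs), max(ys) - min(ys)) * (1.0 + margin) / 2.0
    return (
        max(0, math.floor(cx - half)),
        max(0, math.floor(cy - half)),
        min(width, math.ceil(cx + half)),
        min(height, math.ceil(cy + half)),
    )


def crop(image: List[List[int]], box: Box) -> List[List[int]]:
    """Cut the rows and columns covered by `box` out of a row-major image."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def crop_branches(
    image: List[List[int]],
    landmarks: Dict[str, List[Point]],
    eye_margin: float = 0.5,  # assumed padding around each eye
) -> Dict[str, List[List[int]]]:
    """Produce one crop per branch: left eye, right eye and whole face."""
    height, width = len(image), len(image[0])
    all_points = [p for group in landmarks.values() for p in group]
    return {
        "left_eye": crop(image, square_box(landmarks["left_eye"], eye_margin, width, height)),
        "right_eye": crop(image, square_box(landmarks["right_eye"], eye_margin, width, height)),
        "face": crop(image, square_box(all_points, 0.0, width, height)),
    }


def fuse(
    branch_features: Dict[str, Sequence[float]],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> List[float]:
    """Concatenate the per-branch feature vectors and apply one shared dense layer."""
    joined = [v for name in BRANCHES for v in branch_features[name]]
    return [sum(w * v for w, v in zip(row, joined)) + b for row, b in zip(weights, bias)]
```

In such a sketch, each crop returned by `crop_branches` would be passed through its own (omitted) network, and the resulting feature vectors would be combined by `fuse`, whose two outputs could be read as gaze coordinates.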
Similarly, M. Kim, O. Wang and N. Ng, “Convolutional Neural Network Architectures for Gaze Estimation on Mobile Devices”, Stanford Reports, 2017, referring to Krafka, also uses separate eye regions extracted from a face region, as well as a histogram of gradients map, to provide a gaze prediction.
Rizwan Ali Naqvi, Muhammad Arsalan, Ganbayar Batchuluun, Hyo Sik Yoon and Kang Ryoung Park, “Deep Learning-Based Gaze Detection System for Automobile Drivers Using a NIR Camera Sensor”, Sensors 2018, 18, 456 discloses capturing a driver's frontal image, detecting face landmarks using a facial feature tracker, obtaining face, left eye and right eye images, calculating three distances based on three sets of feature vectors, and classifying a gaze zone based on the three distances.
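Purely as an illustrative sketch of distance-based gaze zone classification, and not as a reproduction of the method of Naqvi, the following assumes that each candidate gaze zone has stored reference feature vectors for the face, left eye and right eye, that the three distances are Euclidean, and that they are combined by summation. The names `three_distances` and `classify_zone`, the zone labels and the combination rule are hypothetical.

```python
import math
from typing import Dict, Sequence

REGIONS = ("face", "left_eye", "right_eye")
Features = Dict[str, Sequence[float]]  # one feature vector per region


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def three_distances(sample: Features, reference: Features) -> Dict[str, float]:
    """One distance per region: face, left eye and right eye."""
    return {r: euclidean(sample[r], reference[r]) for r in REGIONS}


def classify_zone(sample: Features, zone_references: Dict[str, Features]) -> str:
    """Return the zone whose summed three distances are smallest (assumed rule)."""
    scores = {
        zone: sum(three_distances(sample, ref).values())
        for zone, ref in zone_references.items()
    }
    return min(scores, key=scores.get)
```

Given reference vectors for, say, a windshield zone and a left mirror zone, `classify_zone` would return the label of whichever zone lies closest to the sample across the three regions taken together.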
X. Zhang, Y. Sugano, M. Fritz, and A. Bulling in both “Appearance-based gaze estimation in the wild,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 4511-4520 and “MPIIGaze: Real-World Dataset and Deep Appearance-Based Gaze Estimation” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, disclose using face detection and facial landmark detection methods to locate landmarks in an input image obtained from a calibrated monocular RGB camera. A generic 3D facial shape model is fitted to estimate a 3D pose of a detected face and to crop and warp the head pose and eye images to a normalised training space. A CNN is used to learn the mapping from the head poses and eye images to gaze directions in the camera coordinate system.