The present disclosure relates to object detection.
Human activity understanding by machines is important for reliable human-machine interaction and human behavior prediction. In vehicles, for example, accurately interpreting driver activity, such as interactions between the driver's hands and objects in the vehicle (e.g., cellphones, coffee cups, cigarettes, controls, and compartments), would allow for many improvements in vehicle safety and performance. However, configuring a machine to understand human activity is not a trivial endeavor. Various existing solutions, such as existing touch-based and vision-based systems, are unsuitable for dynamic environments where real-time detection may be needed because they are too slow or too computationally expensive (requiring expensive hardware upgrades to function properly).
Some approaches detect a driver's hands when placed on a steering wheel. These systems may use touch-based hand detection, which can identify the presence and location of the driver's hands on the steering wheel. In some cases, a sensor mat that includes multiple sensor loops arranged on the steering wheel may be used to provide multiple sensing zones. However, these approaches are limited because they generally can detect the driver's hands only when the hands are placed on the steering wheel, and fail to detect the hands in other locations during operation of the vehicle, such as when the hands are moving in the air or resting on the driver's lap. Further, these approaches are generally unable to differentiate hands from other moveable body parts.
Some approaches use vision-based hand detection algorithms. For instance, Mittal, Arpit, Andrew Zisserman, and Philip Torr, “Hand Detection Using Multiple Proposals,” Proceedings of the British Machine Vision Conference 2011 (2011), describe a vision-based hand detection algorithm that uses a camera to capture an image of the scene and identifies the hands present in the image. In particular, three hand detectors are used to respectively determine 1) texture information to capture the shape of the hand; 2) context information to capture nearby body parts, such as a forearm; and 3) color information to capture the distinctive skin color. The outputs of these detectors are combined to make a final determination. However, the algorithm is limited because it is unable to provide any information about the interaction between the driver's hands and car components. Additionally, the algorithm is heavily color and texture based and, as a result, is unable to function suitably in more difficult conditions, such as under changing lighting or when the hands are occluded or significantly deformed. The algorithm is also comparatively slow, as each image requires about 2 minutes of processing time, which is unsuitable for dynamic, real-time systems, such as those used in cars and other vehicles.
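By way of illustration only, combining the outputs of several detectors into a final determination can be sketched as a weighted score fusion. The sketch below is not the method of the cited reference; the `Proposal` type, the weights, the threshold, and the example scores are all hypothetical and chosen only to show the general idea.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    """A candidate hand region with one (hypothetical) score per detector."""
    box: tuple      # (x, y, width, height) in pixels
    texture: float  # shape/texture score in [0, 1]
    context: float  # nearby-body-part (e.g., forearm) score in [0, 1]
    color: float    # skin-color score in [0, 1]


def fuse(p, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of the three scores; the weights are assumed, not sourced."""
    w_texture, w_context, w_color = weights
    return w_texture * p.texture + w_context * p.context + w_color * p.color


def detect_hands(proposals, threshold=0.5):
    """Keep the boxes whose fused score reaches the (assumed) threshold."""
    return [p.box for p in proposals if fuse(p) >= threshold]


proposals = [
    Proposal((10, 20, 40, 40), texture=0.9, context=0.8, color=0.7),
    # Skin-colored region with weak shape and context evidence.
    Proposal((200, 50, 30, 30), texture=0.2, context=0.1, color=0.9),
]
hands = detect_hands(proposals)
```

In this toy input only the first proposal is retained, since the second has a strong color score but weak texture and context scores. Regarding speed, at about 2 minutes per image the throughput is roughly 1/120, or about 0.008, frames per second.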
Another approach, described by Ohn-Bar, Eshed, and Mohan Trivedi, “In-vehicle Hand Activity Recognition Using Integration of Regions,” 2013 IEEE Intelligent Vehicles Symposium (IV) (2013), proposes a vision-based region-of-interest (ROI) hand detection algorithm for in-vehicle hand activity recognition. By using color, texture, and global statistic features of the color and depth images, the algorithm can determine the number of hands present in each fixed ROI. However, because the algorithm is ROI-based, it is unable to provide the refined location of a hand within each ROI. Also, as with other solutions, it is heavily color and texture based and unable to function suitably in more difficult or changing lighting conditions. Further, the algorithm utilizes a fixed ROI configuration from a fixed viewpoint (behind the driver's head) and has a large footprint because of the Kinect™ sensor used, which makes it difficult to generalize to other sensor configurations and vehicle types (e.g., a different steering wheel shape). The algorithm also runs at a relatively slow frame rate of about 2 FPS, which is unsuitable for dynamic, real-time systems, such as those used in cars and other vehicles.
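The loss of location detail inherent in an ROI-level output can be illustrated with a minimal sketch. The sketch models only the form of the output (a hand count per fixed region), not the cited algorithm, which infers counts from image features; the region names, coordinates, and hand positions are hypothetical.

```python
# Hypothetical fixed regions: name -> (x_min, y_min, x_max, y_max) in pixels.
ROIS = {
    "wheel": (0, 0, 200, 200),
    "gear": (200, 0, 400, 200),
}


def count_hands_per_roi(hand_centers, rois=ROIS):
    """Return the number of hand centers that fall inside each fixed region."""
    counts = {name: 0 for name in rois}
    for x, y in hand_centers:
        for name, (x0, y0, x1, y1) in rois.items():
            if x0 <= x < x1 and y0 <= y < y1:
                counts[name] += 1
                break  # regions do not overlap in this sketch
    return counts


# Two different hand placements within the same regions.
counts_a = count_hands_per_roi([(10, 10), (250, 100)])
counts_b = count_hands_per_roi([(190, 190), (390, 20)])
```

Both placements yield the same counts (one hand in each region), even though the hands are in quite different positions, which is why a count per ROI cannot convey where within a region a hand is located.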
Some further approaches use a sliding window technique to perform hand detection. For example, Das, Nikhil, Eshed Ohn-Bar, and Mohan M. Trivedi, “On Performance Evaluation of Driver Hand Detection Algorithms: Challenges, Dataset, and Metrics,” 2015 IEEE 18th International Conference on Intelligent Transportation Systems (2015), describe a baseline algorithm that leverages color and texture features and a sliding window detection framework, and that can give the locations of hands in the form of fitted bounding boxes. While the dataset used by the algorithm attempts to address challenges that are observed in naturalistic driving settings, the algorithm is heavily color based and is therefore unable to effectively differentiate hands from other body parts (e.g., faces and forearms) that generally have the same color. Further, as with other solutions, the algorithm fails to accommodate difficult lighting conditions or significant lighting changes, as well as hands that are deformed or occluded. Further, the algorithm, and the other above-discussed solutions, are configured to utilize high-level, generalized information about the road, which lacks granularity. As a result, these solutions are unable to account for finer roadway detail, such as fine-grained (e.g., street level) detail, when a driver is presented with a particular driving situation.
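A sliding window framework of this general kind can be sketched as follows. The sketch is a simplified illustration and not the baseline of the cited reference: it scores each window using only a binary skin-color mask (a stand-in for the color and texture features), and the window size, stride, threshold, and toy mask are assumptions.

```python
def sliding_windows(width, height, win, stride):
    """Yield square windows (x, y, w, h) that fit fully inside the image."""
    for y in range(0, height - win + 1, stride):
        for x in range(0, width - win + 1, stride):
            yield (x, y, win, win)


def skin_fraction(mask, box):
    """Fraction of pixels inside the box that are marked as skin-colored."""
    x, y, w, h = box
    total = sum(mask[r][c] for r in range(y, y + h) for c in range(x, x + w))
    return total / (w * h)


def detect(mask, win=2, stride=1, threshold=0.75):
    """Return every window whose skin fraction reaches the threshold."""
    height, width = len(mask), len(mask[0])
    return [
        box
        for box in sliding_windows(width, height, win, stride)
        if skin_fraction(mask, box) >= threshold
    ]


# Toy skin mask with two skin-colored blobs; one could be a hand and the
# other a face or forearm, but the mask alone cannot tell them apart.
mask = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
boxes = detect(mask)
```

Both skin-colored blobs are returned as detections, which illustrates why a detector that relies primarily on color has difficulty separating hands from other body parts of similar color.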