Various technologies for sensing human-object interactions, such as three-dimensional (3D) sensing technologies, have been investigated to improve the performance of tasks such as object detection, object recognition, and image segmentation. For example, interactions of a healthcare professional with medicines and medical instruments during a medical procedure need to be accurately determined to track whether adequate healthcare is being provided to a patient. Similarly, activities of a passenger in a surveillance video may be recognized through the passenger's interactions with various objects in a transportation environment.
3D images typically include a combination of depth and color information to represent features such as edges, lines, corners, and shapes of various image objects. Color information refers to RGB (Red, Green, Blue) data for object features, which may be defined using a variety of techniques such as scale-invariant feature transform (SIFT), histograms of oriented gradients (HoG), and speeded-up robust features (SURF) interest points. Depth information provides geometrical cues or estimates about the object features relative to a viewpoint, such as that of a camera. Such geometrical cues are invariant to lighting or color variations, and therefore allow better separation of object features from the background. Such 3D images, having both depth and color information, are also referred to as RGB-D images, each being an aggregation of an RGB image and a depth image (or depth map).
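By way of illustration only, the aggregation of an RGB image and a depth map into an RGB-D image may be sketched as follows. The function name `make_rgbd`, the scaling of color values to [0, 1], and the assumption that depth is expressed in metres are hypothetical choices made for this sketch and are not taken from the foregoing description.

```python
import numpy as np


def make_rgbd(rgb, depth):
    """Stack an H x W x 3 color image and an H x W depth map into H x W x 4.

    Hypothetical helper: the name, the [0, 1] color scaling, and the depth
    units (assumed metres) are illustrative assumptions.
    """
    if rgb.shape[:2] != depth.shape:
        raise ValueError("color image and depth map must share height and width")
    # Scale 8-bit color to [0, 1]; depth is left in its own units.
    color = rgb.astype(np.float32) / 255.0
    # Append depth as a fourth channel next to R, G, and B.
    return np.concatenate([color, depth.astype(np.float32)[..., None]], axis=-1)


# Toy inputs: a uniformly red 4 x 6 image whose pixels all lie 1.5 m away.
rgb = np.zeros((4, 6, 3), dtype=np.uint8)
rgb[..., 0] = 255
depth = np.full((4, 6), 1.5, dtype=np.float32)

rgbd = make_rgbd(rgb, depth)
```

In this sketch each pixel of `rgbd` carries four values, the first three being color and the fourth being depth, so that a single array holds both modalities in pixel-aligned form.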
Various machine learning techniques such as convolutional neural networks (CNNs) are used to recognize image objects, as such techniques can automatically learn 3D image features without the features being manually designed to capture depth invariances or deformations (e.g., translation, rotation, skew, etc.). Conventionally, CNNs are employed to extract image features separately from the depth and color modalities, and these features are then combined using a late fusion technique. However, color images and depth scans are correlated, with depth discontinuities often manifesting as strong edges in color images, and the late fusion technique learns these correlations inefficiently. Additionally, such a technique cannot benefit from the other modalities present in the training data when one of the modalities is absent at test time. As a result, various tasks are performed poorly in the absence of depth images during testing.
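The conventional late fusion arrangement described above may be sketched, under stated assumptions, as two independent feature branches whose outputs are concatenated before classification. In the sketch below, each branch is a single random linear map followed by a ReLU, standing in for a trained per-modality CNN; the names `branch` and `late_fusion_scores`, the array sizes, and the zero-filling of a missing depth branch are hypothetical choices made for illustration and are not taken from the foregoing description.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: image height and width, features per branch, class count.
H, W, F, C = 8, 8, 16, 3

# Random weights stand in for trained parameters (illustrative only).
w_color = rng.standard_normal((H * W * 3, F))
w_depth = rng.standard_normal((H * W, F))
w_fused = rng.standard_normal((2 * F, C))


def branch(x, weights):
    """Stand-in for a per-modality CNN: flatten, linear map, ReLU."""
    return np.maximum(x.reshape(-1) @ weights, 0.0)


def late_fusion_scores(rgb, depth=None):
    """Extract features per modality, then fuse them only at the end."""
    f_color = branch(rgb, w_color)
    # Assumption: a missing depth input is replaced by an all-zero feature
    # vector, so the depth branch contributes nothing at test time.
    f_depth = branch(depth, w_depth) if depth is not None else np.zeros(F)
    fused = np.concatenate([f_color, f_depth])
    return fused @ w_fused


rgb = rng.random((H, W, 3))
depth = rng.random((H, W))

scores_full = late_fusion_scores(rgb, depth)
scores_rgb_only = late_fusion_scores(rgb)
```

Because the two branches interact only at the final concatenation, the color branch in this sketch has no opportunity to learn from depth during feature extraction, and removing the depth input changes the class scores without any compensating information in the color features.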
It may therefore be beneficial to provide robust systems and methods for object recognition that do not depend on every object feature modality being available for learning.