Hand gestures and gesticulations are a common form of human communication. It is therefore natural for humans to use this form of communication to interact with machines as well. For instance, touchless human-computer interfaces in vehicles can improve comfort and safety. Recently, deep convolutional neural networks (CNNs) that receive video sequences of gestures as inputs have proven effective for gesture recognition, and have significantly advanced the accuracy of dynamic hand gesture and action recognition tasks. CNNs are also useful for combining the input data from multiple sensors (multi-modal data) for gesture recognition in challenging lighting conditions. However, real-world systems for dynamic hand gesture recognition present numerous open challenges that are yet to be addressed.
First, the systems receive continuous streams of unprocessed visual data, in which gestures known to the system must be simultaneously detected and classified. Conventional systems typically treat gesture segmentation and classification separately. Two classifiers, a detection classifier to distinguish between “gesture” and “no gesture”, and a recognition classifier to identify the specific gesture type, are often trained separately and applied in sequence to the input data streams. There are two reasons for this: (1) to compensate for variability in the duration of gestures and (2) to reduce noise due to unknown hand motions in the “no gesture” class, thereby simplifying the task of the recognition classifier. However, because the recognition classifier only sees what the detection classifier passes on, any gesture missed by the detection classifier cannot be recovered downstream, and the accuracy achievable by the system is limited to the accuracy of the upstream gesture detection classifier.
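The accuracy ceiling of the conventional two-stage arrangement can be illustrated with a minimal sketch. Everything in it is hypothetical and chosen only for illustration: the function names `two_stage_recognize` and `gesture_recall`, the single-number "clips", the threshold detector, and the oracle recognizer are assumptions, not part of any described system.

```python
from typing import Callable, List, Optional, Sequence

Clip = Sequence[float]  # hypothetical stand-in for a window of video features


def two_stage_recognize(
    clips: Sequence[Clip],
    detect: Callable[[Clip], bool],
    classify: Callable[[Clip], str],
) -> List[Optional[str]]:
    """Run the detector first; run the recognizer only on accepted clips."""
    labels: List[Optional[str]] = []
    for clip in clips:
        if detect(clip):
            labels.append(classify(clip))
        else:
            labels.append(None)  # rejected clips are treated as "no gesture"
    return labels


def gesture_recall(
    predicted: Sequence[Optional[str]], truth: Sequence[Optional[str]]
) -> float:
    """Fraction of true gesture clips that received the correct label."""
    gesture_idx = [i for i, t in enumerate(truth) if t is not None]
    correct = sum(1 for i in gesture_idx if predicted[i] == truth[i])
    return correct / len(gesture_idx)


# Toy data (assumed): one feature per clip, None marks "no gesture".
clips = [[0.9], [0.2], [0.8], [0.1], [0.4]]
truth = ["swipe", None, "tap", None, "swipe"]

# Assumed detector: a simple threshold that misses the weak gesture (0.4).
detect = lambda clip: clip[0] > 0.5

# Assumed recognizer: an oracle that is always right on clips it is given.
oracle = {(0.9,): "swipe", (0.8,): "tap", (0.4,): "swipe"}
classify = lambda clip: oracle[tuple(clip)]

predicted = two_stage_recognize(clips, detect, classify)
recall = gesture_recall(predicted, truth)
```

In this toy setup the recognizer never errs, yet one of the three gestures is still lost because the detector rejected it before the recognizer could see it. The end-to-end recall therefore equals the detector's recall and cannot exceed it.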
Second, dynamic hand gestures generally contain three temporally overlapping phases: preparation, nucleus, and retraction, of which the nucleus is the most discriminative. The other two phases can be quite similar across different gestures and are hence less useful, or even detrimental, to accurate gesture classification. Therefore, classifiers often rely primarily on the nucleus phase for gesture classification.
Finally, humans are acutely perceptive of the response time of user interfaces, with lags greater than 100 ms perceived as annoying. This presents the additional challenge of detecting and classifying gestures immediately upon (or preferably before) completion of the gesture to provide immediate feedback to users. There is a need for addressing these issues and/or other issues associated with the prior art.