Gesture detection or recognition is a technology particularly suited to interactive systems, for several reasons. First, gestures are commonly used in natural conversation and non-verbal communication, so people generally feel comfortable gesturing. Moreover, gestures form a vocabulary or language, so they can be used to convey specific information that is directly interpretable by the computer, for example to activate distinct events in an interactive application. This kind of interaction is also naturally understood and easily remembered by the user, again because it is a familiar form of non-verbal communication. For this reason, interfaces based on gestures can be simpler than other kinds of interaction in which the user feels less comfortable and which require a longer learning period.
Gesture recognition technologies focus on recognizing a particular body movement or pose by classifying sensor data with a computer program. Data classification methods require examples of each class in order to compute models that allow the gesture to be recognized from novel input data. Data collection and annotation is a cumbersome task that, in the case of body gestures, requires substantial human effort. The present invention allows data examples to be collected and annotated, and gesture classifiers to be trained, with a considerable reduction of human effort in these tasks.
Conventionally, gesture recognition technologies require tracking of body parts, i.e. skeleton tracking as described in U.S. Pat. No. 8,824,781. In these cases, gesture recognition is based on the classification of trajectories and relative positions of the joints of the skeleton (e.g. Microsoft SDK, Kinect Studio).
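As an illustration of the skeleton-based approach described above, the following minimal sketch builds a feature vector from the relative positions of skeleton joints. It is not taken from the cited patent or from any particular SDK. The function name `relative_joint_features`, the choice of a root joint, and the scale normalization are assumptions made for the example.

```python
import numpy as np


def relative_joint_features(joints, root_index=0):
    """Flatten joint positions expressed relative to a root joint.

    joints: array of shape (num_joints, 3) holding 3D joint positions.
    root_index: hypothetical choice of reference joint (e.g. the torso).
    """
    joints = np.asarray(joints, dtype=float)
    # Subtracting the root joint makes the features translation invariant.
    offsets = joints - joints[root_index]
    # Dividing by the largest offset gives a rough body-size invariance.
    scale = np.linalg.norm(offsets, axis=1).max()
    if scale > 0:
        offsets = offsets / scale
    return offsets.ravel()
```

A classifier trained on such vectors can only distinguish gestures that differ in the tracked joints, which is the resolution limitation discussed in the next paragraph.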
Such gesture recognition approaches depend on the accuracy of the skeleton tracking method employed in order to classify the gesture. Moreover, depending on the number of skeleton joints obtained, some gestures cannot be recognized because of the lack of resolution. For example, if the skeleton tracking approach does not capture the positions of the finger joints, gestures involving finger pose will not be recognized. Capturing the pose of skeleton joints is a computationally expensive task, especially when attempting to compute the pose of the fingers, due to the high number of joints and degrees of freedom. Also, existing methods for hand skeleton tracking only capture the pose at close range to the input sensor (less than 1 meter). For all these reasons, gesture recognition based on skeleton tracking is not suitable for certain scenarios. This is the case when the application requires a low computational load, or requires a response in cases where the full body might not be visible, or when gesture recognition is needed at distances greater than 2 m from the sensor. The present invention concerns a system able to detect still gestures which relies only on the classification of depth image patches. The system does not require skeleton tracking and determines the gesture from local shape. In this manner, gestures involving fingers, hands, arms or any other body part can be recognized at a distance from the sensor, provided that the pixel resolution of the sensor is sufficient to capture the shape details that distinguish the gesture. Operating in this manner, the computational load is lower than that of skeleton tracking and, moreover, the gesture can be recognized in the presence of occlusions or a partial view of the body.
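To make the notion of a depth image patch concrete, the following minimal sketch extracts a square patch around a pixel and expresses it as depth offsets relative to the central pixel, so that the patch describes local shape rather than absolute distance. This is an illustrative assumption and not the feature definition of the present invention. The function name `extract_depth_patch`, the patch size, the clipping range and the edge padding are all hypothetical choices.

```python
import numpy as np


def extract_depth_patch(depth, row, col, half_size=16, max_offset=0.3):
    """Return a square patch of depth offsets centred on (row, col).

    depth: 2D array of depth values, assumed here to be in metres.
    half_size: hypothetical half-width of the patch in pixels.
    max_offset: hypothetical clipping range, so that a distant background
        does not dominate the local shape description.
    """
    depth = np.asarray(depth, dtype=float)
    # Pad by repeating edge values so patches near the border keep their size.
    padded = np.pad(depth, half_size, mode="edge")
    r, c = row + half_size, col + half_size
    patch = padded[r - half_size:r + half_size + 1,
                   c - half_size:c + half_size + 1]
    # Offsets relative to the central depth describe local shape only.
    relative = patch - depth[row, col]
    return np.clip(relative, -max_offset, max_offset)
```

Because the patch is defined around a single pixel, no skeleton and no prior segmentation of a body part is needed to compute it.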
Similarly to the present invention, some hand gesture recognition approaches rely on depth image features in the local context of the hand, such as those described in U.S. Pat. No. 7,340,077. Such approaches require hand segmentation to be performed beforehand. This segmentation step is prone to errors and, in addition, implies that the system is not able to recognize gestures in which other parts of the body are involved, such as the head or arms. This limits its application to a subset of gestures. The present invention does not require such a segmentation step, which extends the applicability of the system to further cases and scenarios.
The present invention relies on still gesture localization. The detector obtains the position of the gesture both in the image and in the 3D world domain, and identifies the gesture class. In the task of training the gesture detector, the main difficulty is that the training data is highly unbalanced: the training set is usually composed of few positive examples and a huge number of negative examples. The techniques for negative sample mining described in J. Gall, A. Yao, N. Razavi, L. Van Gool, and V. Lempitsky, “Hough forests for object detection, tracking, and action recognition”, TPAMI, 33(11):2188-2202, 2011, have been proposed to overcome this issue. In such methods, the best training samples are chosen automatically. Even so, the performance of the method is still highly influenced by the set of negative examples available in the training set.
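The general idea of negative sample mining can be sketched as a bootstrapping loop: train on a small subset of negatives, find the negatives the current model wrongly scores as gestures, add them, and retrain. The sketch below is a generic illustration and is not claimed to reproduce the procedure of the cited paper. The names `mine_hard_negatives`, `centroid_train` and `centroid_score`, the toy nearest-centroid model, and all default parameters are assumptions made for the example.

```python
import numpy as np


def centroid_train(positives, negatives):
    """Toy model (assumption): one centroid per class."""
    return (np.asarray(positives, dtype=float).mean(axis=0),
            np.asarray(negatives, dtype=float).mean(axis=0))


def centroid_score(model, samples):
    """Positive score means the sample is closer to the gesture centroid."""
    mu_pos, mu_neg = model
    samples = np.asarray(samples, dtype=float)
    return (np.linalg.norm(samples - mu_neg, axis=1)
            - np.linalg.norm(samples - mu_pos, axis=1))


def mine_hard_negatives(train, score, positives, negative_pool,
                        initial=50, per_round=50, rounds=3, seed=0):
    """Grow the negative set with the pool samples the model gets wrong."""
    rng = np.random.default_rng(seed)
    pool = np.asarray(negative_pool, dtype=float)
    # Start from a small random subset of the (much larger) negative pool.
    start = rng.choice(len(pool), size=min(initial, len(pool)), replace=False)
    selected = set(int(i) for i in start)
    model = train(positives, pool[sorted(selected)])
    for _ in range(rounds):
        scores = score(model, pool)
        # Hard negatives: unused pool samples scored as gestures, worst first.
        hard = [int(i) for i in np.argsort(-scores)
                if scores[i] > 0 and int(i) not in selected][:per_round]
        if not hard:
            break  # the model no longer produces new false positives
        selected.update(hard)
        model = train(positives, pool[sorted(selected)])
    return model, sorted(selected)
```

Only the selected negatives are used for training, so the training set stays far smaller than the full pool. The result still depends on which negatives the pool contains, which is the limitation noted above.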
Once a detector is trained, it may become apparent that the detector fails for certain examples. An empirical solution to this problem is to collect more data based on the test failures and train the detector again. This process can be tedious: training can be slow and, once it finishes, new data must be tested and recorded manually before training again, and so on in an iterative manner. In addition, the training set grows and, as a consequence, the training process becomes slower and the memory requirements increase.
Batch learning methods are also known, such as the one proposed in Alcoverro Marcel et al., “Gesture control interface for immersive panoramic displays”, Multimedia Tools and Applications, Kluwer Academic Publishers, vol. 73, no. 1, July 2013, pages 491-517. However, batch learning methods require a large amount of resources, such as computing power and memory. Therefore, unless a large amount of resources is provided, batch learning methods do not provide a reliable level of accuracy.
It is therefore an objective of the present invention to provide a method for setting a tridimensional shape classifier, and a method for shape detection using said classifier, with better performance and reliability.