Human-to-computer interactive systems and methods using gesture and movement recognition are known from the computer vision domain. More recent interactive systems based on range-finding camera devices, for example, systems using a 3D structured light camera, are also known to enable 3D gesture and movement recognition of users when the gestures and movements are performed within the field of view of the particular imaging device. In such systems, a human user typically stands at least 1 m from the camera, and his/her movements and/or gestures need to be bold, that is, large enough to be detected by the 3D camera, the resolution of which is predefined and limited. For example, a typical 3D camera has quarter video graphics array (QVGA) resolution, that is, 320×240 pixels, which limits the accuracy with which movement can be detected and the capacity to detect small objects, such as fingers, when they are situated too far away from the 3D imaging device. In such systems, it is generally not possible to detect and track reliably the hands of a user, and more particularly, the fingers of the hand, at such distances. More precisely, most current 3D imaging devices are built to enable full skeletal tracking of a user in the range between 1 m and 5 m from the camera. Moreover, most of these 3D imaging devices have technical specifications which do not allow them to perform measurements in the range between 0 m and 1.5 m from the camera sensor.
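The effect of a limited resolution on the detection of small objects can be illustrated with a short calculation. The following is only a rough sketch under stated assumptions: the 60° horizontal field of view and the 15 mm finger width are hypothetical values chosen for illustration and are not taken from any particular camera, and the function names are likewise illustrative.

```python
import math


def pixel_footprint_mm(distance_m, fov_deg, pixels_across):
    """Approximate scene width, in mm, covered by one pixel at a given distance."""
    scene_width_m = 2.0 * distance_m * math.tan(math.radians(fov_deg) / 2.0)
    return 1000.0 * scene_width_m / pixels_across


def pixels_on_target(target_mm, distance_m, fov_deg=60.0, pixels_across=320):
    """Number of pixels spanned by an object of width target_mm.

    fov_deg=60.0 is an assumed horizontal field of view; pixels_across=320
    corresponds to the horizontal QVGA resolution.
    """
    return target_mm / pixel_footprint_mm(distance_m, fov_deg, pixels_across)


# Assumed finger width of 15 mm, viewed at several distances.
for distance in (0.5, 1.0, 3.0, 5.0):
    span = pixels_on_target(15.0, distance)
    print(f"{distance:.1f} m: {span:.1f} px across a 15 mm finger")
```

Under these assumptions, a finger spans roughly four pixels at 1 m and less than one pixel at 5 m, which is consistent with the difficulty of tracking fingers reliably at such distances.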
However, with the development of new technologies, among which are 3D time-of-flight (ToF) cameras, it has become possible to detect and track features at different distances from the camera sensor, including features, such as the hands and/or fingers of a user, at distances of less than 1 m, and typically less than 0.5 m, from the camera sensor. This is termed “close interaction”, and it is particularly suited to enabling touch-less hand and finger gesture based interactions in a desktop environment or in other environments, such as an automotive environment.
It is known to detect the presence of a plurality of parts of a human being within an image embedding colour channels for each pixel using detection of skin colour. Such a method is described in an article entitled “Real Time Tracking of Multiple Skin Colored Objects with a Possibly Moving Camera” by A. A. Argyros & M. I. A. Lourakis, in Proceedings of the European Conference on Computer Vision (ECCV '04), Springer-Verlag, vol. 3, pages 368 to 379, May 11 to 14, 2004, Prague, Czech Republic. Here, blobs are detected using a Bayesian classifier which is bootstrapped with a small training data set and which includes skin-colour probabilities to enable the classifier to cope with illumination changes, as colour measurements may be disturbed by environmental illumination. This detection method basically enables the detection of the hands and face of a user in a close interaction environment, and may then be combined with tracking of the detected skin-colour blobs so that information about the position of the blobs in the image can be provided. Such information can then be used to provide input to gesture recognition systems, for example, to control the mouse of a computer in a simple way without physical contact.
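The principle of a Bayesian skin-colour classifier can be sketched as follows. This is a minimal illustration and not the implementation of the cited article: the class name, the coarse RGB histogram, the bin count and the training samples are all assumptions made for the example. It applies Bayes' rule, P(skin | colour) = P(colour | skin) · P(skin) / P(colour), to probabilities estimated from a small set of labelled pixels.

```python
from collections import Counter


def quantise(colour, bins=8):
    """Map an (r, g, b) triple with 0-255 channels to a coarse histogram bin."""
    step = 256 // bins
    return tuple(channel // step for channel in colour)


class SkinColourClassifier:
    """Histogram-based estimate of P(skin | colour); a hypothetical helper."""

    def __init__(self, bins=8):
        self.bins = bins
        self.skin_counts = Counter()  # labelled skin pixels per colour bin
        self.all_counts = Counter()   # all labelled pixels per colour bin

    def train(self, labelled_pixels):
        """Accumulate counts from (colour, is_skin) pairs."""
        for colour, is_skin in labelled_pixels:
            key = quantise(colour, self.bins)
            self.all_counts[key] += 1
            if is_skin:
                self.skin_counts[key] += 1

    def probability(self, colour):
        """Return P(skin | colour) by Bayes' rule, or 0.0 for unseen colours."""
        key = quantise(colour, self.bins)
        total = sum(self.all_counts.values())
        total_skin = sum(self.skin_counts.values())
        if total == 0 or total_skin == 0 or self.all_counts[key] == 0:
            return 0.0
        p_skin = total_skin / total
        p_colour = self.all_counts[key] / total
        p_colour_given_skin = self.skin_counts[key] / total_skin
        return p_colour_given_skin * p_skin / p_colour


# Made-up training pixels, for illustration only.
training = [
    ((220, 170, 140), True),
    ((210, 160, 130), True),
    ((215, 165, 135), False),  # a skin-like background colour
    ((30, 60, 200), False),
    ((40, 200, 60), False),
]
clf = SkinColourClassifier()
clf.train(training)
print(clf.probability((212, 168, 138)))
print(clf.probability((30, 60, 200)))
```

In such a scheme, pixels whose probability exceeds a chosen threshold would be grouped into blobs; coping with illumination changes would additionally require the probabilities to be updated over time, which is not shown here.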
In another disclosure, a fast and robust segmentation technique based on the fusion of 2D/3D images for gesture recognition is described by S. E. Ghobadi, O. E. Loepprich, K. Hartmann, & O. Loffeld in an article entitled “Hand Segmentation using 2D/3D Images”, Proceedings of Image and Vision Computing New Zealand 2007, pages 64 to 69, Hamilton, New Zealand, December 2007. A 3D ToF camera is used for generating an intensity image together with range information for each pixel of a Photonic Mixer Device (PMD) sensor. The intensity and range data are fused to provide input information to a segmentation algorithm relying on the K-Means mixed with expectation maximisation technique (KEM). KEM combines K-means clustering with the expectation maximisation technique so that the centres of natural clusters are located in an initial step using K-means and, in a second step, the soft probabilistic assignments of expectation maximisation are used to find local maxima iteratively and thus refine the position and borders of the clusters found. In a simple environment in which the background is uniform and at a sufficient distance from the objects to be segmented, the fused data enables a proper segmentation in real time of the hand of a user from his/her body, face, arm or other objects in the scene, whatever the illumination conditions. The robustness of the method in the environment considered is said to be sufficient for its use as the first processing step in 2D/3D gesture tracking and gesture recognition.

In another article entitled “Tracking the Articulated Motion of Two Strongly Interacting Hands” by I. Oikonomidis, N. Kyriazis & A. A. Argyros, (http://www.ics.forth.gr/~argyros/mypapers/2012 06 cvpr twohands.pdf), particle swarm optimisation (PSO) is used to process signals from an RGB-D sensor, that is, a colour sensor that also provides depth information. The method performs marker-less visual observations to track the full articulation of two hands interacting with one another in a complex, unconstrained manner. The method described achieves complete detection and tracking of two interacting hands by directly attributing sensory information to the joint articulation of two synthetic and symmetric 3D hand models of known size and kinematics. For given articulations of the two hands, a prediction of what the RGB-D sensor would perceive is obtained by simulating the image acquisition process, for example, by producing synthetic depth maps for the specific camera-scene calibrations. Having established a parametric process that produces data comparable to the actual input, tracking is performed by searching for the parameters that produce the depth maps which are most similar to the actual input. Tracking is performed in an online fashion: at each step and for every new input, an optimisation step is performed in which a variant of PSO minimises the discrepancy between the actual RGB-D input and the simulated depth maps generated from hypothesised articulations, the best scoring hypothesis constituting the solution for the current input. The discrepancy measure is carefully formulated so that robustness is achieved, with computational efficiency and temporal continuity being exploited at each optimisation step.
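The hypothesise-and-compare loop described above can be illustrated with a toy example. The sketch below is not the method of the cited article: it uses a basic global-best PSO rather than the variant used by the authors, and `render_depth` is a hypothetical two-parameter stand-in for the rendering of synthetic depth maps from an articulated hand model. All names, bounds and coefficients are assumptions made for the illustration.

```python
import random


def render_depth(params, width=10):
    """Toy stand-in for a depth renderer: a tilted line of depth samples."""
    offset, slope = params
    return [offset + slope * (i / (width - 1)) for i in range(width)]


def discrepancy(params, observed):
    """Sum of squared differences between simulated and observed depths."""
    simulated = render_depth(params, len(observed))
    return sum((s - o) ** 2 for s, o in zip(simulated, observed))


def pso_minimise(cost, bounds, particles=30, iterations=200, seed=0):
    """Basic global-best PSO; returns the best position and its cost."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(particles)]
    vel = [[0.0] * dim for _ in range(particles)]
    best_pos = [p[:] for p in pos]
    best_cost = [cost(p) for p in pos]
    g = min(range(particles), key=lambda i: best_cost[i])
    global_pos, global_cost = best_pos[g][:], best_cost[g]
    w, c1, c2 = 0.72, 1.49, 1.49  # commonly used inertia and attraction weights
    for _ in range(iterations):
        for i in range(particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (global_pos[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            c = cost(pos[i])
            if c < best_cost[i]:
                best_pos[i], best_cost[i] = pos[i][:], c
                if c < global_cost:
                    global_pos, global_cost = pos[i][:], c
    return global_pos, global_cost


# "Observed" input generated from known parameters, so the fit can be checked.
observed = render_depth((0.45, 0.10))
estimate, score = pso_minimise(lambda p: discrepancy(p, observed),
                               bounds=[(0.0, 1.0), (-0.5, 0.5)])
print(estimate, score)
```

In an online tracker of the kind described, the swarm for each new frame would typically be initialised around the solution found for the previous frame, which is one way of exploiting temporal continuity; this is not shown in the sketch.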
Forearm and hand segmentation is also described in “Fingertip Detection for Hand Pose Recognition” by M. K. Bhuyan et al., International Journal on Computer Science and Engineering (IJCSE), Vol. 4, No. 3, March 2012. Here, a method for fingertip detection and finger type recognition is proposed in which fingertips are located in the hand region using skin colour segmentation based on Bayes' rule. Morphological operations are then performed in the segmented hand region by observing some key geometric features. A final probabilistic modelling of the geometric features of finger movements is then performed so that the finger type recognition process is made significantly more robust, especially in the context of providing valuable information for finger spelling in sign language recognition and gesture animation.
In an article entitled “Real-time hand posture recognition using range data” by S. Malassiotis & M. G. Strintzis, Image and Vision Computing, 26 (2008), pages 1027 to 1037, a method is described in which arm segmentation, hand-forearm segmentation, hand pose estimation and gesture classification are performed in order to recognise complex hand postures such as those encountered in sign language alphabets. Three-dimensional image data is captured and processed in accordance with the method by utilising three-dimensional hand geometry to recognise sign language postures and/or gestures.
WO-A-03/071410 describes another method for enabling a person to interact with an electronic device using gestures and poses. The method includes obtaining position information for a plurality of discrete regions of a body part of the person. The position information includes depth values of each discrete region of the body part relative to a reference position. A depth image of a 3D scene is captured using a 3D sensor system and processed to segment one or more regions of interest from the background of the scene, each region of interest including a hand whose gestures are to be detected. Each gesture is classified and compared to a set of different hand gestures.
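The idea of segmenting a region of interest from the background of a depth image can be sketched with a simple depth threshold. This is an assumed stand-in for illustration only: the cited document is not stated here to use this particular rule, and the function names, depth limits and sample values are hypothetical.

```python
def segment_by_depth(depth_map, near_m, far_m):
    """Return a binary mask marking pixels whose depth lies in [near_m, far_m]."""
    return [[1 if near_m <= d <= far_m else 0 for d in row] for row in depth_map]


def centroid(mask):
    """Mean (row, column) of the marked pixels, or None if the mask is empty."""
    points = [(r, c) for r, row in enumerate(mask)
              for c, value in enumerate(row) if value]
    if not points:
        return None
    n = len(points)
    return (sum(r for r, _ in points) / n, sum(c for _, c in points) / n)


# Made-up 4x4 depth map in metres: a near object in front of a far background.
depth_map = [
    [2.0, 2.0, 2.0, 2.0],
    [2.0, 0.4, 0.4, 2.0],
    [2.0, 0.4, 0.5, 2.0],
    [2.0, 2.0, 2.0, 2.0],
]
mask = segment_by_depth(depth_map, near_m=0.2, far_m=0.8)
print(mask)
print(centroid(mask))
```

A fixed threshold of this kind only works when the region of interest is well separated in depth from the rest of the scene; a practical system would need a more robust segmentation step.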
Despite hand detection and modelling being widely addressed in the literature, none of the documents described above provides for the detection and tracking of the parts of the hand and their parameters, including fingers, palm and orientation, at distances close to the imaging device, using a depth map from a 3D imaging sensor as input information and without the use of any colour information, hand model or skeletal representation having articulations.