Computer-based estimation of human poses is the basis of gesture-controlled human-machine interaction. Body or hand gestures are captured by cameras; the captured digital images are processed in the computer and interpreted as commands that are eventually carried out by the computer or by equipment controlled by it. The human user no longer requires separate input equipment once he has mastered the commanding gestures.
Particularly interesting areas of application of gesture control include, on the one hand, the field of surgery, where the operating physician would like to have direct control of auxiliary equipment (e.g. imaging devices such as ultrasound or MRI) but cannot touch any control devices with his hands without compromising sterility, and, on the other hand, the field of public information terminals or ticket machines, which at present are still equipped with rather unhygienic touch pads. A further field of application that has already been opened up commercially is the computer game sector.
The purpose of a gesture-control method is to give the optical image of a person a machine-interpretable meaning. This requires an apparatus that images the person such that it can be evaluated electronically, compresses this image in terms of its information content and finally translates the compressed image of the person into a machine-interpretable output. The output of the apparatus can consist of control commands for downstream apparatuses to be controlled. However, it is also possible that it comprises only the compressed image information that is fed to a downstream unit for interpreting this information.
One example of compressed image information is the continuous output of the position coordinates of the right hand of the person in a 3D coordinate system. It is often sufficient to output only the coordinates of a single point for the hand position, e.g. if the entire body of the person is imaged. If the motion of the person is imaged by an image sequence, the apparatus mentioned provides, for example, the 3D coordinates of predetermined body parts, which change over time during the motion. The coordinates can serve as variable inputs into a program that, for example, controls a cursor position on a screen accordingly.
During image segmentation, all recorded image data (measurement values) that cannot be assigned to the imaged person are removed, that is in particular image elements that concern the background. Such image elements have to be excluded from further evaluation.
Image segmentation using two-dimensional data is difficult above all if the user is imaged in front of a complex background (for example, further persons move in the background) or if he makes gestures in which he moves extremities towards the camera such that they conceal part of his torso. Since gesture control is to take place in real time and pose estimation is usually to be possible at a video frame rate of 25 Hz or above, image segmentation must be completed within a few milliseconds. For this purpose, depth sensor cameras can be used that measure not only a brightness image, as conventional cameras do, but also the distance of the object from the camera.
A known depth sensor camera is the time-of-flight (TOF) camera. It emits infrared light whose intensity is modulated sinusoidally. The phase shift between the emitted light and the light reflected by the object is measured in each pixel. From this phase shift, the propagation time (“time of flight”) of the light and thus the distance of the object point from the camera can be calculated. A TOF camera provides a depth map that is registered with a brightness image (in TOF nomenclature often called the amplitude image).
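The relationship between phase shift and distance can be illustrated by a minimal sketch. It assumes an ideal sinusoidal modulation and uses the standard relation that the round-trip time equals the phase shift divided by 2π times the modulation frequency. The function names and the 20 MHz modulation frequency in the usage note are illustrative assumptions and are not taken from any particular camera.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def tof_distance(phase_shift_rad, modulation_hz):
    """Distance of the object point for one pixel (hypothetical helper).

    The light travels to the object and back, so the one-way distance
    is half of the path covered during the round-trip time.
    """
    round_trip_time = phase_shift_rad / (2.0 * math.pi * modulation_hz)
    return SPEED_OF_LIGHT * round_trip_time / 2.0


def unambiguous_range(modulation_hz):
    """Largest distance before the measured phase wraps around at 2*pi."""
    return SPEED_OF_LIGHT / (2.0 * modulation_hz)
```

With an assumed modulation frequency of 20 MHz, a phase shift of π corresponds to roughly 3.75 m, and distances beyond roughly 7.5 m can no longer be distinguished from closer ones because the phase repeats.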
A further method for simultaneously obtaining image and distance measurement values is based on structured light that is projected onto the object to be measured and reflected by it. A camera detects the reflected light, usually at an angle different from the angle of incidence, and registers the change in the structure of a projected pattern caused by the position or extent of the reflecting object surface. For example, from the curvature of a line that was originally projected onto the object as a straight line and is captured by the camera after reflection, it is possible to calculate a doming of the reflecting surface, that is, a distance that varies relative to the projector and/or camera. Similarly, a spatially divergent bundle of beams can be used that projects points into a three-dimensional scene; the point reflections are detected and the distances between them are determined. On a surface located closer to the projector, the point distances are smaller than on a surface in the image background. This is used for measuring the distances of surfaces or surface areas from the projector.
Accordingly, a depth sensor camera is an apparatus that, in addition to a two-dimensional brightness image, also provides distance information for each imaged object point, so that the position of all imaged object points along a depth axis (which usually coincides with the optical axis of the camera) is measured as well. The electronic image with distance information recorded using a depth sensor camera is also termed a two-and-a-half-dimensional (2½ D) image of the scene. The apparatuses mentioned above are only examples of how 2½ D images can be produced and do not represent an exhaustive list.
Among others, it can be gathered from the printed publication WO 2010/130245 A1 how image segmentation of 2½ D images can take place correctly. Image segmentation sorts the brightness values detected by the camera pixels according to the distance values measured simultaneously and registered by the pixels. Only brightness values of the foreground remain in the further evaluation, it being assumed that, for the purpose of improved visibility, the person to be observed is closest to the camera. The brightness values of the foreground thus result from imaging the body surface of the person. By means of the camera projection parameters, which are known per se, each of the imaged object points can then be assigned a set of 3D coordinates. A list of 3D coordinates is then obtained that comprises all the points of the person that are directly visible to the camera. The actual person is located inside this “cloud” of points in 3D space, and the 3D point cloud also contains the relevant coordinates of the predetermined body parts that are to be determined for the purpose of gesture control.
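The transition from a depth map to a 3D point cloud can be sketched as follows. The sketch is a simplification and not the segmentation of WO 2010/130245 A1: it assumes that the foreground can be separated by a single fixed depth threshold and that the camera follows a pinhole model. The parameters `fx`, `fy` (focal lengths in pixels), `cx`, `cy` (principal point) and `max_depth` are hypothetical names introduced for illustration.

```python
def segment_and_backproject(depth, fx, fy, cx, cy, max_depth):
    """Keep foreground pixels and convert them to 3D points.

    depth is a list of rows; each entry is the measured distance along
    the optical axis in metres, with 0.0 marking an invalid measurement
    (an assumed convention).
    """
    points = []
    for v, row in enumerate(depth):          # v: pixel row index
        for u, z in enumerate(row):          # u: pixel column index
            if 0.0 < z < max_depth:          # discard background and invalid pixels
                x = (u - cx) * z / fx        # pinhole back-projection
                y = (v - cy) * z / fy
                points.append((x, y, z))
    return points
```

The returned list corresponds to the list of 3D coordinates described above. In practice the threshold would have to be derived from the scene, for example from the distribution of the measured distance values.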
The second sub-step of information compression can thus be seen in determining, from the 3D point cloud that was obtained by image segmentation and represents the person, a reduced set of point coordinates that describes the entire pose of the person as well as possible and is suitable for machine interpretation. This step is also called pose estimation. One aim of pose estimation is the robustness of the reduced data set, i.e. small changes of the human pose shall lead only to small changes in the data sets describing the pose. In particular, the coordinates describing the human body parts shall, as far as possible, move on temporally continuous trajectories so that an unambiguous correlation of the coordinates with these body parts is given at any time.
A known and generally accepted approach is the definition of a skeleton model of the person that is to be fitted as fast as possible into the 3D point cloud.
WO 2010/130245 A1 discloses a method for real-time-capable pose estimation from sequences of 2½ D images, in which a skeleton model is proposed that is described as a topology of nodes and edges. The edges, which can be described as pairs of nodes, encode a neighborhood structure between the nodes. The nodes are fitted into the previously determined point cloud by applying a learning rule for training a self-organizing map (“SOM”).
In the exemplary embodiment of WO 2010/130245 A1, the upper part of the human body is modelled using a topology of 44 nodes and 61 edges. The 3D point cloud representing the person comprises approximately 6500 data points (depicted in the real 3D space in which the person observed exhibits a defined size independently of his distance from the camera), of which approximately 10% are used for training an SOM. All nodes of the topology can be directly regarded as an SOM, while specifying the edges can be regarded as a special requirement or limitation for the learning rule.
The topology is trained separately for each frame of a video sequence, the training result of a frame at the same time serving to initialize the training of the following frame of the sequence. During initialization of the first frame of a sequence the size of the topology is preferably matched to the size of the person in front of the camera by a one-off scaling, and its centre of gravity is displaced into the centre of gravity of the 3D point cloud. If the size of the topology has once been selected correctly, it does not require further adapting during the on-going method, since the method functions scale-invariantly. Training the frames takes place by applying a pattern-by-pattern learning rule having the following steps:
    a. randomly selecting a data point X of the 3D point cloud;
    b. determining that node of the topology that exhibits the minimum distance from X;
    c. determining all neighbouring nodes of the node determined under b. according to the edge specification of the topology;
    d. displacing the nodes determined under b. and c. in the direction of X (see in this respect the equations (2) and (3) of WO 2010/130245 A1),
    e. the displacement vectors being multiplied by learning rates that exhibit precisely half the size for the nodes determined under c. as for the nodes determined under b. (see in this respect WO 2010/130245 A1, p. 13, paragraph 4);
    f. repeating the steps a. to e. for a predetermined number of learning steps while gradually reducing the learning rates.
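The learning rule with steps a. to f. can be sketched in a few lines. The sketch is illustrative only and does not reproduce equations (2) and (3) of WO 2010/130245 A1. In particular, the geometric decay of the learning rate, the default values of `steps`, `eps_start` and `eps_end`, and all function names are assumptions made for this example.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def move_towards(node, target, rate):
    """Displace node in place by rate * (target - node)."""
    for k in range(len(node)):
        node[k] += rate * (target[k] - node[k])


def train_frame(nodes, edges, cloud, steps=200,
                eps_start=0.1, eps_end=0.05, rng=None):
    """Fit the node positions to the point cloud of one frame."""
    rng = rng or random.Random(0)
    positions = [list(n) for n in nodes]
    # Neighbourhood structure derived from the edge list (pairs of node indices).
    neighbours = {i: set() for i in range(len(positions))}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    for t in range(steps):
        # f. learning rate shrinks gradually (assumed geometric schedule)
        eps = eps_start * (eps_end / eps_start) ** (t / max(steps - 1, 1))
        x = rng.choice(cloud)                                   # a.
        winner = min(range(len(positions)),
                     key=lambda i: squared_distance(positions[i], x))  # b.
        move_towards(positions[winner], x, eps)                 # d.
        for j in neighbours[winner]:                            # c.
            move_towards(positions[j], x, eps / 2.0)            # d., e.
    return positions
```

The positions returned for one frame can be passed as `nodes` when training the following frame, which corresponds to the initialization described above.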
It is convenient to specify a maximum number of learning steps for each frame so that the pose estimation (i.e. in this case fitting the skeleton model into the 3D point cloud and reading out all relevant node positions) is carried out within a predetermined time interval. In this way, image sequences can also be analysed at the video frame rate or even faster.
Although the algorithm of WO 2010/130245 A1 fulfils the object of real-time pose estimation well, it still exhibits a few weaknesses, some of which are mentioned in the printed publication itself. In particular when analysing scenes in which the person brings his arms together or crosses them in front of the body, the learning rule can lead to misinterpretations (which can be corrected during the course of further iterations) if individual nodes are pulled far away from their actual neighbours in the topology. This effect is countered in WO 2010/130245 A1 with an anchoring point in the model torso and a secondary condition of the learning rule that inhibits node displacements away from the anchoring point beyond a predetermined threshold.
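One possible reading of such a secondary condition is sketched below. It is an assumption made for illustration and not the formulation of WO 2010/130245 A1: a node that has moved further from the anchoring point than the threshold is pulled back along the connecting line until it lies exactly at the threshold distance. The function name `clamp_to_anchor` and the parameter `max_distance` are hypothetical.

```python
import math


def clamp_to_anchor(node, anchor, max_distance):
    """Limit the distance of a node from the anchoring point."""
    offset = [n - a for n, a in zip(node, anchor)]
    distance = math.sqrt(sum(c * c for c in offset))
    if distance <= max_distance:
        return tuple(node)                 # within the threshold: unchanged
    scale = max_distance / distance        # shrink the offset onto the threshold
    return tuple(a + scale * c for a, c in zip(anchor, offset))
```

Such a function could be applied to a node after each displacement of the learning rule.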
The teaching of WO 2010/130245 A1 also shows difficulties with the precise position determination of human joints, shoulders, and hips, each of which can be represented by several different nodes. The skeleton model outlined in WO 2010/130245 A1 exhibits relatively many nodes, whose number cannot readily be reduced to 20 or fewer without accepting considerable errors in the pose estimation. Systems that are available on the market for gesture control by means of depth sensor cameras already operate using skeleton models that have 15 to 20 nodes and are designed more closely according to human anatomy. By reducing the node count, a higher processing speed of the camera images can also be obtained.
Anatomically motivated skeleton models are additionally suited for falling back on stored movement patterns (templates) for detecting fast and complex movements (e.g. swinging a golf club). In these cases, the gesture-control software looks for the most likely match of the detected pose change to a previously stored movement sequence and uses this known template for the actual control. This technology is already used in computer games, but it is resource intensive. Last but not least, producing the stored movement data already gives rise to considerable costs.
Gesture control by means of SOM training, on the other hand, dispenses with templates completely and is instead based solely on the real-time-capable detection of movement continuity. Because its learning rules can be implemented efficiently, it has the potential to detect even fast human movements reliably while remaining universally applicable, so that a possibly complex adaptation of the software to the measurement task is unnecessary.