The estimation of the human pose in three dimensions (3D) has been studied in a number of scientific papers. Some of these papers have focused on the reconstruction of the human pose from 2D data acquired by a conventional camera.
In Agarwal A and Triggs B, Recovering 3D Human Pose from Monocular images; IEEE Transactions on Pattern Analysis and Machine Intelligence, 28 (1) (2006) 44-58, the pose is obtained from shape descriptors of silhouettes of the human body.
Rosales R and Sclaroff S, Inferring Body Pose without Tracking Body Parts; Proceedings of Computer Vision and Pattern Recognition (2000) 721-727, map simple visual features of a segmented body onto a series of possible configurations of the body and identify the pose as the configuration which is most probable in view of the given visual features.
A further approach in Shakhnarovich G, Viola P and Darrell T, Fast Pose Estimation with Parameter-Sensitive Hashing, Proceedings of the International Conference on Computer Vision (2003) 750-757, uses a large database of exemplary images of human poses. The authors use parameter-sensitive hashing functions to search the database for the exemplary pose which is most similar to a given pose.
A major disadvantage of all of the methods based on 2D data is that the segmentation of the person whose pose is to be estimated is difficult, in particular in scenes with a complex background. Moreover, this segmentation costs processing power and hence speed.
Another problem of 2D images is the detection of extremities which are directed at the camera and hide part of the upper body in the 2D projection. In such a situation, the extremities can no longer be detected in the silhouette and detection becomes time-consuming.
A pose estimation based on 3D data is described, for example, in Weik S and Liedtke C-E, Hierarchical 3D Pose Estimation for Articulated Human Body Models from a Sequence of Volume Data; Proc. of the International Workshop on Robot Vision (2001) 27-34. Here, the 3D volume of a person is acquired by means of a multi-camera configuration using the shape-from-silhouette method. Subsequently, a 2D projection of the volume is calculated by means of a virtual camera and a model of the human skeleton is adapted to this projection. To estimate the 3D pose, the model of the skeleton is then retransferred into the 3D space by inverting the 2D projection.
The disadvantages of the method of Weik and Liedtke are that the acquisition of the 3D volume has to be carried out within a special device having multiple cameras in front of a uniformly green background and the calculation of the 3D volume is a time-consuming technique.
A further approach for the calculation of a 3D skeleton model is to thin out volumetric data directly within the three-dimensional space [Palagyi K and Kuba A, A Parallel 3D 12-Subiteration Thinning Algorithm; Graphical Models and Image Processing, 61 (4) (1999), 199-221]. The human pose can then be estimated by means of the skeleton model.
A method for pose estimation based on stereoscopy was published by Yang H-D and Lee S in Reconstructing 3D Human Body Pose from Stereo Image Sequences Using Hierarchical Human Body Model Learning; ICPR '06: Proceedings of the 18th International Conference on Pattern Recognition (2006) 1004-1007. The authors introduce a hierarchical model of the human body. Both the silhouette and the depth information of a given image are used to find the pose with the best match in the database.
A disadvantage of this approach is that stereoscopy involves a long processing time. In addition, stereoscopy provides reliable depth data only if the respective scene has sufficient texture.
A pose estimation by means of a self-organizing map (SOM) is described by Winkler S, Wunsch P and Hirzinger G in A Feature Map Approach to Real-Time 3D Object Pose Estimation from Single 2D Perspective Views; Mustererkennung 1997 (Proc. DAGM) (1997), 129-136.
A SOM is a special type of neural network which can be trained for a task. The SOM is used here to learn a mapping from a 64-dimensional feature space into the three-dimensional space of possible rotations of a rigid object. Artificially generated views of the rigid object are used as the training data. 2D color photographs of the object form the basis of the application of this method. Based on the color information, the object is localized within the images and is cut out. Subsequently, the image is processed by a Sobel operator, which responds to abrupt changes in brightness within the image, so-called edges, so that high pixel values are allocated to the respective regions. In contrast, pixel values close to zero are allocated to uniformly colored regions. Finally, an image of 8×8 pixels, whose 64 values form the feature vector, is generated from this edge image by reducing the resolution. The SOM maps the resulting feature vector onto one of 360 possible orientations in three-dimensional space.
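The feature extraction described above (Sobel filtering followed by a reduction to 8×8 pixels) can be illustrated by the following minimal sketch. It is not the implementation of Winkler et al; the function names, the use of the gradient magnitude, the zero-valued image border and the block averaging used for the reduction are assumptions made for illustration only.

```python
import math

def sobel_magnitude(img):
    """Gradient magnitude of a grayscale image (list of rows) using 3x3 Sobel kernels.
    Border pixels are left at zero (an assumption of this sketch)."""
    h, w = len(img), len(img[0])
    kx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal gradient kernel
    ky = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical gradient kernel
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0.0
            for j in range(3):
                for i in range(3):
                    p = img[y + j - 1][x + i - 1]
                    gx += kx[j][i] * p
                    gy += ky[j][i] * p
            out[y][x] = math.hypot(gx, gy)
    return out

def downsample_mean(img, size=8):
    """Reduce the resolution to size x size by averaging equal blocks.
    Assumes both image sides are multiples of size."""
    bh, bw = len(img) // size, len(img[0]) // size
    result = []
    for by in range(size):
        row = []
        for bx in range(size):
            block = [img[y][x]
                     for y in range(by * bh, (by + 1) * bh)
                     for x in range(bx * bw, (bx + 1) * bw)]
            row.append(sum(block) / len(block))
        result.append(row)
    return result

def edge_feature_vector(img):
    """64-dimensional feature vector: the 8x8 reduced edge image, flattened row by row."""
    small = downsample_mean(sobel_magnitude(img), 8)
    return [v for row in small for v in row]
```

A uniformly colored image yields a feature vector of zeros, whereas an image containing a brightness step yields non-zero entries in the cells which the edge passes through.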
The disadvantages of the method of Winkler et al include the fact that this method exclusively treats rigid objects and therefore cannot be used for the 3D estimation of the human pose. In addition, this method is essentially based on the extraction of edge information by means of the Sobel operator. In the case of persons who normally wear different clothes and are photographed in complex natural scenes under varying conditions of illumination, it can be assumed that a unique representation based on edge information is not possible.
Another alternative for pose estimation is based on time-of-flight (TOF) cameras. A 3D TOF camera not only provides a brightness image as conventional cameras do, but can additionally measure the distance to the object. The camera emits infrared light which is modulated sinusoidally. In each pixel, the phase shift between the emitted light and the light reflected from the object is measured. From this phase shift, the time of flight of the light and hence the distance of the camera from the object point can be calculated. A TOF camera thus provides a depth map which is perfectly registered with the brightness image (often referred to as the “amplitude image” in TOF nomenclature). Therefore, it is an attractive sensor for a large number of applications in image processing. A TOF camera produces only a 2½-dimensional image of the scene, but it does so at a high frame rate and without needing any additional computing time.
In Zhu Y, Dariush B and Fujimura K, Controlled Human Pose Estimation from Depth Image Streams; CVPRW '08 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (2008), 1-8, a number of anatomical landmarks are tracked in three dimensions over time. The pose of an articulated human model is estimated from the three-dimensional positions of these landmarks. The model is in turn used to resolve ambiguities in the detection of the landmarks and to estimate the positions of undetected landmarks. The method of Zhu et al models constraints such as the maximum bending angles of the joints and the avoidance of the mutual penetration of various parts of the body, amongst other things. Despite the complexity of the model, the method runs at a frame rate of at least 10 frames per second.
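The relation between the measured phase shift and the distance can be illustrated by the following sketch. It uses the usual relation for sinusoidally modulated light, d = c·Δφ/(4π·f), in which the factor of two in the light path accounts for the way to the object and back. The function names and the modulation frequency of 20 MHz in the usage example are assumptions and are not taken from any particular camera.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # meters per second

def phase_to_distance(phase_shift_rad, modulation_hz):
    """Distance in meters from the phase shift between emitted and reflected light."""
    # The phase shift corresponds to the round-trip time of the light.
    round_trip_time = phase_shift_rad / (2.0 * math.pi * modulation_hz)
    # The light travels to the object and back, hence the division by two.
    return SPEED_OF_LIGHT * round_trip_time / 2.0

def unambiguous_range(modulation_hz):
    """Largest distance measurable before the phase wraps around at 2*pi."""
    return SPEED_OF_LIGHT / (2.0 * modulation_hz)

# Usage with an assumed modulation frequency of 20 MHz:
half_cycle = phase_to_distance(math.pi, 20e6)   # half of the unambiguous range
max_range = unambiguous_range(20e6)             # roughly 7.5 m
```

Because the phase is only known modulo 2π, distances beyond the unambiguous range cannot be distinguished from shorter ones without additional measures.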
However, as conventional video sequences have a frame rate of 25 Hz, a real-time-capable pose estimation at a corresponding frame rate would be desirable. More generally, the object of the invention is to make image information (color values of pixels) interpretable for machines after it has been electronically recorded or translated. The estimation of human poses is a subfield of computer vision which essentially poses two problems:
(1) The pose has to be determined rapidly and has to be updated rapidly in the event of a change. Here, a video rate of 25 Hz is desirable.
(2) The person whose pose is to be determined is usually not in an ideal environment but rather in front of an unknown or at least hardly controllable background.
In addition, the equipment required for solving the task should not be too extensive. A common PC and a relatively simple camera configuration should be sufficient.
TOF cameras, which are well known from the prior art, make it possible to solve the problem of separating foreground and background in a particularly simple manner. A TOF camera produces a 2.5-dimensional image (2 dimensions plus distance from the camera). Object points which are hidden by other object points along the line of sight of the camera cannot be detected. Only the front visible surface of an object is available as an aggregate of points in 3D space for inferring the pose of the 3D object.
In the above-mentioned paper of Weik S and Liedtke C-E, Hierarchical 3D Pose Estimation for Articulated Human Body Models from a Sequence of Volume Data; Proc. of the International Workshop on Robot Vision, 2001, no TOF camera is used. Instead, no fewer than 16 electronic cameras are used, and a monochromatic background is needed to model a person three-dimensionally by means of his or her silhouettes from different directions.
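Why a distance image simplifies the separation of foreground and background can be illustrated by the following sketch, in which every pixel closer than a threshold is regarded as foreground. The threshold, the function names and the convention that missing or non-positive values denote invalid measurements are assumptions made for illustration; practical systems may use more elaborate criteria.

```python
def segment_foreground(depth, max_distance_m):
    """Boolean mask which is True where a valid measurement is closer than the threshold.
    depth is a list of rows of distances in meters; None or values <= 0 count as invalid."""
    return [[d is not None and 0.0 < d < max_distance_m for d in row]
            for row in depth]

def foreground_points(depth, mask):
    """Foreground pixels as (column, row, distance) triples, i.e. the 2.5-dimensional data.
    A conversion into metric 3D coordinates would additionally require the
    camera parameters, which this sketch does not cover."""
    return [(x, y, depth[y][x])
            for y, row in enumerate(mask)
            for x, is_foreground in enumerate(row)
            if is_foreground]

# Usage: a person at about 1.2 m in front of a wall at 3 m
depth_image = [[3.0, 1.2],
               [1.1, 0.0]]          # 0.0 marks an invalid measurement
mask = segment_foreground(depth_image, 2.0)
points = foreground_points(depth_image, mask)
```

The resulting list contains only points on the surface facing the camera, in line with the remark above that hidden object points cannot be detected.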
In the paper of Haker M, Böhme M, Martinetz T and Barth E, Deictic gestures with a time-of-flight camera; The 8th International Gesture Workshop, Feb. 25-27, 2009, at the ZiF (Center for Interdisciplinary Research) at the Bielefeld University, Germany, a TOF camera is used to rapidly detect a person in front of an arbitrary background. However, the interpretation of the “pose” is rather rudimentary. The authors write, for example: “We find the head and hand using a simple but effective heuristic: The initial guess for the hand is the topmost pixel in the leftmost pixel column of the silhouette; the head is the topmost pixel in the tallest pixel column.”
In other words: No matter which part of the body is farthest to the right within the image, the machine regards that part as the right hand. Actually, the use of this very simple pose estimation requires that the person always holds his or her right arm clearly away from the body if he or she wants to command the machine by moving that hand, for example. The approach of Haker et al therefore cannot handle gestures such as arms folded in front of the body.
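The quoted heuristic can be sketched as follows. The silhouette is assumed to be given as a boolean mask whose row 0 is the top of the image. Interpreting the “tallest pixel column” as the column containing the most silhouette pixels is an assumption of this sketch, as are the function name and the data layout; the sketch is not the implementation of Haker et al.

```python
def find_hand_and_head(mask):
    """Initial guesses for hand and head as (column, row) positions, or (None, None)
    if the silhouette is empty. mask is a list of rows of booleans, row 0 at the top."""
    height, width = len(mask), len(mask[0])
    # For every column that contains silhouette pixels, collect their row indices.
    columns = []
    for x in range(width):
        rows = [y for y in range(height) if mask[y][x]]
        if rows:
            columns.append((x, rows))
    if not columns:
        return None, None
    # Hand: topmost pixel in the leftmost occupied column.
    hand_x, hand_rows = columns[0]
    hand = (hand_x, hand_rows[0])
    # Head: topmost pixel in the column with the most silhouette pixels
    # (assumed meaning of "tallest"; the first such column wins on ties).
    head_x, head_rows = max(columns, key=lambda column: len(column[1]))
    head = (head_x, head_rows[0])
    return hand, head

# Usage: a crude silhouette with one arm stretched out sideways
silhouette = [[c == "X" for c in row] for row in [
    "...X.",
    "...X.",
    "XXXXX",
    "...X.",
    "...X.",
]]
hand, head = find_hand_and_head(silhouette)
```

If the arm in this example were folded in front of the body, the leftmost column would belong to the torso and the guess for the hand would be wrong, which illustrates the limitation described above.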
Finally, the article of Breuer P, Eckes C and Müller S, Hand Gesture Recognition with a Novel IR Time-of-Flight Range Camera—A Pilot Study; Proceedings of the Mirage 2007, Computer Vision/Computer Graphics Collaboration Techniques and Applications, Rocquencourt, France, Mar. 28-30, 2007, pp 247-260, is intended to determine the pose of a human hand as rapidly and exactly as possible from an aggregate of points detected by a TOF camera. It uses an anatomical model of the hand, which is fit into a portion of the aggregate of points, which had been isolated in advance as representing the hand.
This paper determines seven degrees of freedom (3 coordinates, 3 angles of rotation, 1 scaling factor) to obtain the best possible fit (minimization of the cost function K). Here, the hand model itself is rigid and is not changed at any time. A rotation, for example, has an effect on all nodes of the hand model at the same time without shifting the model nodes in relation to each other.
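What a fit with seven degrees of freedom of a rigid model means can be illustrated by the following sketch. The order of the rotations, the function names and in particular the form of the cost (the sum of the squared distances from each transformed model node to the nearest measured point) are assumptions made for illustration; the cost function K of Breuer et al is not reproduced here.

```python
import math

def rotation_matrix(rx, ry, rz):
    """Rotation about the x, y and z axes, applied in that order (R = Rz * Ry * Rx)."""
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]

def transform_nodes(nodes, params):
    """Apply the same translation, rotation and scaling to all model nodes.
    params = (tx, ty, tz, rx, ry, rz, scale), i.e. seven degrees of freedom."""
    tx, ty, tz, rx, ry, rz, scale = params
    r = rotation_matrix(rx, ry, rz)
    return [
        (scale * (r[0][0] * x + r[0][1] * y + r[0][2] * z) + tx,
         scale * (r[1][0] * x + r[1][1] * y + r[1][2] * z) + ty,
         scale * (r[2][0] * x + r[2][1] * y + r[2][2] * z) + tz)
        for x, y, z in nodes
    ]

def fit_cost(nodes, params, observed):
    """Assumed cost: squared distance from each transformed node to its nearest observed point."""
    total = 0.0
    for p in transform_nodes(nodes, params):
        total += min(sum((a - b) ** 2 for a, b in zip(p, q)) for q in observed)
    return total

# Usage: a three-node model, rotated by 90 degrees about z, doubled in size and shifted
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
true_params = (1.0, 2.0, 3.0, 0.0, 0.0, math.pi / 2.0, 2.0)
measured = transform_nodes(model, true_params)
cost_at_truth = fit_cost(model, true_params, measured)
cost_at_identity = fit_cost(model, (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0), measured)
```

Since all nodes undergo the same transformation, the ratios of the distances between the nodes never change. No choice of the seven parameters can therefore reproduce a bending of individual fingers.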
The method described there might produce a good estimation of the position and of the twist of the person's hand within the 3D space. But as soon as the person moves his or her fingers distinctly, the method would no longer work reliably.