1. Technical Field
The present invention relates generally to an apparatus and method for estimating the joint structure of a human body. More particularly, the present invention relates to an apparatus and method that estimate the skeletal structure of a human body in an arbitrary posture within a specific space by using multi-view images acquired by multiple cameras arranged around the human body.
2. Description of the Related Art
Technology for modeling the skeletal structure of an entity based on a skeletal system is used to estimate the joint positions, skeletal structure, posture information, etc. of an actual skeletal system from information about how the surface shape of the entity deforms as its joints move.
In relation to this, conventional technologies include a paper published by Pin-Chou Liu, Fu-Che Wu, Wan-Chun Ma, Rung-Huei Liang, and Ming Ouhyoung and entitled “Automatic Animation Skeleton Construction Using Repulsive Force Field” (IEEE Trans. Proceedings of the 11th Pacific Conference on Computer Graphics and Applications, October 2003, pp. 409-413) (hereinafter referred to as “Pin-Chou Liu”), and a paper published by Lawson Wade and Richard E. Parent and entitled “Automated Generation of Control Skeletons for Use in Animation” (The Visual Computer, vol. 18, no. 2, March 2002, pp. 97-110). These papers disclose a scheme for realizing a three-dimensional (3D) animation of an entity in the field of computer graphics. The scheme extracts a 3D skeleton from a 3D polygon model obtained by modeling the surface shape of the entity, thereby estimating a 3D skeletal structure suitable for the shape of the entity, and binds the estimated skeletal structure to the individual vertices constituting the polygon model, so that the surface shape of the entity is controlled via the control of joints.
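The binding step described above can be illustrated with a minimal sketch. The sketch assumes a simplified 2D rigid binding in which each vertex follows only its nearest joint; this is an illustrative assumption and is not the binding method of the cited papers. The function names `bind_vertices` and `pose_vertices` are hypothetical.

```python
import math

def bind_vertices(vertices, joints):
    """Bind each vertex to its nearest joint and store its offset from that joint.

    Assumption: rigid binding to a single joint, in 2D, for illustration only.
    """
    bindings = []
    for vx, vy in vertices:
        index = min(
            range(len(joints)),
            key=lambda j: (vx - joints[j][0]) ** 2 + (vy - joints[j][1]) ** 2,
        )
        jx, jy = joints[index]
        bindings.append((index, (vx - jx, vy - jy)))
    return bindings

def pose_vertices(bindings, posed_joints, rotations):
    """Move each vertex with its joint: rotate the stored offset, then translate."""
    posed = []
    for index, (ox, oy) in bindings:
        c, s = math.cos(rotations[index]), math.sin(rotations[index])
        jx, jy = posed_joints[index]
        posed.append((jx + c * ox - s * oy, jy + s * ox + c * oy))
    return posed

# Two joints on the x-axis and one surface vertex near each of them.
joints = [(0.0, 0.0), (1.0, 0.0)]
vertices = [(0.1, 0.0), (1.1, 0.0)]
bindings = bind_vertices(vertices, joints)

# Rotating only the second joint by 90 degrees moves only the vertex bound to it.
posed = pose_vertices(bindings, joints, [0.0, math.pi / 2])
```

In this sketch the surface is never edited directly; changing a joint rotation is enough to move every vertex bound to that joint, which is the sense in which the surface shape is controlled via the joints.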
Further, in the field of computer vision, in order to recognize an action using the motion capture of an entity, information about the deformation of the 3D shape caused by the motion of the entity is acquired by various camera sensors, and the 3D shape information of the entity is estimated from the acquired image information. The positions and postures of the individual joints are then estimated from the estimated 3D shape information on the basis of a predefined skeletal structure of the entity, so that the action of the entity is analyzed.
The above two approaches are similar to each other in that both estimate the skeletal structure of an entity, but differ from each other in how the shape information used to estimate the skeletal structure is defined and in the characteristics of the resulting skeletal structure.
The estimation of a skeletal structure from a polygon model, which is mainly used in the field of graphics, is implemented on the assumption that ideal 3D surface shape information of an entity is input. In contrast, the estimation of a skeletal structure in the field of computer vision obtains surface shape information from image information about an actual entity acquired using an image sensor, so that there is always the possibility that the surface shape information of the actual entity will be distorted. As a result, a problem arises in that when an approach used to estimate a skeletal structure in the field of graphics is applied unchanged to the field of computer vision, it is difficult to accurately estimate a skeletal structure.
To cope with this information distortion, that is, to ensure robust estimation of the skeletal structure of an entity and also to recognize an action by estimating the postures of the joints in the skeletal structure, most technologies in the field of computer vision predefine the 3D shape information and the skeletal structure of the entity whose skeletal structure is to be estimated, and then control the postures of the respective joints of the predefined skeletal structure. Posture control values for the joints are detected such that they minimize the difference between the simulated shape deformation of the predefined shape information model and the shape information obtained from input image information, and the skeletal structure of the entity is thereby estimated. In this case, the predefined shape information has in most cases been distorted shape information of the entity obtained from 3D scanning or from image information of the entity. For the skeletal structure, a skeletal structure model including information about the positions or lengths of joints predefined by a user in accordance with the actual skeleton of the entity has been used.
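The model-fitting idea described above can be sketched as follows. The sketch assumes a planar chain of joints with predefined link lengths, observed 2D joint positions standing in for the shape information obtained from images, and a simple coordinate-wise search over the joint angles. All function names, the 2D simplification, and the search procedure are illustrative assumptions and do not reproduce any particular conventional technology.

```python
import math

def forward_kinematics(angles, lengths):
    """Return the 2D end point of each link of a planar chain rooted at the origin."""
    x, y, total = 0.0, 0.0, 0.0
    points = []
    for angle, length in zip(angles, lengths):
        total += angle  # each joint angle is relative to the previous link
        x += length * math.cos(total)
        y += length * math.sin(total)
        points.append((x, y))
    return points

def shape_error(angles, lengths, observed):
    """Sum of squared distances between the posed model and the observed points."""
    model = forward_kinematics(angles, lengths)
    return sum(
        (mx - ox) ** 2 + (my - oy) ** 2
        for (mx, my), (ox, oy) in zip(model, observed)
    )

def fit_pose(lengths, observed, iterations=500, step=0.5):
    """Search for posture control values (joint angles) that minimize shape_error.

    Assumption: a coordinate-wise search that halves its step size whenever no
    single-angle change reduces the error. It may stop in a local minimum.
    """
    angles = [0.0] * len(lengths)
    best = shape_error(angles, lengths, observed)
    for _ in range(iterations):
        improved = False
        for i in range(len(angles)):
            for delta in (step, -step):
                trial = list(angles)
                trial[i] += delta
                error = shape_error(trial, lengths, observed)
                if error < best:
                    angles, best = trial, error
                    improved = True
        if not improved:
            step *= 0.5
    return angles, best

# Predefined skeleton (link lengths) and a synthetic observation of an unknown pose.
lengths = [1.0, 0.8]
observed = forward_kinematics([0.6, -0.4], lengths)
estimated_angles, residual = fit_pose(lengths, observed)
```

Because the link lengths are fixed in advance, the search can only change the joint angles. This mirrors the dependence on a user-predefined skeletal structure noted above: if the predefined lengths do not match the actual entity, the residual cannot be driven to zero by posture control alone.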
Further, such approaches in the field of computer vision have mainly used tracking methods that depend on the joint information of temporally adjacent image frames to estimate the positions of joints in the skeletal structure, that is, to capture motions. However, these methods are problematic in that once tracking is performed erroneously on one image frame, the error propagates to the subsequent image frames.
In order to solve the above problems, a paper published by Jamie Shotton et al. and entitled “Real-Time Human Pose Recognition in Parts from Single Depth Images” (presented at IEEE Computer Vision and Pattern Recognition 2011, June 2011) proposes a method of estimating the postures of joints independently for each image frame by using data-based training.