Human pose estimation is crucial for many computer vision applications, including human-computer interaction, activity recognition and video surveillance. It is a very challenging problem due to large appearance variance, the non-rigidity of the human body, different viewpoints, cluttered backgrounds, self-occlusion, etc. Single image-based pose estimation methods known in the art can be applied to each video frame to generate initial pose estimations, and a further refinement across frames can be applied to make the pose estimations consistent and more accurate. However, due to the innate complexity of video data, the problem formulations of most video-based human pose estimation methods are very complex (usually NP-hard). Approximate solutions have therefore been proposed, which yield sub-optimal results. Furthermore, most of the existing methods model body parts as a tree structure, and these methods tend to suffer from double counting issues, wherein symmetric parts, for instance left and right ankles, are easily mixed together.
Kinect is known in the art as a motion sensing input device that can be used with Microsoft® Xbox 360 and Xbox One video game consoles and with Windows® PCs. Kinect utilizes a webcam-style add-on peripheral that allows users to control and interact with their console/computer without the need for a hand-held game controller. In general, the webcam provides an unconstrained video and the motion sensing input device provides a user interface to the gaming system using human body poses and gestures.
In the computer/digital gaming industries, such as in systems using unconstrained video and motion sensing input devices, it is very important to estimate human poses to provide a better human-computer interface. Additionally, in the fields of video surveillance and action/activity recognition, it is also crucial to be able to estimate human poses in unconstrained video feeds to allow further automatic analysis of the video.
Systems requiring video cameras and complex motion sensing input devices are prohibitively expensive, which severely limits the application of the systems. In addition, other human pose estimation methods known in the art that utilize a standard video camera are mainly designed for pose estimation in still images, rather than in video.
Accordingly, what is needed in the art is a more efficient and cost-effective solution for estimating human poses in unconstrained video.