1. Field
Embodiments of the invention are generally directed to techniques for analyzing individuals depicted in recorded video sequences. More specifically, embodiments of the invention are directed to modeling human-human interactions to generate 3D pose estimations from 2D (i.e., monocular) images.
2. Description of the Related Art
Computer vision refers to a variety of techniques used to extract and interpret information from images. That is, computer vision is the science and technology of programming computer systems to “see.” The computing system may use the extracted information to solve some task or to “understand” a scene depicted in the images.
One aspect of computer vision is estimating 3D geometry from 2D (i.e., monocular) images. For example, recorded video typically captures a real-world 3D scene as a sequence of 2D projections, recorded at some frame rate. Computer vision provides approaches for reconstructing the 3D structure of the scene, or other information about it, from such a 2D video sequence.
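As an illustration of why recovering 3D structure from a single view is difficult, the following minimal sketch assumes an idealized pinhole camera. The function name `project`, the `focal_length` parameter, and the sample points are hypothetical and are not part of any described embodiment. It shows that two different 3D points lying on the same viewing ray project to the same 2D image location, so depth cannot be read directly from one image.

```python
def project(point, focal_length=1.0):
    """Project a 3D point (x, y, z), given in camera coordinates,
    onto the 2D image plane of an idealized pinhole camera."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    # Perspective division: image coordinates scale inversely with depth.
    return (focal_length * x / z, focal_length * y / z)


# Two hypothetical points on the same viewing ray, one twice as far away.
near = (1.0, 2.0, 4.0)
far = (2.0, 4.0, 8.0)

# Both points land on the same image location, so depth is ambiguous.
same_pixel = project(near) == project(far)
```

Here `same_pixel` evaluates to `True`. Resolving this ambiguity generally requires additional information, such as prior knowledge of the scene or motion across frames.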
Estimating and tracking human pose has been a focal point of research in computer vision for some time. Despite much progress, most research has focused on estimating pose for single, well-separated subjects. Occlusions and part-person ambiguities that arise when two people are in close proximity to one another make pose inference for interacting people a challenging task. One approach, tracking-by-detection, has shown promising results in some real-world scenarios, but is typically restricted to tracking individual people performing simple cyclic activities (e.g., walking or running). Despite these successes, tracking-by-detection methods generally ignore contextual information provided by the scene, objects, and other people in the scene. As a result, in close interactions, independent pose estimates for multiple individuals compete with one another, significantly degrading the overall performance of pose estimation tools.