1. Introduction
Pose estimation, or motion capture, is a fundamental problem in computer vision and graphics [13, 14] with many applications, such as character animation in games and movies, controller-free interfaces for games [12], and surveillance. Due to the complexity of the problem, no universal solution yet exists for all applications. Existing solutions depend strongly on the conditions and constraints imposed on the setup. In general, the more constraints are imposed on the setup, the more accurately the pose can be estimated. In real-world scenarios it is often very difficult to impose such constraints, yet many practical applications are based on these scenarios. For instance, Germann et al. [11] show how accurate pose estimation can be used for high-quality rendering of players from an arbitrary viewpoint during a sports game, using only video footage already available in TV broadcasts. In addition to applications in rendering, accurate pose estimation of players during a game can also be used for biomechanical analysis and synthesis, for game statistics, or even for porting real game play into a computer game.
2. Related Work
Many commercially available motion capture systems [22] use optical markers placed all over the body to track motion over time. These systems are very accurate and can capture all kinds of body poses as well as facial expressions. However, they are invasive and work only in controlled environments. Therefore, they are suitable only for a specific range of applications.
Markerless motion capture methods have received a lot of attention in the last decade [13, 14]. Based on the type of footage used, the markerless pose reconstruction (or motion capture) problem can be roughly categorized into two groups [24]: methods using video sequences from one camera and methods using footage from multiple calibrated cameras. Pose estimation from monocular video sequences [2, 3, 24, 17, 1, 18] can be more convenient for some applications as it imposes fewer restrictions on the user, but it has an inherent depth ambiguity. This ambiguity can be resolved using structure-from-motion approaches, which is itself a very difficult problem in vision [13, 14]. Structure-from-motion algorithms typically rely on high-resolution scenes containing a lot of detail, which is typically not available in our setup, that is, in sports scenes. Efros et al. [9] also process soccer footage. Even though their work focuses more on action detection, they showed that a rough 2D pose can be estimated even from low-resolution data.
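The depth ambiguity of monocular footage can be illustrated with a minimal sketch. It assumes an ideal pinhole camera; the focal length, the function names, and the joint coordinates are hypothetical and are not taken from any of the cited methods. Any two 3D points on the same viewing ray project to the same pixel, so a single view cannot tell them apart.

```python
import math


def project(point, focal_length=1000.0):
    """Project a 3D point (x, y, z), given in camera coordinates,
    onto the image plane of an ideal pinhole camera.
    The focal length (in pixels) is an illustrative assumption."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (focal_length * x / z, focal_length * y / z)


def scale_along_ray(point, s):
    """Move a point along its viewing ray through the camera center."""
    x, y, z = point
    return (s * x, s * y, s * z)


# A hypothetical joint position 10 m from the camera, and the same
# joint pushed to twice that distance along the same viewing ray.
near_joint = (0.5, -0.25, 10.0)
far_joint = scale_along_ray(near_joint, 2.0)

u_near, v_near = project(near_joint)
u_far, v_far = project(far_joint)

# Both candidates land on the same pixel: depth is not observable
# from this single view.
same_pixel = math.isclose(u_near, u_far) and math.isclose(v_near, v_far)
print(same_pixel)
```

A second calibrated camera with a different viewing direction would generally project the two candidates to different pixels, which is why multi-view footage removes this ambiguity.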
Another major challenge in pose estimation is occlusion. If the footage comes from a single camera, occlusions are very difficult to resolve. Using multiple cameras increases the probability of having an unoccluded view of the same subject. The higher the spatial coverage by cameras, the fewer ambiguities remain. Moreover, sports broadcasts already use multiple cameras on the field. Therefore, we can leverage this footage to compute a more accurate 3D pose estimate.
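The benefit of additional cameras can be made concrete with a rough back-of-the-envelope sketch. It assumes that each camera's view of a subject is occluded independently with the same probability, which is a simplification introduced here for illustration only (occlusions seen by neighboring cameras are usually correlated). The function name and the numbers are hypothetical.

```python
def prob_unoccluded_view(p_occluded, num_cameras):
    """Probability that at least one of num_cameras sees the subject
    unoccluded, ASSUMING independent occlusions with equal probability
    p_occluded per camera (an illustrative simplification)."""
    if not 0.0 <= p_occluded <= 1.0:
        raise ValueError("p_occluded must lie in [0, 1]")
    if num_cameras < 1:
        raise ValueError("num_cameras must be at least 1")
    # All cameras are occluded with probability p^n; take the complement.
    return 1.0 - p_occluded ** num_cameras


# Hypothetical example: each view is occluded half of the time.
for n in (1, 2, 4):
    print(n, prob_unoccluded_view(0.5, n))
```

Under this assumption the chance of having no usable view shrinks geometrically with the number of cameras, which matches the qualitative argument above.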
Most methods for multi-view 3D pose estimation use tracking algorithms to reconstruct the pose at time t from the pose at time t−1 [4]. The tracking can be done using either optical flow [4] or stereo matching [6]. These methods can provide very accurate pose estimates, but they generally work only in a controlled environment and require a large number of high-resolution cameras (usually at least four) and good spatial coverage of the scene (usually circular coverage) to resolve ambiguities due to occlusions.
Other methods [21, 8, 23] construct a proxy geometry using either multi-view silhouettes or multi-view stereo. The skeleton is then fitted to this geometry. These methods provide very good results but impose restrictions on the setup: they require a carefully built studio, many high-resolution cameras, and very good spatial coverage.
Another class of algorithms is based on image analysis and segmentation [15, 10]. These algorithms use machine learning methods to discriminate between body parts. This analysis generally requires high-resolution footage, which is not available in our setup.