Accurate camera registration (e.g., data indicating a location and positioning of a camera) is required for many current applications that use video data, such as augmented reality and three-dimensional (3D) reconstruction. To generate camera registration data for a camera in a well-textured environment, many current technologies use multiple sensors to supplement data obtained from the camera. However, some environments make generating camera registration data in this manner difficult. For example, environments such as sports arenas may have many spatial markings, like well-marked lines. But such markings are typically configured in very repetitive patterns and with little texture differentiation from their surrounding surfaces, thus making such markings of limited utility in assisting with camera registration. Moreover, such environments may have poor lighting conditions and frequent, moving occlusions (e.g., players, a ball).
Keypoint-based technologies that are commonly used for camera registration will frequently fail in environments like a sports arena. Edge and line information may be used as an alternative method of determining camera registration information, but methods using edge and line information require sensitive parametrization based on vanishing points, tend to be very slow, and do not generalize well to various scenarios. In another alternative, direct regression from an image to camera pose information can be used, but such methods fail to provide the needed accuracy.
The dimensions of a sports field or arena are likely to be known and 3D models of such environments may be available. Attempts have been made to identify projections of specific parts of such 3D models in an image and establish 3D-to-2D correspondences. These correspondences have been used to compute one or more camera parameters. However, the patterns in sports arenas and fields may be repetitive and the lighting may be poor. Occlusions may frequently be present in such environments and may be nearly constantly moving (e.g., players on the field or in the arena). Therefore, 3D-to-2D correspondences established using traditional methods, such as Scale-Invariant Feature Transform (SIFT), Speeded-Up Robust Features (SURF), or Binary Robust Independent Elementary Features (BRIEF), may be unreliable, resulting in the frequent failure of camera pose estimation approaches that use these methods.
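As an illustration of how such correspondences may yield camera parameters, the following minimal sketch recovers a field-to-image homography from exactly four correspondences. It assumes a planar field (Z = 0), noise-free correspondences, and a homography normalized so that its last entry is 1. The function names (solve_linear, homography_from_four, project) are hypothetical and are not drawn from any of the approaches described above.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Matrix = List[List[float]]


def solve_linear(a: Matrix, b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate correspondences (e.g., collinear points)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def homography_from_four(field: Sequence[Point], image: Sequence[Point]) -> Matrix:
    """Estimate H (with H[2][2] = 1) mapping field-plane points to pixels."""
    if len(field) != 4 or len(image) != 4:
        raise ValueError("this sketch expects exactly four correspondences")
    a: Matrix = []
    b: List[float] = []
    for (X, Y), (u, v) in zip(field, image):
        # u * (h31 X + h32 Y + 1) = h11 X + h12 Y + h13, and likewise for v
        a.append([X, Y, 1.0, 0.0, 0.0, 0.0, -u * X, -u * Y])
        b.append(u)
        a.append([0.0, 0.0, 0.0, X, Y, 1.0, -v * X, -v * Y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def project(H: Matrix, p: Point) -> Point:
    """Apply homography H to a field-plane point and dehomogenize."""
    X, Y = p
    w = H[2][0] * X + H[2][1] * Y + H[2][2]
    u = (H[0][0] * X + H[0][1] * Y + H[0][2]) / w
    v = (H[1][0] * X + H[1][1] * Y + H[1][2]) / w
    return (u, v)
```

Because the sketch uses the minimum number of correspondences and no outlier rejection, a single mismatched correspondence corrupts the result, which is consistent with the unreliability noted above when matches are established on repetitive, poorly textured markings.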
Other attempts have been made to overcome the difficulties of camera pose estimation in sporting environments by leveraging the specificities of sports fields or arenas without resorting to the use of additional sensors. In a soccer field example, the field may be large and the lines delimiting the field may be widely separated. As a result, in many camera views, too few lines may be visible for reliable camera registration. One effort to address this problem uses a two-point method, which, while potentially effective, may be very restrictive because such a method may require prior knowledge of a position and a rotation axis of a camera.
A mathematical characterization of a feature of a sporting environment, such as a central circle of a soccer field, has been used to assist in overcoming a shortage of features and may help estimate a homography. Similarly, points, lines, and/or ellipses may be used to localize sporting environments. While these methods may be effective for views of a specific sporting environment or type of environment, such methods lack general applicability to varying types of environments.
Homography estimation may use a dictionary of precomputed synthetic edge images and corresponding poses. For example, for a given input image, a nearest-neighbor search may be performed to locate a most similar neighbor stored in a database. When used with a video sequence, such homography estimation techniques may enforce temporal consistency and smoothness of the generated homography estimates over time. However, a limiting factor of such techniques is the variability of potential poses of figures in an image and neighbor figures that may be used to locate nearest neighbors in a database, which may require a very large dictionary.
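The dictionary lookup described above can be sketched as a brute-force nearest-neighbor search. In this minimal example, each dictionary entry pairs a flattened binary edge image with a pose label, and similarity is measured by Hamming distance. Both the distance measure and the names (hamming, nearest_pose) are assumptions made for illustration; the techniques described above may use different features and distance measures.

```python
from typing import Hashable, List, Sequence, Tuple

EdgeImage = Sequence[int]  # flattened binary edge map, 1 = edge pixel
Entry = Tuple[EdgeImage, Hashable]  # (synthetic edge image, pose label)


def hamming(a: EdgeImage, b: EdgeImage) -> int:
    """Count the pixels at which two equally sized edge maps differ."""
    if len(a) != len(b):
        raise ValueError("edge images must have the same size")
    return sum(x != y for x, y in zip(a, b))


def nearest_pose(query: EdgeImage, dictionary: List[Entry]) -> Tuple[Hashable, int]:
    """Return the pose of the closest dictionary entry and its distance."""
    if not dictionary:
        raise ValueError("dictionary is empty")
    best_image, best_pose = min(dictionary, key=lambda e: hamming(query, e[0]))
    return best_pose, hamming(query, best_image)
```

The search cost grows linearly with the number of entries, so a dictionary large enough to cover the full range of poses makes each lookup correspondingly more expensive.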
In another attempt at pose estimation, the homography relating an image plane to a soccer field has been estimated using a branch-and-bound inference in a Markov random field (MRF) whose energy may be minimized when the image and a generative model agree. The image may first be segmented using a deep network to locate lines, circles, and grassy areas, and vanishing points may then be estimated. The estimated vanishing points may be used to constrain the search for a homography matrix and accelerate energy minimization. However, dependence upon correct estimations of vanishing points reveals a vulnerability in this approach because vanishing point estimation computations are known to be error-prone, especially when severe perspective distortion is present.
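For context, a vanishing point may be computed as the intersection of two image lines that correspond to parallel lines in the scene. The following minimal sketch uses homogeneous coordinates, in which both the line through two points and the intersection of two lines are given by a cross product. The function names are hypothetical, and the sketch is not the estimation procedure used in the approach described above.

```python
from typing import Optional, Tuple

Point = Tuple[float, float]
Vec3 = Tuple[float, float, float]


def cross(a: Vec3, b: Vec3) -> Vec3:
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def line_through(p: Point, q: Point) -> Vec3:
    """Homogeneous line (a, b, c), with a x + b y + c = 0, through p and q."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def vanishing_point(seg1: Tuple[Point, Point],
                    seg2: Tuple[Point, Point],
                    eps: float = 1e-12) -> Optional[Point]:
    """Intersect the lines through two segments; None if parallel in the image."""
    x, y, w = cross(line_through(*seg1), line_through(*seg2))
    if abs(w) < eps:
        return None  # the lines meet at infinity
    return (x / w, y / w)
```

When the two image lines are nearly parallel, the third homogeneous coordinate is close to zero, so small errors in the detected line endpoints may move the computed vanishing point by a large distance.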