It is likely that in the near future three-dimensional (3D) display devices will become increasingly common in home and business environments. Such devices are either stereoscopic, which require the user to wear special glasses to see the 3D image, or autostereoscopic, which do not require any special glasses in order to see the 3D image. To create a 3D image, two different 2D images are needed, one of which is provided to the left eye of the user and the other to the right eye of the user. It is also sufficient to provide a single image together with either a depth map or a disparity map, which contains sufficient information to allow the second image to be generated. This latter solution has a number of advantages because it allows more flexibility in the final delivery of the 3D image.
However, at the present time, and for the foreseeable future, most images and video will be generated as 2D image frames. In order to create a 3D image when the original source is a 2D image, a depth map needs to be created. This depth map can be used to create the second image, or to create a disparity map for a second image.
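As an illustration of how a depth map can yield a disparity map and a second image, the following is a minimal sketch for a single image row. It assumes an 8-bit depth convention in which 255 is nearest to the viewer, a linear depth-to-disparity mapping, and a simple nearest-neighbour hole filling; the function names and the `max_disparity` parameter are hypothetical and are not taken from any of the referenced works.

```python
def depth_to_disparity(depth_row, max_disparity):
    """Map depth values in [0, 255] (assumed: 255 = nearest) to integer
    pixel disparities in [0, max_disparity] using a linear mapping."""
    return [round(d / 255 * max_disparity) for d in depth_row]


def render_second_view(image_row, disparity_row):
    """Shift each pixel left by its disparity to synthesise one row of the
    second view. Where several source pixels land on the same target, the
    one with the larger disparity (the nearer one) wins."""
    width = len(image_row)
    out = [None] * width
    best = [-1] * width  # disparity of the pixel currently stored per target
    for x, (pixel, disp) in enumerate(zip(image_row, disparity_row)):
        tx = x - disp
        if 0 <= tx < width and disp > best[tx]:
            out[tx] = pixel
            best[tx] = disp

    # Fill disocclusion holes. With a leftward shift the uncovered area lies
    # to the right of a near object, so the right neighbour (assumed to be
    # background) is tried first, then the left neighbour for trailing holes.
    nxt = None
    for x in range(width - 1, -1, -1):
        if out[x] is None:
            out[x] = nxt
        else:
            nxt = out[x]
    prev = None
    for x in range(width):
        if out[x] is None:
            out[x] = prev
        else:
            prev = out[x]
    return out


if __name__ == "__main__":
    row = [10, 20, 30, 40, 50]
    depth = [0, 0, 255, 0, 0]          # one near pixel on a far background
    disparity = depth_to_disparity(depth, max_disparity=1)
    print(disparity)
    print(render_second_view(row, disparity))
```

In this sketch the near pixel moves one position to the left and the uncovered position is filled from the background on its right. A practical renderer would operate on full frames and use a more careful hole-filling strategy.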
Much research has been performed recently on the topic of soccer analysis and conversion of 2D soccer video to 3D [see references 1 to 4]. Most of these approaches estimate a 3D model from the available data. Several approaches use multiple cameras that are manually or automatically calibrated [see references 1 and 4]. Generally, the calibration is done using intersections of the lines visible on the soccer field. This works well mainly in the area around the goals, where many lines are visible in a camera view. The method can be extended to the centre of the field by adding an ellipse detection method. Such an approach is less effective when very few lines (or no lines at all) are visible in a view. In this case, it is possible to use motion estimation to compute the homography between subsequent frames.
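The two geometric operations mentioned above, intersecting field lines and computing a homography, can be sketched as follows. This is a minimal illustration only, assuming exactly four point correspondences (for example line intersections matched to a field model, or points matched between subsequent frames), a homography normalised so that its bottom-right entry is 1, and noise-free coordinates. All function names are hypothetical; the referenced works may proceed differently.

```python
def cross(a, b):
    """Cross product of two homogeneous 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def line_through(p, q):
    """Homogeneous line through two 2D points."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def intersect(l1, l2):
    """Intersection point of two homogeneous lines."""
    x, y, w = cross(l1, l2)
    if abs(w) < 1e-12:
        raise ValueError("lines are parallel")
    return (x / w, y / w)


def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def estimate_homography(src, dst):
    """Homography mapping four src points onto four dst points,
    with the bottom-right entry fixed to 1 (an assumption of this sketch)."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    h = solve_linear(rows, rhs)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(h, p):
    """Map a 2D point through the 3x3 homography h."""
    x, y = p
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


if __name__ == "__main__":
    corner = intersect(line_through((0, 0), (2, 2)),
                       line_through((0, 2), (2, 0)))
    print(corner)
    h = estimate_homography([(0, 0), (1, 0), (1, 1), (0, 1)],
                            [(3, 4), (5, 4), (5, 6), (3, 6)])
    print(apply_homography(h, (0.5, 0.5)))
```

With more than four correspondences, or with noisy detections, a least-squares or robust estimator would normally replace the exact solve used here.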
In generating a depth map, the next step is that the players and the ball are detected and their 3D positions are estimated, usually using colour segmentation [reference 2]. Liu et al. use a Gaussian Mixture Model to detect the playfield [reference 2], while it is also known to use a histogram-based approach combining the HSI and RGB colour-spaces. It is also possible to use the colours of the shirts and pants of both teams, and to detect combinations of shirts and pants. In this case it is possible to track multiple players that occlude each other separately, using the colour of the shirts, their relative vertical position, and/or the average velocity of the players. The position of the ball can be easily estimated when it is on the ground, but is difficult to estimate when it is in the air. In such a case, a parabolic trajectory is typically assumed, and therefore the two points where the ball touches the ground are required. Liu et al. manually indicate such points [reference 2]. A different solution is to use multiple cameras, or a single camera and the change in direction of the ball when it touches the ground.
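The parabolic trajectory assumption can be illustrated with a short sketch. It assumes that the two ground contact points are already known in field coordinates (in metres), that the flight time between them is known, and that the ball moves under gravity alone with no drag or spin. The function name and parameters are hypothetical.

```python
G = 9.81  # gravitational acceleration in m/s^2


def ball_position(p_start, p_end, flight_time, t):
    """3D position of the ball at time t (seconds after leaving the ground).

    p_start and p_end are the (x, y) field positions where the ball leaves
    and returns to the ground; flight_time is the time between them.
    The horizontal motion is uniform and the height follows a parabola
    that is zero at t = 0 and at t = flight_time.
    """
    if flight_time <= 0:
        raise ValueError("flight_time must be positive")
    if not 0 <= t <= flight_time:
        raise ValueError("t must lie within the flight")
    s = t / flight_time
    x = p_start[0] + s * (p_end[0] - p_start[0])
    y = p_start[1] + s * (p_end[1] - p_start[1])
    z = 0.5 * G * t * (flight_time - t)
    return (x, y, z)


if __name__ == "__main__":
    # A 2-second flight; the highest point is reached halfway through.
    print(ball_position((0.0, 0.0), (20.0, 10.0), 2.0, 1.0))
```

The height term follows from z(t) = v t - G t^2 / 2 with the launch speed v chosen so that the ball returns to the ground at the end of the flight, which gives v = G * flight_time / 2.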
The main application in those works is free viewpoint video, where a user can choose a view from an arbitrary viewpoint, interpolated from the viewpoints captured at the fixed camera positions. In such a case, a 3D reconstruction of the field, players and ball is often built from the input data. This process of placing the players and ball in the correct position on a virtual 3D soccer field imposes additional requirements relating to pose estimation of the players, or to matting for more precise segmentation. If a player is not detected, that player cannot be placed correctly on the 3D model.
In an application such as 3D TV, the main goal is to produce visually pleasing depth images. The constraints for such an application are different from those for free viewpoint video, and often less strict. For example, a player that is not detected receives the same depth values as the surrounding field pixels. This diminishes the depth effect, and gives a local distortion, but it does not create artefacts such as those that arise when a full 3D model is reconstructed. However, high robustness and temporal stability are needed for a pleasing viewing experience.
The main problem with the existing methods is their failure for specific types of scenes. As described above, the camera calibration works well for scenes containing the goal, but performs much worse when a central part of the field is captured. Another problem of the existing methods is that they sometimes require setups with special cameras, for which costly adaptations to the capturing infrastructure need to be made.
The focus of the prior art algorithms is on producing correct 3D models, not on the 3D impression and robustness. To overcome issues with instability or weak points in the algorithms, many of the presented algorithms require a considerable amount of manual intervention to indicate, for example, the line intersections, the points where the ball leaves and returns to the ground at either end of a trajectory through the air, or corrections for players that are difficult to segment.