An accurate camera pose is essential information in many systems, for example, in camera systems intended to broadcast sporting events at stadiums. Some elements of a camera pose (e.g., pan, tilt, roll, position) are sometimes known, fixed, or obtainable with inexpensive sensors. Pan, tilt, and roll are rotations of the camera (about its vertical axis, its horizontal axis, and its optical axis, respectively), whereas position describes its translation. The current or instantaneous focal length of a zoomable camera is less frequently available, or is of insufficient precision for many applications. Even in cameras that make their current focal length externally available, the resolution and/or absolute accuracy of the data may be too low for the application that requires it. Thus, relying on a zoomable camera to output its own focal length has proven to be unreliable and problematic.
The extrinsic parameters of a camera include pan, tilt, roll, camera position, etc. In some cases, some or all of the elements of a first camera pose can be determined by comparing the current view of the first camera with one or more static images or models of a scene, possibly derived from cameras beforehand. In other cases, some or all of the elements of the first camera pose can be determined by comparing the current view of the first camera with concurrent images from one or more additional cameras, some of whose parameters are known. The latter technique can be more useful than basing the determination on predetermined static images or scene models, since it can adapt to changes in lighting or background.
Many systems rely upon visual recognition of pre-determined scenes to solve for focal length (and other camera parameters). However, when the current scene is not a pre-determined, expected scene, a camera pose is not calculable. This can happen when the camera is pointed away from a pre-determined scene (for instance, at the audience), or when the camera is zoomed so far in or out that the expected fiducials are too few in number, are too small to be usable, or are occluded by foreground objects.
Typically, positions of landmarks in the scene are represented by a 3D model. At intermediate zoom levels, the landmark points of the model may be matched with their corresponding feature points in the current video image. Based on these pairs of corresponding points, a homography (a projective mapping between points on a plane in 3D space and their projections in the image space) is calculated. Camera parameters are then estimated from the calculated homography. Pre-determined landmarks, however, imply the need for a scene model, which is often inconvenient or impossible to obtain. Nevertheless, a relative homography (or the homography) may be established between zoom-invariant but ad hoc feature points in simultaneous views of the same scene from two cameras.
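By way of illustration only, the calculation of a homography from pairs of corresponding points may be sketched as follows. The sketch assumes the direct linear transform (DLT), which is one common way to perform this step and is not necessarily the method used in any particular system; the function names `estimate_homography` and `apply_homography` are hypothetical, and coordinate normalization and outlier rejection are omitted for brevity.

```python
import numpy as np


def estimate_homography(model_pts, image_pts):
    """Estimate a 3x3 homography mapping planar model points to image points.

    Illustrative DLT sketch (an assumption, not a prescribed method).
    model_pts, image_pts: sequences of N >= 4 corresponding (x, y) pairs,
    no three of which should be collinear.
    """
    model_pts = np.asarray(model_pts, dtype=float)
    image_pts = np.asarray(image_pts, dtype=float)
    rows = []
    for (x, y), (u, v) in zip(model_pts, image_pts):
        # Each correspondence contributes two linear constraints on the
        # nine entries of H (defined only up to scale).
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.array(rows)
    # The solution is the right singular vector associated with the
    # smallest singular value of A.
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)
    # Fix the arbitrary scale; assumes H[2, 2] is not (near) zero.
    return H / H[2, 2]


def apply_homography(H, pts):
    """Map (x, y) points through H and return the resulting (u, v) points."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]
```

Given four or more matched pairs, `estimate_homography(model_pts, image_pts)` returns a matrix that `apply_homography` can use to project further model points into the image. With noisy feature matches, a robust estimator would typically be applied before camera parameters are derived from the result.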