Automatic tracking of objects from one or more cameras is a prominent area within the field of computer vision. Typically, it involves calibrating the camera, modeling the scene background, detecting the foreground regions, and employing a known tracking algorithm to derive the instantaneous location of objects within the field of view of the camera. Tracking systems are widely employed for applications such as defense and civil surveillance, traffic control, and game enhancement. In the case of game enhancement, player tracking systems in a sporting event can provide game statistics that may be presented to viewers, coaches, or players during a live broadcast or later for offline analysis and storage. Another use for player tracking from a video of a sporting event is annotation: the locations of players may be highlighted and their maneuvers on the court or field may be trailed. Throughout this disclosure, “court” will be used to encompass a court such as a basketball court, a field such as a football field, a rink such as a hockey rink, or any other defined area on which a sport may be played.
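The background-modeling and foreground-detection steps mentioned above can be illustrated with a minimal sketch. It assumes grayscale frames held as NumPy arrays, a running-average background model, and a fixed intensity threshold; the function names and the `alpha` and `threshold` parameters are hypothetical choices for illustration, not a method taken from this disclosure.

```python
import numpy as np


def update_background(background, frame, alpha=0.05):
    """Blend the new frame into a running-average background model.

    alpha is an assumed learning rate: larger values adapt faster.
    """
    return (1.0 - alpha) * background + alpha * np.asarray(frame, dtype=float)


def foreground_mask(background, frame, threshold=25.0):
    """Mark pixels whose intensity differs from the background by more than threshold."""
    difference = np.abs(np.asarray(frame, dtype=float) - background)
    return difference > threshold
```

In practice the resulting mask would typically be cleaned up and grouped into connected regions before any tracking step; that is left out of this sketch.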
Tracking of objects or image patterns is generally achieved through an analysis of their corresponding image regions in each video frame. Based on a metric measured between an object's model and the descriptors of the image regions (foregrounds), the most likely current location of the object is estimated in image-space coordinates. To derive the real-world location of a tracked object, the camera's parameters (its model) must be known. A camera's model is generally obtained through a calibration process carried out before the event, and in the case of a non-stationary camera (the broadcast camera, for example) this model should be updated for each frame as the camera's point of view varies.
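As an illustration of the metric-based matching described above, the following sketch assumes that both the object's model and each foreground region are described by an intensity or color histogram, and that the metric is the Hellinger distance between normalized histograms. The descriptor, the metric, and the names `hellinger_distance` and `locate_object` are assumptions made for this example; the disclosure does not prescribe them.

```python
import numpy as np


def hellinger_distance(p, q):
    """Distance in [0, 1] between two histograms (0 means identical shape)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    overlap = np.sum(np.sqrt(p * q))  # Bhattacharyya coefficient
    return float(np.sqrt(max(0.0, 1.0 - overlap)))


def locate_object(model_hist, regions):
    """Return (centroid, distance) of the region that best matches the model.

    regions is a list of (centroid_xy, histogram) pairs, one per foreground
    region detected in the current frame.
    """
    if not regions:
        raise ValueError("no foreground regions to match against")
    scored = [(hellinger_distance(model_hist, hist), centroid)
              for centroid, hist in regions]
    best_distance, best_centroid = min(scored, key=lambda item: item[0])
    return best_centroid, best_distance
```

The returned centroid is in image-space coordinates; converting it to a real-world location still requires the camera model.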
Computing the camera model may require prior knowledge of the scene (such as a 3D model of the game court). The 3D model of the scene is then aligned with the current image frame to allow for the computation of the camera's parameters. This alignment may be done using a search algorithm that recognizes the image projections of features from the real-world scene (such as junction/corner points, lines, and conics). Then, an alignment (registration) method may be employed to find the mathematical transformation (homography) that maps these features from their known 3D locations in the scene to their corresponding image projections in the video frame. In the case where the camera's pose changes, the features' locations in the image frames may be tracked through time so that the homography can be updated. Methods known in the art derive the camera's parameters (e.g., focal distance, tilt, pan, and orientation) from a given homography. There are two drawbacks to this approach: 1) prior knowledge of the scene is required, and 2) strong and distinctive features need to be present in the field of view to obtain reliable feature recognition. Furthermore, the features should lie on a common plane so that a homography can be computed for the case of a moving camera.
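The registration step described above can be sketched with the direct linear transform (DLT), one commonly used way to estimate a homography from matched points. The sketch assumes at least four correspondences between points on the court plane (in court coordinates) and their image projections (in pixels), with no three of them collinear, and applies a standard point normalization for numerical stability. The function names are hypothetical, and this is not presented as the registration method of any particular system.

```python
import numpy as np


def apply_homography(h, pts):
    """Map Nx2 points through a 3x3 homography, including the perspective divide."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ h.T
    return mapped[:, :2] / mapped[:, 2:3]


def _normalizing_transform(pts):
    """Similarity transform moving the centroid to the origin with mean distance sqrt(2)."""
    centroid = pts.mean(axis=0)
    scale = np.sqrt(2.0) / np.linalg.norm(pts - centroid, axis=1).mean()
    return np.array([[scale, 0.0, -scale * centroid[0]],
                     [0.0, scale, -scale * centroid[1]],
                     [0.0, 0.0, 1.0]])


def estimate_homography(court_pts, image_pts):
    """Estimate H such that image points ~ H @ court points (homogeneous coordinates)."""
    court_pts = np.asarray(court_pts, dtype=float)
    image_pts = np.asarray(image_pts, dtype=float)
    if court_pts.shape != image_pts.shape or len(court_pts) < 4:
        raise ValueError("need at least four matching 2D point pairs")

    # Normalize both point sets to improve the conditioning of the linear system.
    t_court = _normalizing_transform(court_pts)
    t_image = _normalizing_transform(image_pts)
    src = apply_homography(t_court, court_pts)
    dst = apply_homography(t_image, image_pts)

    # Each correspondence contributes two linear constraints on the 9 entries of H.
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1.0, 0.0, 0.0, 0.0, u * x, u * y, u])
        rows.append([0.0, 0.0, 0.0, -x, -y, -1.0, v * x, v * y, v])

    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.array(rows))
    h_norm = vt[-1].reshape(3, 3)

    # Undo the normalization and fix the overall scale (assumes h[2, 2] is nonzero).
    h = np.linalg.inv(t_image) @ h_norm @ t_court
    return h / h[2, 2]
```

For a moving camera, the same estimate would be recomputed for each frame from the tracked feature locations. With noisy or mismatched features, a robust estimator would normally wrap this least-squares step; that is outside the scope of this sketch.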
An alternative to vision-based camera calibration is to use an instrumented camera, where various sensors read the camera's current position, tilt, and orientation. For example, handheld devices equipped with satellite positioning (GPS) capabilities, a tilt sensor, and a digital compass may apply augmented reality to video taken by their embedded camera and may insert time- and location-sensitive information using fast connectivity to the internet. Such technology is limited by the accuracy of today's GPS units and the quality of the video camera.