A prior automatic method of detection of persons in a video involves the use of ‘histograms of oriented gradients’ (“HOG”) as an effective means for detecting pedestrians within arbitrary still images. Although the method is quite effective at finding people, the approach is very computationally intensive and therefore slow. Moreover, the HOG pedestrian detector must be trained on a large set of manually labeled data. Publicly available implementations trained on images of pedestrians may not be able to handle the more complex poses of sports players and vantage points which were not included in the training database. Additionally, the HOG descriptor only uses the information from a single frame of video. If a continuous action sport is observed from a stationary camera, temporal information is also available.
Background subtraction is a second automatic method for detecting moving objects. In a background subtraction process each pixel in the video frame is compared to its corresponding pixel in the previous frame or to its corresponding pixel in a reference image that models the background scene possibly based on temporal history. Typically, the output of a background subtraction process is a binary background mask indicating if the corresponding pixel in the video frame is a foreground (“1”) or a background (“0”) pixel. Alternatively, a background mask may indicate the probability of a pixel being a foreground pixel and hence assumes continuous values. Various background subtraction embodiments are described in U.S. patent application Ser. No. 12/403,857, incorporated by reference herein in its entirety. Note that other methods in the art may be used for foreground detection. For example, a background mask may be generated from depth information (either from a single structured light camera or via the disparity map from a stereo video pair). Similarly to estimating background appearance from the temporal history, modeling the geometry of the scene through temporal history of the depth information (or from a combination of camera's parameters and scene geometry model) is known in the art. Other modalities such as thermal cameras may be used as well for foreground detection.
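The per-pixel comparison described above can be illustrated with a minimal sketch. It assumes grayscale frames stored as nested lists, and the names `background_mask`, `update_background`, `threshold`, and `alpha` are hypothetical. The running-average update is only one possible way to model the background from temporal history and is not taken from the referenced application.

```python
def background_mask(frame, background, threshold=30):
    """Return a binary mask: 1 where the frame differs from the reference
    (background) image by more than `threshold`, 0 otherwise."""
    return [
        [1 if abs(p - b) > threshold else 0 for p, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


def update_background(background, frame, alpha=0.05):
    """Blend the current frame into the reference image (running average).
    This update rule is an assumption made for illustration."""
    return [
        [(1 - alpha) * b + alpha * p for p, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


# Toy 2x3 grayscale images: the middle column changes strongly.
background = [[100, 100, 100], [100, 100, 100]]
frame = [[102, 180, 99], [100, 175, 101]]
mask = background_mask(frame, background)
print(mask)
```

Dropping the threshold and returning a normalized difference instead would give the continuous-valued (probability-like) mask mentioned above.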
Background subtraction is very efficient as an initial foreground detection step. Although background subtraction is fast, a straightforward implementation of it is also fairly naive and may produce incorrect results. Camera shake, for instance, causes many false foreground detections. This is especially true in high-definition video of outdoor sports, since there is strong contrast between grass and pitch markings. Camera shake, though, may be handled by compensating for vibration when comparing two consecutive frames or when comparing the current frame to a reference (background) image. However, this requires an additional step of image registration that is computationally involved. Another complication is small appearance changes caused by, for example, rain, snow, and shadows cast by players. Shadows may be detected as foreground objects and are difficult to model. Another challenge is to discriminate between foreground and background regions with similar appearance: for example, players' green uniforms may be mistaken for grass, so that their torsos are not detected as foreground. A robust interpretation of background subtraction results is therefore important for reliable foreground detection.
Typically, a sensitivity threshold is used to determine whether a particular pixel is a foreground pixel. Hence, a pixel may be determined to be a foreground pixel based on its lack of similarity to the average of recent previous values or based on its lack of similarity to the corresponding pixel in a reference (background) image. Individual foreground pixels are then clustered into ‘blobs’ by finding the connected components in the binary image (background mask). Since the background mask may be noisy, a single connected component may not correspond to a complete object. Instead, a second clustering of ‘blobs’ into ‘objects’ is often performed. In this step, it is quite difficult to determine automatically (1) how many ‘objects’ exist, (2) which ‘blobs’ are associated with which ‘objects’, and (3) which ‘blobs’ do not correspond to any ‘objects’ (that is, cases where the ‘blobs’ were incorrectly identified as foreground regions during background subtraction).
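The clustering of foreground pixels into ‘blobs’ can be sketched as a connected-components search. The following is an illustrative sketch only, assuming a binary mask stored as nested lists and 4-connectivity; the function name `label_blobs` is hypothetical.

```python
from collections import deque


def label_blobs(mask):
    """Group 4-connected foreground pixels (value 1) into blobs.
    Returns a list of blobs, each a list of (row, col) pixels."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            blob = []
            while queue:
                y, x = queue.popleft()
                blob.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            blobs.append(blob)
    return blobs


mask = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
]
blobs = label_blobs(mask)
print([len(b) for b in blobs])
```

In this toy mask the pixel at (3, 4) touches its neighbor only diagonally, so it is reported as a separate blob. This is a small instance of one object fragmenting into several connected components.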
The association of ‘blobs’ to ‘objects’ is often ambiguous. Therefore, this particular technique of extracting objects from blobs may be unreliable. For instance, the outstretched leg of one player could be considered as the arm of another player. Complex association heuristics may work in some situations, but they tend to fail catastrophically in other circumstances. To improve the robustness of the ‘blobs’ to ‘objects’ association, small ‘blobs’ are often removed in a preprocessing stage of image erosion followed by image dilation. However, this method could easily discard small but correct detections, such as a player on the far side of the field. Additionally, if many small ‘blobs’ are present in the image, the connected components algorithm (which groups pixels into ‘blobs’) may take an excessively long time to process. For instance, when it is raining, the image will be littered with many small false detections. In addition, a single ‘blob’ may contain more than one player.
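The preprocessing stage of erosion followed by dilation can be sketched as follows. This is an illustrative sketch assuming a 3x3 structuring element and a window that is clipped at the image border; the helper names `erode`, `dilate`, and `open_mask` are hypothetical.

```python
def _morph(mask, keep_if):
    """Apply `keep_if` (all or any) to each pixel's 3x3 neighborhood."""
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                mask[y][x]
                for y in range(max(0, r - 1), min(rows, r + 2))
                for x in range(max(0, c - 1), min(cols, c + 2))
            ]
            out[r][c] = 1 if keep_if(window) else 0
    return out


def erode(mask):
    """Keep a pixel only if its whole neighborhood is foreground."""
    return _morph(mask, all)


def dilate(mask):
    """Set a pixel if any pixel in its neighborhood is foreground."""
    return _morph(mask, any)


def open_mask(mask):
    """Erosion followed by dilation: removes blobs smaller than the window."""
    return dilate(erode(mask))


mask = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
opened = open_mask(mask)
for row in opened:
    print(row)
```

The 3x3 block survives, while the isolated pixel at (2, 5) is removed. If that pixel were a distant player rather than noise, the detection would be lost, which is the drawback noted above.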
Once an object's foreground has been detected in the image space, its position at the scene is computed. The 3D ground position X of a player, for instance, is estimated by finding the 2D location x of their feet in the image and mapping this location to the ground plane using a homography H obtained from a full or partial camera calibration: x=HX. Knowledge of the homography, H, allows for one-to-one association of a 3D point at the scene, X, with its corresponding 2D point at image space, x. In order to uniquely associate a point at image space, x, with its corresponding point at the scene, X, the homography is expressed with respect to a specific plane in the 3D world. Specifically, the 3D location of a pixel at the player's head-top can be found based on the homography and knowledge of the player's height (i.e. the 2D plane crossing X=(x,y,h)). Similarly, the 3D location of a pixel at the player's foot can be found based on the homography and knowledge of the player's feet level (i.e. the 2D plane crossing X=(x,y,0)). Hence, in order to map a pixel in image space to its corresponding 3D point at the scene, knowledge of this point's height, for example, is necessary.
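The relation x=HX for the ground plane can be sketched numerically. In the following illustrative example, the matrix `H_ground_to_image` is made up for demonstration and does not come from any real calibration; the function names are hypothetical. Mapping a detected foot pixel back to the ground uses the inverse of H.

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography using homogeneous coordinates."""
    px, py = point
    u, v, w = (row[0] * px + row[1] * py + row[2] for row in H)
    return (u / w, v / w)


def invert_3x3(M):
    """Invert a 3x3 matrix via its adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = M
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if det == 0:
        raise ValueError("homography is singular")
    adjugate = [
        [e * i - f * h, c * h - b * i, b * f - c * e],
        [f * g - d * i, a * i - c * g, c * d - a * f],
        [d * h - e * g, b * g - a * h, a * e - b * d],
    ]
    return [[value / det for value in row] for row in adjugate]


# Hypothetical homography for the z=0 plane: ground (meters) -> image (pixels).
H_ground_to_image = [
    [10.0, 0.0, 320.0],
    [0.0, -5.0, 400.0],
    [0.0, 0.01, 1.0],
]
H_image_to_ground = invert_3x3(H_ground_to_image)

feet_pixel = apply_homography(H_ground_to_image, (2.0, 10.0))
ground_position = apply_homography(H_image_to_ground, feet_pixel)
print(feet_pixel, ground_position)
```

The round trip recovers the ground position only because the point actually lies on the z=0 plane. A head-top pixel pushed through the same inverse would land at the wrong ground location, which is why the plane height must be known.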
A method to interpret background subtraction results on the ground plane (instead of the image plane) involves the realization that the homography which maps the object's foreground to the 3D scene, assuming all pixels are at ground level, is only valid for those pixels which are a projection of a part of the object that actually exists on the ground (such as feet and shadows). This mapping of objects' foreground from the image space to a plane in the 3D scene results in a ground map named an “occupancy map”, which may be an aggregation of mapping results from several cameras positioned at different vantage points. Therefore, a player's feet will consistently map to the same ground position from multiple vantage points, but the upper body will map to different areas. As a result, determining the number of players and their positions in the world is equivalent to finding local peaks in the occupancy map produced by aggregating the mapping of foreground regions detected from multiple vantage points onto the ground plane. In addition to avoiding the heuristics of clustering ‘blobs’, this method also avoids the threshold stage of background subtraction, as no further analysis involves binary image processing. More details about multiview occupancy maps can be found in Khan S. M. and Shah, M., “Tracking multiple occluding people by localizing on multiple scene planes”, Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 31, no. 3, pp. 505-519, March 2009.
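The aggregation and peak search can be sketched on a toy grid. This is an illustrative sketch, not the method of the cited paper: it assumes each camera's foreground likelihood has already been warped onto a common ground grid, sums the per-camera maps, and keeps strict local maxima. The names `aggregate`, `local_peaks`, and `min_support` are hypothetical.

```python
def aggregate(ground_maps):
    """Sum per-camera ground-plane maps cell by cell into one occupancy map."""
    rows, cols = len(ground_maps[0]), len(ground_maps[0][0])
    return [
        [sum(m[r][c] for m in ground_maps) for c in range(cols)]
        for r in range(rows)
    ]


def local_peaks(occupancy, min_support):
    """Return cells that exceed all 8 neighbors and reach `min_support`."""
    rows, cols = len(occupancy), len(occupancy[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            value = occupancy[r][c]
            if value < min_support:
                continue
            neighbors = [
                occupancy[y][x]
                for y in range(max(0, r - 1), min(rows, r + 2))
                for x in range(max(0, c - 1), min(cols, c + 2))
                if (y, x) != (r, c)
            ]
            if all(value > n for n in neighbors):
                peaks.append((r, c))
    return peaks


# Camera A smears the player's body to the right, camera B smears it downward;
# both agree on the feet at cell (1, 1).
camera_a = [[0, 0, 0, 0], [0, 0.9, 0.8, 0.7], [0, 0, 0, 0], [0, 0, 0, 0]]
camera_b = [[0, 0, 0, 0], [0, 0.9, 0, 0], [0, 0.8, 0, 0], [0, 0.7, 0, 0]]
occupancy = aggregate([camera_a, camera_b])
peaks = local_peaks(occupancy, min_support=1.0)
print(peaks)
```

Only the cell where both views agree accumulates enough support to form a peak; the smeared body projections stay below it.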
The insight that a player's feet will map to the same ground location in all camera views does not apply for just the z=0 plane. If the camera is fully calibrated, it is possible to map the image onto any plane. Following a similar logic, the top of a player's head should consistently map to the same (x, y) location on the z=h plane (assuming the player is h meters tall and standing upright). In fact, some part of the player's body will map to the same (x, y) location for any plane 0≦z≦h parallel to the ground. A height-specific occupancy map generated from multiple parallel planes produces local peaks which are much more dominant than the local peaks of an occupancy map produced from just the ground plane. Such occupancy maps have been generated using multiple cameras from different vantage points. In addition to a player's feet, the ground plane will also have consistent mappings between shadows. However, on horizontal planes above the ground, the shadows will not map to consistent locations. As a result, estimating the (x, y) location of the player by fitting a vertical axis through multiple planes of data is much more reliable than simply searching for the midpoint of the feet on the ground plane.
Mapping to multiple planes is more computationally expensive. In the continuous case, the integration over an infinite number of planes can be projected back into the image. Essentially, this corresponds to pre-computing bounding convex hulls in the image plane for every (x, y) location on the ground. Summing the number of foreground pixels within each convex hull is also an expensive computation. However, if the image is warped such that convex hulls become rectangles, the computation can be optimized using integral images. Although one can approximate the convex hulls as rectangles, a warp which sends the vertical vanishing point to infinity accomplishes the necessary rectification of the convex hulls, and is equivalent to tilting the camera so that it is level. However, if the camera is horizontal, the precision at which the (x, y) location can be estimated is greatly reduced. For high vantage points the image may be warped such that the optical axis of the camera is perpendicular to the ground. In this perspective, the ability to localize objects on the ground plane is optimal, but the ability to identify objects of height h is minimal. As a result, the integral image optimization requires many views to get accurate positions of objects which are h meters tall. For large outdoor playing areas, many cameras will be needed, and a significantly high vantage point for an approximate overhead view may not be possible. Additionally, since simultaneous access to the raw pixel data from every camera is needed to localize players, there is a high bandwidth requirement for real-time analysis, as all data must be analyzed at a single location.
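The integral image optimization mentioned above can be sketched as follows. This is an illustrative sketch assuming the mask has already been rectified so that each region of interest is an axis-aligned rectangle; the names `integral_image` and `box_sum` are hypothetical.

```python
def integral_image(mask):
    """Build a summed-area table with one extra row and column of zeros."""
    rows, cols = len(mask), len(mask[0])
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += mask[r][c]
            table[r + 1][c + 1] = table[r][c + 1] + row_sum
    return table


def box_sum(table, top, left, bottom, right):
    """Sum of mask pixels in rows top..bottom-1 and columns left..right-1,
    using four table lookups regardless of rectangle size."""
    return (
        table[bottom][right]
        - table[top][right]
        - table[bottom][left]
        + table[top][left]
    )


mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
]
table = integral_image(mask)
player_box = box_sum(table, 0, 1, 3, 3)   # rows 0-2, columns 1-2
whole_image = box_sum(table, 0, 0, 4, 4)
print(player_box, whole_image)
```

After a single pass to build the table, the foreground count for any candidate rectangle costs four lookups, which is what makes evaluating a rectangle for every ground location affordable.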
The above approaches have exhibited numerous disadvantages. Those relying on a single-camera approach have been either slow (HOG) or unreliable (“blob” clustering). Other approaches have relied on the fusing of all the video data from multiple cameras simultaneously, in order to provide a central processing location with simultaneous access to all pixels from all cameras. Performing player detection on this basis requires significant bandwidth; in fact, the bandwidth of gigabit Ethernet limits these approaches to only two or three high definition (HD) cameras. A disadvantage with this fusion approach is that it does not scale well to a large number of cameras, since all the pixel data must be transmitted to a central location for processing. It would be advantageous, in order to ensure greater player detection accuracy and avoid false detections, if a non-fusion method for detecting players in real-time with minimal latency (for example, latency of one frame) could be scaled to a significantly larger number of cameras.