Tracking objects as they move requires a probabilistic model of object motion so that a tracking algorithm can select the most realistic trajectory from a set of hypotheses. When multiple objects are tracked concurrently, prior data is used to evaluate the feasibility of a set of simultaneous trajectories for all of the tracked objects. Conventionally, most multi-object tracking algorithms make a drastic simplification to keep the inference problem tractable: each object is evaluated in isolation, so that each object is tracked independently of the others. However, object motions are not always independent. In team sports, for example, a player's movements are highly correlated with those of both nearby and distant players. Evaluating each trajectory in isolation, as the conventional method does, is therefore highly inaccurate. However, it is also quite difficult to determine and optimize a complex model that describes all possible interactions between players.
Multi-target tracking has been a difficult problem of broad interest in the technical field of computer vision. Surveillance is a common scenario in which multi-target tracking is utilized. Team sports are another popular domain for multi-target tracking, with a wide range of applications in strategy analysis, automated broadcasting, and content-based retrieval. Recent developments in pedestrian tracking have formulated multi-target tracking in terms of data association. For example, a set of potential target locations is estimated in each frame using an object detector, and target trajectories are inferred by linking similar detections (or tracklets) across frames. However, if complex inter-tracklet affinity models are used, the association problem quickly becomes non-deterministic polynomial-time hard (NP-hard).
Recent success in pedestrian tracking has come from posing multi-target tracking as a data association problem. That is, long object trajectories are found by linking together a series of detections or short tracklets. Conventionally, the problem of associating tracklets across time has been addressed using a variety of methods such as the Hungarian algorithm, linear programming, cost-flow networks, maximum weight independent sets, continuous-discrete optimization, higher-order motion models, etc. Data association is often formulated as a linear assignment problem in which the cost of linking one tracklet to another is some function of extracted features (e.g., motion and appearance). Other conventional methods consider more complex association costs.
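By way of illustration only, the linear assignment formulation described above may be sketched as follows. The sketch assumes a hypothetical tracklet representation (a dictionary holding a position, a velocity, and a single scalar appearance feature) and a hypothetical cost that sums a motion term and an appearance term; neither is taken from any particular conventional method. For brevity, the minimum-cost assignment is found by exhaustive search over permutations, which stands in for the Hungarian algorithm and is practical only for a handful of tracklets.

```python
import math
from itertools import permutations


def link_cost(tail, head, dt=1.0, w_motion=1.0, w_appearance=1.0):
    """Cost of linking the end of one tracklet (tail) to the start of another (head).

    The dictionary keys 'pos', 'vel' and 'color' are illustrative assumptions.
    """
    # Motion term: distance between a constant-velocity prediction and the head position.
    pred_x = tail["pos"][0] + tail["vel"][0] * dt
    pred_y = tail["pos"][1] + tail["vel"][1] * dt
    motion = math.hypot(head["pos"][0] - pred_x, head["pos"][1] - pred_y)
    # Appearance term: absolute difference of a scalar appearance feature.
    appearance = abs(tail["color"] - head["color"])
    return w_motion * motion + w_appearance * appearance


def best_assignment(cost):
    """Exhaustively find the minimum-cost one-to-one assignment for a small square matrix."""
    n = len(cost)
    best_perm, best_total = None, math.inf
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total


# Two tracklets ending in one frame and two tracklets starting in the next frame.
tails = [
    {"pos": (0.0, 0.0), "vel": (1.0, 0.0), "color": 0.2},
    {"pos": (0.0, 5.0), "vel": (1.0, 0.0), "color": 0.8},
]
heads = [
    {"pos": (1.0, 5.0), "color": 0.8},
    {"pos": (1.0, 0.0), "color": 0.2},
]

cost = [[link_cost(t, h) for h in heads] for t in tails]
assignment, total = best_assignment(cost)
# assignment[i] is the index of the head tracklet linked to tail tracklet i.
print(assignment, total)
```

Because each entry of the cost matrix depends on only one pair of tracklets, the problem remains a linear assignment and can be solved in polynomial time; costs that couple several links at once do not fit this form.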
Crowds are an extreme case of pedestrian tracking in which it is often not possible to see each individual in their entirety, if at all. Because of congestion, pedestrian motions are often quite similar, and crowd tracking algorithms typically estimate a finite set of global motions. Often, the affinity for linking two tracklets together depends on how well the hypothesized motion agrees with one of the global motions. Conventional approaches solve tracking in crowded, structured scenes with floor field estimation and a Motion Structure Tracker (MST), respectively, while another conventional approach uses a Correlated Topic Model (CTM) for crowded, unstructured scenes.
Although more complex approaches have been devised, simple, independent motion models have been popular for pedestrian tracking because they limit the complexity of the underlying inference problem. However, these models may not always characterize the motion affinity between a pair of tracklets accurately. A conventional approach models inter-target correlations between pedestrians using context, which consists of additional terms in the data association affinity measure based on the spatiotemporal properties of tracklet pairs (e.g., a pedestrian may deviate from a constant velocity trajectory if he/she anticipates colliding with another pedestrian). Just as individual target motions differ between surveillance and team sports, context in team sports (e.g., the current game situation) is more complex and dynamic than context in surveillance. For example, teams will frequently gain and lose possession of the ball, and the motions of all players often change drastically when this occurs. Accordingly, applying the conventional approaches to context in team sports does not provide comparably accurate results, particularly in view of the complexity that team sports introduce.
Tracking players in team sports has three significant differences compared to pedestrians in surveillance. First, the appearance features of detections are less discriminative due to players on a common team being visually similar (e.g., same uniform). With regard to tracking, the distinguishing characteristics between teammates are primarily position and velocity. Second, pedestrians tend to move along straight lines at constant speed whereas sports players move in more erratic manners. Third, although pedestrians deviate to avoid colliding with each other, the motions between pedestrians are rarely correlated in complex ways (e.g., some scenarios like sidewalks may contain a finite number of common global motions). On the other hand, the movements of sports players are strongly correlated both locally and globally. For example, opposing players may exhibit strong local correlations when “marking” each other (e.g., one-on-one defensive assignments). Similarly, players who are far away from each other move in globally correlated ways because they are reacting to the same ball.
Conventional approaches to multi-target tracking in team sports utilize algorithms based on particle filters. However, results are quite often demonstrated only on short sequences (e.g., less than two (2) minutes). Other conventional approaches generate a Bayes network of splitting and merging tracklets for a long ten (10) minute soccer sequence to find the most probable assignment of player identities using max-margin message passing.
In both pedestrian and player tracking, object motions are often assumed to be independent and are modeled as zero displacement (for erratic motion) or constant velocity (for smooth motion governed by inertia). In reality, the locations and motions of players are strongly correlated. Pair-wise repulsive forces have been used in multi-target tracking to enforce separability between objects. Conventional approaches use multi-object motion models in pedestrian tracking to anticipate how people will change their trajectories to avoid collisions or to estimate whether a pair of trajectories have correlated motions. In team sports, a conventional approach estimates motion fields using the velocities of tracklets to anticipate how the play will evolve but does not use the motion fields to track players over long sequences. In yet another conventional approach, the standard independent autoregressive motion model is augmented with a database of a priori trajectories manually annotated from other games.
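By way of illustration only, the two independent motion models mentioned above may be sketched as follows. The function names, the time step, and the Gaussian form of the affinity (with an assumed scale parameter sigma) are hypothetical choices made for this sketch and are not drawn from any particular conventional approach. Each prediction uses only the state of a single object, which is what makes the models independent.

```python
import math


def predict_zero_displacement(pos, vel, dt):
    """Zero-displacement model: the object is expected to stay where it was last seen."""
    return (pos[0], pos[1])


def predict_constant_velocity(pos, vel, dt):
    """Constant-velocity model: the object is expected to continue along its last velocity."""
    return (pos[0] + vel[0] * dt, pos[1] + vel[1] * dt)


def motion_affinity(predicted, observed, sigma=1.0):
    """Assumed Gaussian affinity: 1.0 for a perfect match, decaying with squared distance."""
    dx = observed[0] - predicted[0]
    dy = observed[1] - predicted[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))


# Last known state of one object and a candidate detection two time units later.
pos, vel, dt = (2.0, 3.0), (1.0, -0.5), 2.0
candidate = (4.0, 2.0)

for model in (predict_zero_displacement, predict_constant_velocity):
    predicted = model(pos, vel, dt)
    print(model.__name__, predicted, motion_affinity(predicted, candidate))
```

In this example the candidate detection lies exactly on the constant-velocity prediction, so that model assigns it the highest possible affinity, whereas the zero-displacement model assigns it a lower affinity. Neither model takes into account the states of other objects.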
Because a player's movement is influenced by multiple factors, the conventional multi-target tracking formulation using a set of independent autoregressive motion models is a poor representation of how sports players actually move. Furthermore, motion affinity models that involve multiple targets (and that do not decompose into a product of pairwise terms) make the data association problem NP-hard.