The present invention relates to markerless methods for tracking the motion of a subject within a 3-dimensional (3D) space. The subject can be a human form or any form that can be modeled by a structure comprised of rigid linked segments, referred to in the art as linked kinematic chains. Linked kinematic chains are representations of non-deformable segments of the form being tracked. For example, a model of a human form could comprise rigid geometric segments, ellipsoids and cylinders, representing the head, torso, and limbs, mechanically interconnected by flexible joints that provide for rotation and angular positioning between the segments. Examples of subjects to which the method is applicable include, but are not limited to, humans, robots, laboratory animals, construction cranes and mechanical arms. The workspace within which the subject is moving is divided into a plurality of smaller volume elements, or voxels. The subject to be tracked is viewed by a system of multiple video imaging cameras whose outputs are digitally acquired and processed on a personal computer. The method extracts silhouettes of the subject in each of the cameras' views using background subtraction methods. A method is described for intersecting the multiple silhouettes to create a three-dimensional volumetric data set, comprising a listing of the voxels within the workspace that are occupied by the subject. Volumetric data is collected in real time on a frame-by-frame basis. As the subject's motion is tracked from image frame to image frame, subsequent volumetric data indicates the new location of the subject as occupied voxels lying outside the previously computed location of the model. This new voxel data set is used to calculate virtual forces exerted on the model, which act to align the model with the new voxel data set. This novel, physics-based tracking approach reduces the number of local minima that have hindered prior methods.
The linked nature of the model additionally provides for tracking of partially occluded limbs. The model in the current method is dynamic, and size parameters of the linked rigid segments are determined automatically by a growing process during an initialization phase. A similar growing process is used to reacquire tracking if portions of the body are lost during tracking due to occlusion or noise. The dynamic model also incorporates velocity, joint angle, and self-collision limits. In an example embodiment, an imaging system with four cameras and a single PC operates the method in real-time at about 18 frames per second. The method describes a process for extracting volumetric data from a video imaging system, using the data to initialize the dimensions and initial positioning of a model template, and an iterative process for aligning the model to subsequent data sets, acquired from each frame of the image, as the subject moves through a workspace. Three-dimensional volumetric data is used to calculate forces acting on the model in three dimensions, which provides for a substantial improvement in tracking accuracy over prior methods performing force calculations in two dimensions, with or without multiple cameras.
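The idea of occupied voxels exerting virtual forces on a rigid segment can be illustrated with a minimal Python sketch. The spring-like force law, the representation of a segment as a line between two end points, and all function and parameter names (virtual_force_and_torque, stiffness, and so on) are assumptions chosen for illustration only; the description above does not specify how the forces are computed.

```python
# Hypothetical sketch: voxels pull a rigid segment with spring-like forces.
# The force law and all names here are illustrative assumptions.

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def add(a, b):
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def scale(a, s):
    return (a[0] * s, a[1] * s, a[2] * s)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def closest_point_on_segment(start, end, point):
    """Return the point on the segment start-end that is nearest to point."""
    axis = sub(end, start)
    t = dot(sub(point, start), axis) / dot(axis, axis)
    t = max(0.0, min(1.0, t))  # clamp to the segment's extent
    return add(start, scale(axis, t))

def virtual_force_and_torque(start, end, voxel_centers, stiffness=1.0):
    """Sum the pulls of all voxels into a net force and a net torque.

    Each voxel pulls the nearest point of the segment toward itself.
    The torque is taken about the midpoint of the segment.
    """
    center = scale(add(start, end), 0.5)
    force = (0.0, 0.0, 0.0)
    torque = (0.0, 0.0, 0.0)
    for voxel in voxel_centers:
        contact = closest_point_on_segment(start, end, voxel)
        pull = scale(sub(voxel, contact), stiffness)
        force = add(force, pull)
        torque = add(torque, cross(sub(contact, center), pull))
    return force, torque

# A voxel beside one end of the segment both pushes and rotates it.
force, torque = virtual_force_and_torque(
    (0.0, 0.0, 0.0), (2.0, 0.0, 0.0), [(2.0, 1.0, 0.0)])
print(force, torque)
```

In this toy case a single voxel next to one end of the segment produces both a net force and a net torque, whereas voxels placed symmetrically at both ends produce a pure translation.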
In the following description of the invention, the method is applied to an embodiment for tracking human motion within a workspace. It will become obvious to one skilled in the art that the method is generally applicable to tracking the motion of any subject that can be modeled as rigid linked segments. Additional benefits of modeling the subject as a series of linked kinematic chains, as opposed to methods using a deformable model, will become apparent. An example of a method incorporating a deformable model can be found in U.S. Pat. No. 6,574,353 to Schoepflin et al.
Video based 3D motion tracking is an active research topic. An overview of the field and this invention, including related references, is provided in J. P. Luck, “Real-Time Markerless Human Motion Tracking Using Linked Kinematics Chains”, Doctor of Philosophy Thesis, Department of Engineering Systems, Colorado School of Mines, 2003, which is herein incorporated by reference in its entirety.
The need to improve man-machine interaction has made real-time markerless 3D motion tracking a highly valued goal. Current motion capture systems and methods require the subject to attach either magnetic or optical markers to their body and may even require cables extending from the markers to the system. A markerless method that does not require the user to wear specialized equipment would be more convenient, less constraining, more desirable, and in many cases necessary for the multitude of applications of such a process. These applications include:
Surveillance: In many surveillance situations such as access control, parking lot monitoring, ATM monitoring, or monitoring of publicly displayed valuables, it is helpful to know not only whether humans are present but, more importantly, what they are doing.
Motion analysis: Motion analysis is extremely helpful in medical and biomechanical applications for the analysis of both injuries and the rehabilitation of injuries. Motion analysis has also become prevalent in sport applications. Analysis of a swing can greatly improve a player's performance in golf, tennis, baseball, etc. A tracking system could also provide useful information in dance or ballet.
Virtual reality and video games: Tracking the human form is necessary to create interactive virtual environments. A tracking system could also be used for video games in which the user actually performs the motions of his or her screen character.
Teleconferencing: A virtual 3D teleconference could occur in real-time, because of the low bandwidth associated with transmitting joint angles.
Human robot interaction: Utilizing gesture control of systems, this type of tracking system would greatly enhance human-robot interaction. In industry, safety constraints typically do not allow humans to enter spaces where large robots are performing tasks. However, if the robot control system had knowledge of where the person was and what he or she was doing, the two could work together to perform tasks.
Advanced user interfaces: If virtual interfaces are to be designed, the user must be tracked so that the system will know which virtual controls are being manipulated.
This vast array of applications presents many challenges to motion tracking. A system must be fast enough to collect data and track the subject in real-time. The system architecture must be relatively inexpensive and simple to be used for most applications; i.e., a system requiring expensive cameras and/or multiple computer architectures is of little use. The system must also be able to acquire the model on its own, and track unconstrained movement without the use of markers or specialized costumes.
Due to the growing number of applications for such a system, there has been a fair amount of interest in this field. For an excellent review of the state of the art of visually based motion tracking, see: D. M. Gavrila, “The Visual Analysis of Human Movement: A Survey”, Computer Vision and Image Understanding, Vol. 73, No. 1, January 1999, pp. 82-98. Gavrila's review groups work within the art of video based motion tracking into four categories:
2D tracking approaches
3D tracking approaches using a single viewpoint
3D tracking approaches using multiple viewpoints
3D tracking approaches using 3D data.
Of course, some approaches may only slightly correspond to a particular category or may overlap two or more categories, but Gavrila provides a general framework for distinguishing the work and drawing comparisons between approaches.
2D Tracking Approaches
The simplest process for acquiring the movement of a human in 2D is to assume that the movement can be accurately described with low-level 2D features, and hence there is no need for a model of the body. These techniques usually superimpose a grid over the region of interest, and extract low-level features for each region of the grid, such as normal flow, the number of foreground pixels, or an averaged edge vector. These techniques are adequate for many gesture interpretation tasks. However, this type of approach is not accurate or robust enough for detailed tracking of a human model, mainly due to its inability to deal with occlusion.
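As a rough illustration of the grid-based features described above, the following Python sketch counts foreground pixels in each cell of a grid superimposed on a binary silhouette mask. The function name grid_foreground_counts and the representation of the mask as a list of rows of 0/1 values are assumptions made for this example and are not taken from any of the cited work.

```python
def grid_foreground_counts(mask, rows, cols):
    """Count foreground pixels in each cell of a rows x cols grid.

    mask is a list of equal-length rows holding 0 (background) or
    1 (foreground). Returns a rows x cols list of integer counts.
    """
    height = len(mask)
    width = len(mask[0])
    counts = [[0] * cols for _ in range(rows)]
    for y in range(height):
        for x in range(width):
            if mask[y][x]:
                # Map the pixel to the grid cell that contains it.
                counts[y * rows // height][x * cols // width] += 1
    return counts

silhouette = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
print(grid_foreground_counts(silhouette, 2, 2))  # [[3, 0], [0, 3]]
```

A feature vector of this kind records only how much foreground falls in each cell; it carries no notion of which body segment produced the pixels, which is consistent with the limitation noted above regarding occlusion.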
More accuracy is achieved by using a simple body model. Many of these approaches use a simple model to identify the locations of segments in the image silhouette or outline. Some simply try to label segments using models based on: the order of segments, heuristics and learning of normal body outlines, generic postures of a human, a stick figure model of a human, or protrusions in a human outline. Others actually track segments in the silhouette using Kalman filtering, or simply track the contours of the human outline. While these approaches demonstrate the many ways a simplistic body model can be utilized in tracking, they only provide a coarse estimate of segment locations.
In order to gain further accuracy in 2D approaches, many researchers migrated to more descriptive models of the body. The motivation was that by expanding the model to include the size, shape, and connectivity of the subject, many ambiguities could be resolved. These techniques use a wide variety of data in their tracking and create a model to correspond to the data source. Silhouette-based techniques incorporated the size, shape, and connectivity of segments to label their locations in the image, or tracked models consisting of either anti-parallel lines or “ribbons” (the 2D analog of a generalized cone). Optical flow based techniques align a 2D planar model to optical flow data. Another technique tracks a model consisting of blobs using motion and other low-level features, such as color and shape. A degree of robustness is added by incorporating a more descriptive human model. However, since the human is a three-dimensional object, its shape changes depending upon its orientation to the camera; hence accurate 2D models are difficult to create. The problem is further complicated because segments move in and out of occlusion. One team attempted to mitigate this problem by switching between camera views, see: Cai, Q. and J. K. Aggarwal, “Tracking Human Motion in Structured Environments Using a Distributed-Camera System”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 21, No. 11, pp. 1241-1248. However, there are many poses in which relevant information cannot be found for all body segments from a single viewpoint. Another method using 2D data is described in U.S. Pat. No. 6,492,986 issued to Metaxas et al. Again, the difficulties encountered in applying 2D data to a 3D model are inherent in this approach, and it also describes a deformable model.
2D tracking has shown the ability to interact with computer systems as long as segments are in view of the camera. However in his review Gavrila concludes that 2D approaches are only effective for applications where precise pose recovery is not necessary, and most applications actually require the 3D pose of the subject. Consequently researchers have strived to extend tracking to 3D.
3D Tracking Approaches using a Single Viewpoint
A large portion of the research in single viewpoint 3D tracking endeavors to simply extend the methods used for 2D analysis. Researchers again attempted to track features on the human to establish a 3D pose. However, joint locations show no characteristic intensity function; hence their locations are usually found indirectly by segmenting adjoining segments, which is extremely difficult and again complicated by occlusion. Accordingly, many researchers extend 2D outline-based techniques to a 3D model. Researchers also attempted to employ optical flow to track the subject. Both approaches are usually performed by projecting the 3D model to the 2D image, and adjusting the model pose to fit the silhouette or optical flow data. However, problems can occur when alignment forces are calculated in the 2D image and applied to a 3D model. 2D forces cannot account for changes in the model's appearance when it twists about an axis that is not perpendicular to the image plane; hence the resulting model forces are often incorrect. Also, optical flow can only be calculated within the image plane, and alignment forces perpendicular to the image plane are very difficult to calculate from silhouette data; hence the process has problems estimating movement towards or away from the camera.
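The difficulty of recovering depth from a single image can be seen in a small numerical sketch. The Python fragment below assumes an idealized pinhole camera at the origin looking along the z axis; the function name project and the focal_length parameter are illustrative assumptions. It shows that a point displaced along its viewing ray lands on exactly the same image coordinates, so a force computed in the image plane has no component that could correct such a displacement.

```python
def project(point, focal_length=1.0):
    """Project a 3D point (x, y, z) onto the image plane of an ideal
    pinhole camera located at the origin and looking along +z."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera")
    return (focal_length * x / z, focal_length * y / z)

near_point = (1.0, 2.0, 4.0)
far_point = (2.0, 4.0, 8.0)  # same viewing ray, twice as far away

print(project(near_point))  # (0.25, 0.5)
print(project(far_point))   # (0.25, 0.5)
```

The two points are several units apart in 3D yet indistinguishable in the image, which is one way to see why a single viewpoint struggles with movement towards or away from the camera.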
Researchers also incorporated additional data sources into the above techniques to improve tracking results. Additional data sources include: locating the head and hands using skin color, using an occlusion map to indicate when a particular body segment is occluded, propagating constraints through the linked model, or using color and inverse kinematics to refine the hand position. Including additional data sources adds robustness because when one data source is insufficient to establish the pose, another can help.
Since all of these optimization schemes are susceptible to local minima, others have introduced the idea of multiple hypothesis tracking to find the correct solution. In this case, an optimization is run from more than one starting point using directions of high uncertainty in a Hessian matrix. While this addition greatly slows the process, it achieves some robustness to local minima, which is lacking in most approaches. The process works by restarting the optimization from several different poses and choosing the solution with the lowest error score. Because of the difficulty in finding reasonable starting poses and because the optimization must be rerun for each starting pose, the technique has not been made efficient enough to run in real-time.
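The restart-and-select strategy described above can be sketched in a few lines of Python. This is a deliberately simplified illustration: the one-dimensional pose_error function is a toy with one local and one global minimum, the starting points are fixed by hand rather than chosen from directions of high uncertainty in a Hessian matrix, and the names local_descent and multi_start_minimize are hypothetical.

```python
def local_descent(error, start, step=0.01, iterations=2000, eps=1e-6):
    """Plain gradient descent using a central-difference gradient."""
    x = start
    for _ in range(iterations):
        gradient = (error(x + eps) - error(x - eps)) / (2 * eps)
        x -= step * gradient
    return x

def multi_start_minimize(error, starts):
    """Run the local optimizer from every start and keep the best result."""
    candidates = [local_descent(error, s) for s in starts]
    return min(candidates, key=error)

def pose_error(x):
    """Toy error with a local minimum near x = 1 and a global one near x = -1."""
    return (x * x - 1) ** 2 + 0.3 * x

single = local_descent(pose_error, 0.5)               # trapped in the local minimum
best = multi_start_minimize(pose_error, [-2.0, 0.5, 2.0])
print(single, pose_error(single))
print(best, pose_error(best))
```

The cost of the approach is visible even in this toy: the optimizer runs once per starting point, so the work grows linearly with the number of hypotheses, which is consistent with the real-time difficulty noted above.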
Although these methods represent some improvement, such as applying forces to adjust a model, combining data sources, and incorporating multiple hypotheses, three major problems inhibit single viewpoint techniques. First, the problem of dealing with occlusion in a single viewpoint has not been robustly solved. For example, if a person is turning around, these processes will not be able to track any limbs behind the body. Although techniques that switch between cameras ease this problem, in many poses there is no single viewpoint from which one can accurately see all of the limbs. Secondly, 2D forces cannot account for changes in the model's appearance when it twists about an axis that is not perpendicular to the image plane. Lastly, even when a single viewpoint method is able to resolve the correct orientation of the articulated models, it still has difficulty establishing the actual 3D position. Many of these methods must know the actual size of the model to extrapolate the 3D position, or they estimate the 3D position from a pre-calibrated CAD model of the room, or they attempt to resolve the 3D position by tracking points over multiple frames. Hence, single viewpoint techniques are not robust enough to track unconstrained movements.
3D Tracking Approaches using 2D Data Obtained from Multiple Viewpoints
To avoid occlusion and resolve the depth of the subject, many researchers have moved to systems that obtain 2D data from several disparate viewpoints so that multiple sides of a subject can be seen at the same time. These approaches either combine fitting performed in multiple 2D images to get 3D, or calculate forces/adjustments in 2D and apply them to a 3D model.
Many approaches again center on the silhouette or body outline, or extend optical flow based tracking to multiple viewpoints. As in single view approaches, several methods have been developed that combine data sources to track blobs representing each segment. Lastly, others have employed a multi-hypothesis approach, which again adds robustness but is not able to operate in real-time. All of these techniques represent improvements; however, problems often occur when transforming from the 2D image to the 3D world.
First, when fitting is performed in each 2D image and combined to get 3D, any error in the 2D fitting will propagate into the 3D result. Hence these approaches are subject to the same faults as their 2D counterparts. For instance if limbs are partially occluded in a viewpoint, the process may still attempt to fit the limb and do so incorrectly. When the results are combined from each 2D view these erroneous measurements will propagate into the 3D result. To avoid this problem the process would have to decide when an accurate fit cannot be obtained in a viewpoint, which is an extremely difficult problem. Then the process would have to rely on the measurements in the other viewpoints to obtain the 3D result, which may not be sufficient to resolve all ambiguities. Hence the approach must either attempt to estimate a 2D pose from incomplete information in a particular viewpoint, or ignore any information in that viewpoint and attempt to resolve ambiguities from fewer viewpoints.
Secondly, the inclusion of multiple viewpoints does not remove the problems seen in single viewpoint techniques of applying forces calculated in 2D to a 3D model. 2D forces or adjustments cannot account for changes in the model's appearance when it twists about an axis that is not perpendicular to the image plane; hence the resulting model forces are often incorrect.
Lastly, combining information from multiple viewpoints is not a straightforward problem. Combining 2D poses to obtain a 3D pose is complicated because small changes in one viewpoint could account for a large change in others. It is therefore unclear how to find the best 3D pose from multiple 2D estimates when they do not align perfectly, which will be the norm for real world data. Similarly, when combining tracking information (adjustments or forces) from multiple 2D viewpoints, one must decide how to combine conflicting information from separate viewpoints.
3D Tracking Approaches using 3D Data
Due to the problems of performing 3D fitting from 2D images, transforming 2D information to the 3D world, and combining conflicting information from multiple viewpoints, it is preferred to work directly with 3D data. Several research teams use 3D data taken with a range finder or from stereo cameras. In particular, one method of combining stereo points with silhouette data yields extremely accurate results, see: Plankers, R. and P. Fua, “Tracking and Modeling People in Video Sequences”, Computer Vision and Image Understanding Vol. 81, No. 3, pp. 285-302. This method fits a model made up of “metaballs”, which change shape to account for movement of the muscles. The technique cannot operate in real-time and requires a user to help extract the silhouette, but again shows robustness through combining multiple data sources. However, information from more than one viewpoint must be incorporated to overcome occlusion.
Others have applied constraint fusion to 3D data, see: Badler, N. I., et al., “Articulated Figure Positioning by Multiple Constraints”, IEEE Computer Graphics and Applications, Vol. 7, No. 6, pp. 28-38. The tracking process relies on feature detectors, which are unreliable, and again can require combining conflicting information from multiple feature trackers. However, the methods do demonstrate the value of including constraints in the tracking process.
In other methods, 3D points exert pulls on deformable models that are connected through linking constraints, see Kakadiaris, I. A. et al., “Active Part-Decomposition, Shape and Motion Estimation of Articulated Objects: A Physics Based Approach”, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 1994, Seattle, Wash., and Metaxas, D. et al., “Shape and Nonrigid Motion Estimation Through Physics-Based Synthesis”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 15, No. 6, pp. 590-591. These pulls adjust both the model shape and orientation through a Kalman filter framework, which incorporates the Jacobian matrix. Since the pulls are calculated in 3D, the problems of transforming a 2D force to a 3D model are avoided. The concept of using the data to pull the model into alignment works well if data is taken fast enough that the change between data sets is small. The methods also demonstrate how the Jacobian matrix can be used to simplify torque calculations on a linked model. However, several limitations of the methods must be overcome to obtain robust real-time tracking. First, using the Jacobian with a linked model usually implies singularities, which must be removed (see below). Second, data acquisition must also occur in real-time if tracking is to occur in real-time. Third, allowing the model to change shape with each data set can cause errors. If the model is allowed to conform to each data set and, for example, data along a limb becomes sparse for several frames, the limb model will shrink and become erroneously small, which can affect tracking in subsequent frames. When data along the limb returns (perhaps only a few frames later), the new points may project onto the body instead of the shrunken limb model and cause further errors.
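The role of the Jacobian in torque calculation, and the singularities it brings, can be illustrated on the simplest linked model. The Python sketch below uses a planar two-link arm, which is an assumption chosen for brevity and is not the model of any cited method; the function names are likewise hypothetical. Joint torques follow from the standard relation that maps a force F at the end of the chain to joint torques through the transpose of the Jacobian J, and the determinant of J vanishes when the arm is fully extended.

```python
import math

def planar_jacobian(theta1, theta2, l1=1.0, l2=1.0):
    """Jacobian of the end-point position of a planar two-link arm
    with respect to its two joint angles."""
    s1, c1 = math.sin(theta1), math.cos(theta1)
    s12, c12 = math.sin(theta1 + theta2), math.cos(theta1 + theta2)
    return [[-l1 * s1 - l2 * s12, -l2 * s12],
            [l1 * c1 + l2 * c12, l2 * c12]]

def joint_torques(jacobian, force):
    """Map an end-point force (fx, fy) to joint torques via J transpose."""
    fx, fy = force
    return [jacobian[0][j] * fx + jacobian[1][j] * fy for j in range(2)]

def determinant(jacobian):
    return (jacobian[0][0] * jacobian[1][1]
            - jacobian[0][1] * jacobian[1][0])

# Elbow bent by 90 degrees: the end point sits at (1, 1).
bent = planar_jacobian(0.0, math.pi / 2)
print(joint_torques(bent, (0.0, 1.0)))   # close to [1.0, 0.0]
print(determinant(bent))                 # close to 1.0

# Fully extended arm: the Jacobian is singular.
straight = planar_jacobian(0.0, 0.0)
print(determinant(straight))             # 0.0
```

In the bent configuration an upward pull on the end point produces torque only at the first joint, since the pull is directed along the second link. In the extended configuration the determinant is zero, meaning there is a direction of end-point motion that no combination of joint velocities can produce, which is the kind of singularity referred to above.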
Approaches by others have used shape-from-silhouette sensors to acquire 3D volume data, see Cheung, K. M., et al., “A Real Time System for Robust 3D Voxel Reconstruction of Human Motions”, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2000, Hilton Head, S.C., and Bottino, A. et al., “A Silhouette Based Technique for the Reconstruction of Human Movement: A Multiview Approach”, June 1996, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, Calif. The volumetric 3D data is acquired by intersecting silhouettes from multiple viewpoints. Although the tracking reported for these methods is poor, the work of Cheung et al. demonstrates that shape-from-silhouette data acquisition and tracking can both occur in real-time. Bottino's process is not able to run in real-time but provides usable results using synthetic data.
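Silhouette intersection can be sketched as follows in Python. This is a toy illustration under strong assumptions: the cameras are idealized orthographic views along two axes rather than calibrated perspective cameras, silhouettes are given as sets of foreground pixel coordinates, and the name carve_voxels is hypothetical. It is not the procedure of Cheung et al. or Bottino et al. A voxel is kept only if its projection falls inside the silhouette in every view.

```python
def carve_voxels(grid_size, views):
    """Return the voxels whose projections lie inside every silhouette.

    views is a list of (project, silhouette) pairs, where project maps
    a voxel index (i, j, k) to a pixel (u, v) and silhouette is the set
    of foreground pixels seen by that camera.
    """
    occupied = []
    for i in range(grid_size):
        for j in range(grid_size):
            for k in range(grid_size):
                voxel = (i, j, k)
                if all(project(voxel) in silhouette
                       for project, silhouette in views):
                    occupied.append(voxel)
    return occupied

# Two idealized orthographic cameras (an assumption of this sketch).
front_view = (lambda v: (v[0], v[1]), {(1, 1), (1, 2)})  # looks along z
side_view = (lambda v: (v[1], v[2]), {(1, 0), (2, 0)})   # looks along x

print(carve_voxels(3, [front_view, side_view]))  # [(1, 1, 0), (1, 2, 0)]
```

Each additional view can only remove voxels from the result, so the reconstructed volume is an outer bound on the true shape that tightens as viewpoints are added.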
Limitations of Previous Work:
Need for 3D data utilization: Techniques that do not use 3D data for tracking have inherent problems. Approaches using 2D data cannot overcome occlusion, have problems transforming 2D forces to a 3D model, and have difficulty establishing the actual 3D position or distance from the camera. Multiple viewpoint 2D techniques have problems performing 3D fitting in 2D, transforming 2D information to the 3D world, and combining information from multiple viewpoints. Hence methods using 3D data are needed for development of a tracking system that can meet the requirements of all of the previously listed applications. However, several limitations of the reviewed 3D data approaches must be solved. The system must acquire data from several viewpoints, and do so in real-time without requiring markers or specialized clothing. Singularities must be removed if the Jacobian is utilized to relate changes in the data to changes in the model. Lastly, the model should not be allowed to change shape with each data set, but still must be robust enough to deal with varying data.
The model acquisition issue: Much of the previous work assumes that the 3D model is fully specified a priori and only addresses the pose recovery problem. In practice, the 3D model is parameterized by various shape parameters that need to be estimated from the images. Some work has dealt with the issue by decoupling the model acquisition and pose recovery, i.e. by requiring a separate initialization stage where either known poses or known movements simplify the acquisition of the shape parameters. Although some work represents a step forward on this matter, no method has been provided that can recover both shape and pose parameters from uncontrolled movement, e.g. the case of a person walking into a room and moving freely around.
The occlusion issue: Most methods cannot handle significant (self) occlusion and do not provide criteria to stop and restart tracking of segments. There is no concept of pose ambiguity either.
The modeling issue: Human models for vision systems have been adequately parameterized with respect to shape and articulation, but few have incorporated constraints such as joint angle limits and collision constraints, and even fewer have considered dynamical properties such as balance. In contrast to graphics applications, they have made little or no use of color and texture cues to capture appearance. Lacking entirely is the ability to deal with loose-fitting clothes. Finally, there is also a need to model the objects the humans interact with.
The real-time issue: A system and method must perform data acquisition and tracking in real-time without requiring markers or specialized clothing to be useful for most applications. All of the applications listed above, other than motion analysis, require real-time analysis to be useful.
Expense and simplicity issue: Methods must be capable of operation on systems that are inexpensive and should not require large amounts of hardware to be set up and connected.
The calibration issue: The data acquisition process must be relatively simple and fast to calibrate. Methods must be provided that allow the system to be calibrated by someone who is untrained in camera calibration techniques before such systems can be used outside of research. Calibration also needs to be relatively fast so that systems can be moved or changed without incurring large delays due to recalibration.
Each of the above references describes a method having unique capabilities, but no reference on its own satisfies all of the required characteristics. There remains a need in the art for a more advanced and efficient method of motion tracking. It is therefore an object of this invention to provide a real-time markerless motion tracking method utilizing linked kinematic chains to fulfill this need.