Three-dimensional (3D) imaging systems, such as stereovision cameras, time-of-flight (TOF) cameras and structured-light cameras, produce a temporal sequence of depth maps, which are 2D images providing at least X, Y, Z coordinates for each imaged point of a scene. The X and Y coordinates may indicate the horizontal and vertical position of the pixel in the camera sensor matrix array, and the Z coordinate may indicate the distance of the imaged point in the scene to the imaging device. Alternatively, each imaged point of a scene may comprise X, Y, Z coordinates corresponding to its location in a 3D space, the coordinates being expressed with respect to a 3D coordinate system having an origin, for example, at a reference point. The camera location may be selected as the reference point in order to specify a camera coordinate system. However, the imaged points of a scene may also be expressed in other 3D coordinate systems where the reference point is not set at the camera location, but is determined to be at a point location in the real world scene being imaged, such that the X, Y, Z coordinates of each imaged point of the scene represent a real position in a so-called world coordinate system. Conversions between the real world and camera coordinate systems can, within certain limits, simply be performed by applying a mathematical transformation using, for example, a calibration matrix, in order to perform a geometric projection of the particular 3D coordinates.
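Such a conversion between the camera and world coordinate systems can be sketched as follows. This is a minimal illustrative example, not a description of any particular system: the intrinsic parameters (fx, fy, cx, cy) and the extrinsic camera-to-world matrix below are hypothetical placeholder values; in practice they would be obtained by calibration.

```python
import numpy as np

# Hypothetical intrinsic parameters (focal lengths fx, fy and principal
# point cx, cy); real values would come from camera calibration.
fx, fy, cx, cy = 525.0, 525.0, 319.5, 239.5

def pixel_to_camera(u, v, z):
    """Back-project a depth-map pixel (u, v) with depth z into the
    camera coordinate system using a simple pinhole model."""
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z, 1.0])  # homogeneous coordinates

# Hypothetical extrinsic calibration matrix (rotation and translation)
# mapping camera coordinates into world coordinates.
cam_to_world = np.array([
    [1.0, 0.0, 0.0, 0.5],   # world origin offset 0.5 m along X
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],   # camera placed 2 m from the world origin
    [0.0, 0.0, 0.0, 1.0],
])

p_cam = pixel_to_camera(320, 240, 1.5)   # pixel imaged at 1.5 m depth
p_world = cam_to_world @ p_cam           # geometric projection to world
```

The same matrix, inverted, maps world coordinates back into the camera coordinate system, which is why a single calibration matrix suffices for conversions in both directions.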
Whatever coordinate system is used, the produced depth maps may then be processed in order to detect, localise, track, segment, and analyse objects, including articulated human bodies or animal bodies, namely, the users, in the scene using specific 2D or 3D analysis methods such as described in WO-A-2011/080282. One result of such a method may, in particular, assist in defining a set of 3D points as being the ones that represent, in the virtual world, each body or object in the real scene.
Processing such a 3D point cloud representing an object or at least a body of a user over time allows mapping, fitting and tracking of a model or of other kinds of representations of the object or body. For example, a skeletal representation of a human body, or of an object, may be mapped, fitted and tracked in order to monitor or control the corresponding virtual representation of the object or body in a virtual environment with respect to movements of the object or user in the real world. This is termed motion capture.
In prior art image processing techniques, the usual method for tracking a skeleton within a scene requires the use of markers associated with a user whose skeleton is to be tracked, the markers being tracked rather than the user himself/herself. In some instances, these markers are attached to a suit or to another item that is worn by the user.
More recently, the data output by range imaging devices, namely, depth maps, started to be used for marker-less skeletal tracking. Using such imaging devices, the tracking relies on 2D or 3D motion detection and on some estimation techniques, mixed with body part recognition using pattern matching techniques. In addition, pose recognition and estimation also mainly rely on matching techniques with a model.
In US2010/0034457, a computer-implemented method for modelling humanoid forms from depth maps is disclosed. More specifically, the method includes receiving a depth map of a scene containing a body of a humanoid subject. The depth map includes a matrix of pixels, each pixel corresponding to a respective location in the scene and having a respective pixel value indicative of a distance from a reference location to the respective location. The depth map is segmented so as to find a contour of the body which is subsequently processed in order to identify a torso and one or more limbs of the considered subject. By analysing a disposition of at least one of the identified limbs in the depth map, input signals are generated to control an application program running on a computer.
In US-A-2011/0052006, a method is described for extracting a skeleton from a depth map. The method includes receiving a temporal sequence of depth maps of a scene containing a humanoid form having a head. The depth maps include a matrix of pixels having respective pixel depth values. A digital processor processes at least one of the depth maps so as to find a location of the head and estimates dimensions of the humanoid form based on the location thereof, the humanoid standing in a calibration pose or posture. The processor tracks movements of the humanoid form over the sequence using the estimated dimensions, body part identification and motion estimation methods.
In US-A-2010/0194872, systems and methods for capturing depth information of a scene are used to process a human input. A depth image of a scene is captured by an imaging device. The image capture is dependent on the orientation of the camera with respect to the scene. The depth image is then analysed to determine whether the depth image includes both human and non-human targets. For example, the depth image may include one or more targets including a human target and some non-human targets. According to one embodiment, each target is flood-filled and compared to a pattern to determine whether the target is a human target or not. If one or more of the targets in the depth image comprises a human target, the human target is scanned, and a skeletal model of the human target is generated based on the scan of a binary mask of the human target from which body parts are identified.
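The flood-filling step mentioned above can be illustrated with a minimal sketch. This is not the method of US-A-2010/0194872 itself, but a generic 4-connected flood fill over a small hand-written depth image, assuming that a depth value of 0.0 means "no return" and that neighbouring pixels belong to the same target when their depths differ by at most a tolerance.

```python
from collections import deque

# Hypothetical 3x5 depth image (metres); 0.0 means no depth return.
DEPTH = [
    [0.0, 2.1, 2.1, 0.0, 0.0],
    [0.0, 2.0, 2.1, 0.0, 3.5],
    [0.0, 2.0, 0.0, 0.0, 3.5],
]
TOL = 0.2  # assumed depth tolerance between neighbouring pixels

def flood_fill_targets(depth):
    """Label each 4-connected region of similar depth; returns a dict
    mapping target label -> list of (row, col) pixels."""
    rows, cols = len(depth), len(depth[0])
    labels = [[0] * cols for _ in range(rows)]
    targets = {}
    next_label = 1
    for r in range(rows):
        for c in range(cols):
            if depth[r][c] == 0.0 or labels[r][c]:
                continue
            queue, pixels = deque([(r, c)]), []
            labels[r][c] = next_label
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not labels[ny][nx]
                            and depth[ny][nx] != 0.0
                            and abs(depth[ny][nx] - depth[y][x]) <= TOL):
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
            targets[next_label] = pixels
            next_label += 1
    return targets

targets = flood_fill_targets(DEPTH)
# Each filled target could then be compared to a body pattern to decide
# whether it is a human target.
```

On the sample image this yields two targets: one five-pixel region around 2.0-2.1 m and one two-pixel region at 3.5 m.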
In US-A-2011/0249865, an image processing based method for tracking marker-less motions of a subject in a three-dimensional (3D) environment is disclosed, wherein input images comprising depth information are used. The method utilises two-dimensional (2D) lower and higher body part detection units using a movement detection principle. These detection units are associated with several 3D body part detection units using lower and higher body part models to localise, in space, individual candidates for each of the 3D body parts. A model rendering unit is used to render the complete model in accordance with some predicted body pose.
In US2011/0292036, a method using a depth sensor with an application interface is disclosed. The method comprises performing data processing on a depth map of a scene containing a body of a humanoid subject. In a similar method to that used in US2010/0034457 discussed above, the depth map includes a matrix of pixels, each pixel corresponding to a respective location in the scene and having a respective pixel depth value indicative of a distance from a reference plane to the respective location. The depth map is then processed in a digital processor to extract a skeleton of at least a part of the body of the humanoid subject, the skeleton including multiple joints having respective coordinates and comprising at least two shoulder joints having different, respective depth values which are used for defining a coronal plane of the body that is rotated by at least 10° relative to a reference plane. An application program interface (API) indicates at least the coordinates of the joints.
In US2011/0211754, a method for tracking body parts by combined colour image and depth processing is disclosed. This method relies on image processing techniques and includes receiving a depth image of a scene containing a human subject and receiving a colour image of the scene containing the human subject. A part of a body of the subject is identified in at least one of the images. The quality of both the depth image and the colour image is evaluated, and in response to the quality, one of the images is selected to be dominant in processing of the part of the body in the images. The identified part is localised in the dominant image, while using supporting data from the other image.
Although some existing methods disclose skeleton mapping according to specific embodiments, one important concern that is not properly addressed is how to use a depth map, or a corresponding 3D point cloud representing each imaged point of a scene, so as to provide a robust and efficient method having a processing time independent of the native depth map resolution, in particular when the full processing from a raw depth map to a robustly and efficiently fitted skeleton is to be performed in real time on low-end hardware platforms.
Additionally, there is no disclosure of any object fitting and tracking method that is capable of handling occlusion of segments and of accommodating joint limits, velocity constraints, and collision constraints at the same time. Moreover, none of the existing methods are able to recover from posture errors while being morphology agnostic to the user or to the object being tracked. Furthermore, none of the existing methods make use of an estimation of central axes of parts of an object to improve the fitting and tracking of its skeletal representation, or of multi-criteria iterative energy minimisation for the fitting process.