The present invention relates to the estimation of human body shape with a low-dimensional 3D model, using sensor data and other forms of input data that may be imprecise, ambiguous or partially obscured.
The citation of published references in this section is not an admission that the publications constitute prior art to the presently claimed subject matter.
Body scanning technology has a long history and many potential applications, ranging from health (fitness and weight loss) to entertainment (avatars and video games) and the garment industry (custom clothing and virtual “try-on”). Current methods, however, are limited in that they require complex, expensive or specialized equipment to capture three-dimensional (3D) body measurements.
Most previous methods for “scanning” the body have focused on highly controlled environments and used lasers, millimeter waves, structured light or other active sensing methods to measure the depth of many points on the body with high precision. These many points are then combined into a 3D body model or are used directly to estimate properties of human shape. All these previous methods focus on making thousands of measurements directly on the body surface and each of these must be very accurate. Consequently such systems are expensive to produce.
Because these previous methods focus on acquiring surface measurements, they fail to accurately acquire body shape when a person is wearing clothing that obscures their underlying body shape. Most types of sensors do not actually see the underlying body shape, which makes the problem of estimating that shape under clothing challenging even when high-accuracy range scanners are used. A key issue limiting the acceptance of body scanning technology in many applications has been modesty: most systems require the user to wear minimal or skin-tight clothing.
There are several methods for representing body shape with varying levels of specificity: 1) non-parametric models such as visual hulls (Starck and Hilton 2007; Boyer 2006), point clouds and voxel representations (Cheung et al. 2003); 2) part-based models using generic shape primitives such as cylinders or cones (Deutscher and Reid 2005), superquadrics (Kakadiaris and Metaxas 1998; Sminchisescu and Telea 2002) or “metaballs” (Plänkers and Fua 2003); 3) humanoid models controlled by a set of pre-specified parameters such as limb lengths that are used to vary shape (Grest et al. 2005; Hilton et al. 2000; Lee et al. 2000); 4) data-driven models where human body shape variation is learned from a training set of 3D body shapes (Anguelov et al. 2005; Balan et al. 2007a; Seo et al. 2006; Sigal et al. 2007, 2008).
Machine vision algorithms for estimating body shape have typically relied on structured light, photometric stereo, or multiple calibrated camera views in carefully controlled settings where the use of low-specificity models such as visual hulls is possible. As the image evidence decreases, more human-specific models are needed to recover shape. In both previous scanning methods and machine vision algorithms, the sensor measurements are limited, ambiguous, noisy or do not correspond directly to the body surface. Several methods fit a humanoid model to multiple video frames, depth images or multiple snapshots from a single camera (Sminchisescu and Telea 2002, Grest et al. 2005, Lee et al. 2000). These methods estimate only limited aspects of body shape, such as scaling parameters or joint locations, in a pre-processing step, and fail to capture the range of natural body shapes.
More realism is possible with data-driven methods that encode the statistics of human body shape. Seo et al. (2006) use a learned deformable body model for estimating body shape from one or more photos in a controlled environment with uniform background and with the subject seen in a single predefined posture with minimal clothing. They require at least two views (a front view and a side view) to obtain reasonable shape estimates. They choose viewing directions in which changes in pose are not noticeable and fit a single model of pose and shape to the front and side views. They do not combine body shape information across varying poses or deal with shape under clothing. The camera is stationary and calibrated in advance based on the camera height and distance to the subject. They optimize an objective function that combines a silhouette overlap term with one that aligns manually marked feature points on the model and in the image.
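An objective of the kind described above, combining a silhouette overlap term with a feature-point alignment term, can be illustrated with the following minimal sketch. It is not the formulation of Seo et al. (2006), whose exact terms and weights are not given here. The function names, the intersection-over-union form of the overlap term and the `landmark_weight` parameter are all assumptions made for illustration.

```python
from math import hypot


def silhouette_mismatch(model_mask, image_mask):
    """Return 1 - (intersection / union) of two binary masks; 0.0 means perfect overlap."""
    union = 0
    intersection = 0
    for model_row, image_row in zip(model_mask, image_mask):
        for m, i in zip(model_row, image_row):
            if m or i:
                union += 1
            if m and i:
                intersection += 1
    return 0.0 if union == 0 else 1.0 - intersection / union


def landmark_error(model_points, image_points):
    """Mean Euclidean distance between corresponding 2D feature points."""
    dists = [hypot(mx - ix, my - iy)
             for (mx, my), (ix, iy) in zip(model_points, image_points)]
    return sum(dists) / len(dists) if dists else 0.0


def combined_cost(model_mask, image_mask, model_points, image_points,
                  landmark_weight=0.1):
    """Weighted sum of the two terms; landmark_weight is an assumed tuning parameter."""
    return (silhouette_mismatch(model_mask, image_mask)
            + landmark_weight * landmark_error(model_points, image_points))
```

An optimizer would adjust the model parameters so that the projected mask and projected feature points lower this cost. The relative weight controls how strongly the manually marked points constrain the fit compared with the silhouette.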
There are several related methods that use a 3D body model called SCAPE (Anguelov et al. 2005). While there are many 3D graphics models of the human body, SCAPE is low dimensional and it factors changes in shape due to pose and identity. Anguelov et al. (2005) define the SCAPE model and show how it can be used in several graphics applications. They dealt with detailed laser scan data of naked bodies and did not fit the model to image data of any kind.
In Balan et al. (2007a) the SCAPE model was fit to image data for the first time. They projected the 3D model into multiple calibrated images and compared the projected body silhouette with foreground regions extracted using a known static background. An iterative importance sampling method was used to estimate the pose and shape that best explained the observed silhouettes. That method worked with as few as 3-4 cameras if they were placed appropriately and calibrated accurately. The method did not deal with clothing, estimating shape across multiple poses, or un-calibrated imagery.
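The general idea of fitting by silhouette comparison and iterative sampling can be sketched as follows. This is a toy illustration only and not the sampler of Balan et al. (2007a): an ellipse with two parameters stands in for the projected 3D body model, a single synthetic view stands in for the calibrated images, and the grid size, temperature, noise schedule and function names are all assumed.

```python
import math
import random

GRID = 32  # assumed image size in pixels


def render_silhouette(half_width, half_height):
    """Toy stand-in for projecting a body model: a filled ellipse as a set of pixels."""
    c = GRID / 2
    return {(x, y)
            for y in range(GRID) for x in range(GRID)
            if ((x - c) / half_width) ** 2 + ((y - c) / half_height) ** 2 <= 1.0}


def silhouette_cost(params, observed):
    """Count pixels where the projected model and the observed foreground disagree."""
    return len(render_silhouette(*params) ^ observed)


def importance_sampling_fit(observed, initial, iterations=8, particles=60,
                            sigma=4.0, temperature=20.0, seed=0):
    """Iteratively sample parameters, weight them by silhouette agreement, and resample."""
    rng = random.Random(seed)
    best, best_cost = initial, silhouette_cost(initial, observed)
    population = [initial] * particles
    for _ in range(iterations):
        # Propose new hypotheses by perturbing the current population.
        proposals = [tuple(max(0.5, p + rng.gauss(0.0, sigma)) for p in particle)
                     for particle in population]
        costs = [silhouette_cost(p, observed) for p in proposals]
        for p, c in zip(proposals, costs):
            if c < best_cost:
                best, best_cost = p, c
        # Lower cost gives a higher importance weight; the best proposal has weight 1.
        min_cost = min(costs)
        weights = [math.exp(-(c - min_cost) / temperature) for c in costs]
        population = rng.choices(proposals, weights=weights, k=particles)
        sigma *= 0.7  # narrow the search as the estimate improves
    return best, best_cost
```

For example, `importance_sampling_fit(render_silhouette(6.0, 12.0), (10.0, 10.0))` searches for ellipse parameters that explain a synthetic observation. With a real body model, the cost would be summed over all calibrated views.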
If more cameras are available, a visual hull or voxel representation can be extracted from image silhouettes (Laurentini 1994) and the body model can be fit to this 3D representation. Mundermann et al. (2007) fit a body model to this visual hull data by first generating a large number of example body shapes using SCAPE. They then searched this virtual database of body shapes for the best example body that fit the visual hull data. This shape model was then kept fixed and segmented into rigid parts. The body was tracked using an Iterative Closest Point (ICP) method to register the partitioned model with the volumetric data. The method required 8 or more cameras to work accurately.
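The registration step of an ICP method can be illustrated with the following simplified sketch. It aligns a single rigid 2D point set to a target point set, whereas the method described above registers a partitioned 3D model to volumetric data. The function names and the fixed iteration count are assumptions, and this is not the implementation of Mundermann et al. (2007).

```python
import math


def dist2(p, q):
    """Squared Euclidean distance between two 2D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def nearest(point, cloud):
    """Closest point in the target cloud (brute-force search)."""
    return min(cloud, key=lambda q: dist2(point, q))


def best_rigid_transform(source, matches):
    """Closed-form least-squares 2D rotation and translation mapping source onto matches."""
    n = len(source)
    csx, csy = sum(p[0] for p in source) / n, sum(p[1] for p in source) / n
    ctx, cty = sum(q[0] for q in matches) / n, sum(q[1] for q in matches) / n
    cross = dot = 0.0
    for (sx, sy), (tx, ty) in zip(source, matches):
        ax, ay, bx, by = sx - csx, sy - csy, tx - ctx, ty - cty
        cross += ax * by - ay * bx
        dot += ax * bx + ay * by
    angle = math.atan2(cross, dot)
    c, s = math.cos(angle), math.sin(angle)
    return angle, ctx - (c * csx - s * csy), cty - (s * csx + c * csy)


def apply_transform(points, angle, tx, ty):
    """Rotate points by angle about the origin, then translate by (tx, ty)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]


def mean_sq_error(source, target):
    """Mean squared distance from each source point to its nearest target point."""
    return sum(dist2(p, nearest(p, target)) for p in source) / len(source)


def icp(source, target, iterations=10):
    """Alternate between nearest-point matching and rigid re-alignment."""
    current = list(source)
    for _ in range(iterations):
        matches = [nearest(p, target) for p in current]
        current = apply_transform(current, *best_rigid_transform(current, matches))
    return current
```

An articulated tracker would run such an alignment for each rigid body part while enforcing the joint constraints between parts, which this sketch omits.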
There exists a class of discriminative methods that attempt to establish a direct mapping between sensor features and 3D body shape and pose. Many methods exist that predict pose parameters, but only Sigal et al. (2007, 2008) predict shape parameters as well. Discriminative approaches do not use an explicit model of the human body for fitting, but may use a humanoid model for generating training examples. Such approaches are computationally efficient but require a training database that spans all possible poses, body shapes, and/or scene conditions (camera view direction, clothing, lighting, background, etc.) to be effective. None of these methods deal with clothing variations. Moreover, the performance degrades significantly when the image features are corrupted by noise or clutter. In such cases, a generative approach is more appropriate as it models the image formation process explicitly; a discriminative approach is then typically used for initializing the generative approach.
Grauman et al. (2003) used a 3D graphics model of the human body to generate many training examples of synthetic people in different poses. The model was not learned from data of real people and lacked realism. Their approach projected each training body into one or more synthetic camera views to generate a training set of 2D contours. Because the camera views must be known during training, this implies that the locations of the multiple cameras are roughly calibrated in advance (at training time). They learned a statistical model of the multi-view 2D contour rather than the 3D body shape and then associated the different contour parameters with the structural information about the 3D body that generated them. Their estimation process involved matching 2D contours from the learned model to the image and then inferring the related structural information (they recovered pose and did not show the recovery of body shape). Our approach of modeling shape in 3D is more powerful because it allows the model to be learned independent of the number of cameras and camera location. Our 3D model can be projected into any view or any number of cameras and the shape of the 3D model can be constrained during estimation to match known properties. Grauman et al. (2003) did not deal with estimating shape under clothing or the combination of information about 3D body shape across multiple articulated poses. Working with a 3D shape model that factors pose and shape allows us to recover a consistent 3D body shape from multiple images where each image may contain a different pose.
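The independence of a 3D model from the camera configuration can be illustrated with a simple pinhole projection, in which the same 3D points are projected into any number of views. The following is a minimal sketch under assumed conventions. The camera parameterization (rotation about the vertical axis, a fixed distance to the subject, a focal length and an image center) and all names are hypothetical and are not taken from any of the cited methods.

```python
import math


def make_camera(yaw_degrees, distance, focal_length=500.0, center=(320.0, 240.0)):
    """Hypothetical camera on a circle around the subject, looking at the origin."""
    return {"yaw": math.radians(yaw_degrees), "distance": distance,
            "focal": focal_length, "center": center}


def project(points_3d, camera):
    """Pinhole projection of 3D points (x, y, z) into pixel coordinates for one camera."""
    c, s = math.cos(camera["yaw"]), math.sin(camera["yaw"])
    cx, cy = camera["center"]
    pixels = []
    for x, y, z in points_3d:
        # Rotate the scene about the vertical (y) axis, then place it in front of the camera.
        xc = c * x + s * z
        zc = -s * x + c * z + camera["distance"]
        pixels.append((camera["focal"] * xc / zc + cx,
                       camera["focal"] * y / zc + cy))
    return pixels


def project_into_views(points_3d, cameras):
    """Project the same 3D model into every available view."""
    return [project(points_3d, camera) for camera in cameras]
```

Because the model lives in 3D, adding or moving a camera changes only the list passed to `project_into_views`. A model learned on multi-view 2D contours, by contrast, is tied to the views used during training.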
None of the methods above are able to accurately estimate detailed body shape from un-calibrated perspective cameras, monocular images, or people wearing clothing.
Hasler et al. (2009c) are the first to fit a learned parametric body model to 3D laser scans of dressed people. Their method uses a single pose of the subject and requires the specification of sparse point correspondences between feature locations on the body model and the laser scan; a human operator provides these. They use a body model (Hasler et al. 2009b) similar to SCAPE in that it accounts for articulated and non-rigid pose and identity deformations, but unlike SCAPE, it does not factor pose and shape in a way that allows for the pose to be adjusted while the identity of body shape is kept constant. This is important because estimating shape under clothing is significantly under-constrained in the single-pose case, whereas combining information from multiple articulated poses can constrain the solution. Their method provides no direct way to ensure that the estimated shape is consistent across different poses. They require a full 360 degree laser scan and do not estimate shape from images or range sensing cameras.