The present invention is directed to data analysis, such as audio analysis, image analysis and video analysis, and more particularly to the estimation of hidden data from observed data. For image analysis, this hidden data estimation involves the placement of control points on unmarked images or sequences of images to identify corresponding fiducial points on objects in the images.
Some types of data analysis and data manipulation operations require that “hidden” data first be derived from observable data. In the field of speech analysis, for example, one form of observable data is pitch-synchronous frames of speech samples. To perform linear predictive coding on a speech signal, the pitch-synchronous frames are labeled to identify vocal-tract positions. The pitch-synchronous data is observable in the sense that it is intrinsic to the data and can be easily derived using known signal processing techniques simply by the correct alignment between the speech sample and a frame window. In contrast, the vocal tract positions must be estimated either using some extrinsic assumptions (such as an acoustic waveguide having uniform length sections with each section of constant width) or using a general modeling framework with parameter values derived from an example database (e.g. linear manifold model with labeled data). Therefore, the vocal tract positions are known as “hidden” data.
In image processing applications, the observable data of an image includes attributes such as color or grayscale values of individual pixels, range data, and the like. In some types of image analysis, it is necessary to identify specific points in an image that serve as the basis for identifying object configurations or motions. For example, in gesture recognition, it is useful to identify the locations and motions of each of the fingers. Another type of image processing application relates to image manipulation. For example, in image morphing, where one image transforms into another image, it is necessary to identify points of correspondence in each of the two images. If an image of a face is to morph into an image of a different face, for example, it may be appropriate to identify points in each of the two images that designate the outline and tip of the nose, the outlines of the eyes and the irises, the inner and outer boundaries of the mouth, the tops and bottoms of the upper and lower teeth, the hairline, etc. After the corresponding points in the two images have been identified, they serve as constraints for controlling the manipulation of pixels during the transform from one image to the other.
In a similar manner, control points are useful in video compositing operations, where a portion of an image is incorporated into a video frame. Again, corresponding points in the two images must be designated, so that the incorporated image will be properly aligned and scaled with the features of the video frame into which it is being incorporated. These control points are one form of hidden data in an image.
In the past, the identification of hidden data, such as control points in an image, was typically carried out on a manual basis. In most morphing processes, for example, a user was required to manually specify all of the corresponding control points in the beginning and ending images. If only two images are involved, this requirement is somewhat tedious, but manageable. However, in situations involving databases that contain a large number of images, the need to manually identify the control points in each image can become quite burdensome. For example, U.S. Pat. No. 5,880,788 discloses a video manipulation system in which images of different mouth positions are selected from a database and incorporated into a video stream, in synchrony with a soundtrack. For optimum results, control points which identify various fiducial points on the image of a person's mouth are designated for each frame in the video, as well as each mouth image stored in the database. These control points serve as the basis for aligning the image of the mouth with the image of a person's face in the video frame. It can be appreciated that manual designation of the control points for all of the various images in such an application can become quite cumbersome.
Most previous efforts at automatically recognizing salient components of an image have concentrated on features within the image. For example, two articles entitled “View-Based and Modular Eigenspaces for Face Recognition,” Pentland et al., Proc. IEEE ICCVPR '94, 1994, and “Probabilistic Visual Learning for Object Detection,” Moghaddam et al., Proc. IEEE CVPR, 1995, disclose a technique in which various features of a face, such as the nose, eyes, and mouth, can be automatically recognized. Once these features have been identified, an alignment point is designated for each feature, and the variations of the newly aligned features from the expected appearances of the features can be used for recognition of a face.
While this technique is useful for data alignment in applications such as face recognition, it does not by itself provide a sufficient number of data points for image manipulation techniques, such as morphing and image compositing, or other types of image processing which rely upon the location of a large number of specific points, such as general gesture or expression recognition.
Other prior art techniques for determining data points from an image employ active contour models or shape-plus-texture models. Active contour models, also known as “snakes”, are described in M. Kass, A. Witkin, D. Terzopoulos, “Snakes: Active Contour Models,” IEEE International Conference on Computer Vision, 1987, and C. Bregler and S. Omohundro, “Surface Learning with Applications to Lipreading,” Neural Information Processing Systems, 1994. The approaches described in these references use a relaxation technique to find a local minimum of an “energy function”, where the energy function is the sum of an external energy term, determined from the grayscale values of the image, and an internal energy term, determined from the configuration of the snake or contour itself. The external energy term typically measures the local image gradient or the local image difference from some expected value. The internal energy term typically measures local “shape” (e.g. curvature, length). The Bregler and Omohundro reference discloses the use of a measure of distance between the overall shape of the snake and the expected shapes for the contours being sought as an internal energy term.
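By way of illustration only, the energy function and relaxation described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the formulation of either cited reference: the function names (`gradient_sq`, `snake_energy`, `relax`), the weights `alpha` and `beta`, the use of squared first and second differences for length and curvature, and the greedy one-pixel search are all hypothetical choices made for the example.

```python
def gradient_sq(image, r, c):
    """Squared local image gradient at (r, c), using differences clamped at the borders."""
    rows, cols = len(image), len(image[0])
    r0, r1 = max(r - 1, 0), min(r + 1, rows - 1)
    c0, c1 = max(c - 1, 0), min(c + 1, cols - 1)
    gy = image[r1][c] - image[r0][c]
    gx = image[r][c1] - image[r][c0]
    return gx * gx + gy * gy


def snake_energy(points, image, alpha=1.0, beta=1.0):
    """Sum of internal (shape) and external (image) energy for a closed contour.

    points: list of (row, col) integer positions; image: 2-D list of grayscale values.
    alpha and beta are assumed weights on the length and curvature terms.
    """
    n = len(points)
    internal = 0.0
    external = 0.0
    for i in range(n):
        pr, pc = points[i - 1]          # previous point (wraps around)
        r, c = points[i]
        nr, nc = points[(i + 1) % n]    # next point (wraps around)
        length = (r - pr) ** 2 + (c - pc) ** 2
        curvature = (pr - 2 * r + nr) ** 2 + (pc - 2 * c + nc) ** 2
        internal += alpha * length + beta * curvature
        # Strong gradients lower the energy, so the contour is drawn to edges.
        external -= gradient_sq(image, r, c)
    return internal + external


def relax(points, image, alpha=1.0, beta=1.0, max_iters=100):
    """Greedy relaxation: move one point by one pixel whenever that lowers the energy."""
    points = list(points)
    rows, cols = len(image), len(image[0])
    best = snake_energy(points, image, alpha, beta)
    for _ in range(max_iters):
        improved = False
        for i in range(len(points)):
            r, c = points[i]
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nr, nc = r + dr, c + dc
                    if not (0 <= nr < rows and 0 <= nc < cols):
                        continue
                    trial = points[:i] + [(nr, nc)] + points[i + 1:]
                    energy = snake_energy(trial, image, alpha, beta)
                    if energy < best:
                        best, points, improved = energy, trial, True
        if not improved:
            break  # local minimum of the energy function reached
    return points, best
```

Because each accepted move strictly lowers the energy, the search stops at a local minimum only, which is consistent with the description above; the result depends on the starting contour and on the chosen weights.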
Snakes can easily be thought of as providing control point locations, and the extension to snakes taught by the Bregler et al. reference allows one to take advantage of example-based learning to constrain the estimated locations of these control points. However, there is no direct link between the image appearance and the shape constraints. This makes the discovery of a “correct” energy function an error-prone process, which relies heavily on the experience of the user and on his familiarity with the problem at hand. The complete energy function is not easily and automatically derived from data-analysis of an example training set.
Shape-plus-texture models are described in A. Lanitis, C. J. Taylor, T. F. Cootes, “A Unified Approach to Coding and Interpreting Face Images,” International Conference on Computer Vision, 1995, and D. Beymer, “Vectorizing Face Images by Interleaving Shape and Texture Computations,” A.I. Memo 1537. Shape-plus-texture models describe the appearance of an object in an image using shape descriptions (e.g. contour locations or multiple point locations) plus a texture description, such as the expected grayscale values at specified offsets relative to the shape-description points. The Beymer reference discloses that the model for texture is example-based, using an affine manifold model description derived from the principal component analysis of a database of shape-free images (i.e. the images are pre-warped to align their shape descriptions). The shape model is unconstrained (which the reference refers to as “data-driven”), and, in labeling, is allowed to vary arbitrarily based on a pixel-level mapping derived from optical flow. In the Lanitis et al. reference, both the shape and the texture models are derived separately from examples, using affine manifold model descriptions derived from principal component analyses of a database. For the shape model, the shape description locations (the control point (x,y) locations) are analyzed directly (independent of the grayscale image data) to get the shape manifold. For the texture model, as in the Beymer reference, the example grayscale images are pre-warped to provide “shape-free texture” and these shape-free images are analyzed to get the texture manifold model.
In both references, the locations for control points on a new (unlabeled) image are estimated using an iterative technique. First, a shape description for a new image is estimated (i.e. x,y control point locations are estimated), only allowing shape descriptions which are consistent with the shape model. In the Beymer reference, this could be any shape description. Then, a “shape-free texture” image is computed by warping the new image data according to the estimated shape model. The distance between this shape-free texture image and the texture model is used to determine a new estimate of shape. In the case of the Beymer reference, the new estimated shape is determined by unconstrained optical flow between the shape-free unlabeled image and the closest point in the texture manifold. The Lanitis reference uses a similar update mechanism with the added constraint that the new shape model must lie on the shape manifold. After iterating until some unspecified criterion is met, the last shape description can be used to describe control point locations on the input image.
Shape-plus-texture methods give estimates for many control-point locations. They also provide well-defined example-based training methods and error criteria derived from that example-based training. However, the models which are derived for these approaches rely on estimates of unknown parameters: they need an estimate of shape in order to process the image data. Thus, they are forced to rely on iterative solutions. Furthermore, the shape and texture models do not explicitly take advantage of the coupling between shape and the image data. The models of admissible shapes are derived without regard to the image values, and the models of admissible textures are derived only after “normalizing out” the shape model.
When deriving models to allow estimates for unknown parameters, the coupling between observable parameters, such as image grayscale values, and the unknown parameters in the description should preferably be captured, rather than the independent descriptions of the unknown parameters and of the “normalized” known parameters. This is similar to the difference between “reconstructive” models (models that allow data to be reconstructed with minimum error) and “discriminative” models (models that allow unknown classification data to be estimated with minimum error).
In accordance with the present invention, the determination of hidden data from observed data is achieved through a two-stage approach. The first stage involves a learning process, in which a number of sample data sets, e.g. images, are analyzed to identify the correspondence between observable data, such as visual aspects of the image, and the desired hidden data, e.g. control points. With reference to the case of image analysis, a number of representative images are labeled with control point locations relating to features of interest. An appearance-only feature model is created from aligned images of each feature. The aligned image data is rotated into standard orientations, to generate a coupled model of the aligned feature appearance and the control point locations around that feature. For example, for a coupled affine manifold model, the expected (average) vectors for both the visible image data and the control point locations are derived from all of the individual vectors for the labeled representative images. A linear manifold model of the combined image deviations and location deviations is also determined from this data. This feature model represents the distribution of visible aspects of an image and the locations of control points, and the coupling relationship between them.
In the second stage of the process, a feature is located on an unmarked image using the appearance-only feature model. The relevant portion of the image is then analyzed to determine a vector for the visible image data. This vector is compared to the average vector for the representative images, and the deviations are determined. These values are projected onto the data model, to identify the locations of the control points in the unmarked image.
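The two stages can be illustrated with a short numerical sketch. It is offered only as one possible reading of the description, under several assumptions that the text does not state: the coupled linear manifold is obtained from a singular value decomposition of the stacked image and control-point deviations, the projection step is a least-squares fit of the manifold coordinates to the image deviations alone, and no relative scaling is applied between pixel values and point coordinates. The function names and the parameter `k` (the number of retained manifold dimensions) are hypothetical.

```python
import numpy as np


def train_coupled_model(images, points, k):
    """Stage 1: learn mean vectors and a coupled linear manifold from labeled examples.

    images: (n, p) array, one aligned feature image per row, flattened.
    points: (n, q) array, the control point coordinates for each example.
    k: assumed number of manifold dimensions to keep.
    """
    images = np.asarray(images, dtype=float)
    points = np.asarray(points, dtype=float)
    mean_img = images.mean(axis=0)
    mean_pts = points.mean(axis=0)
    # Each row couples one example's image deviations with its location deviations.
    coupled = np.hstack([images - mean_img, points - mean_pts])
    # The leading right singular vectors span the coupled linear manifold.
    _, _, vt = np.linalg.svd(coupled, full_matrices=False)
    basis = vt[:k].T                      # shape (p + q, k)
    p = images.shape[1]
    return mean_img, mean_pts, basis[:p], basis[p:]


def estimate_control_points(image_vec, model):
    """Stage 2: estimate control point locations for an unmarked, aligned feature image."""
    mean_img, mean_pts, basis_img, basis_pts = model
    deviation = np.asarray(image_vec, dtype=float) - mean_img
    # Fit manifold coordinates using only the observable (image) part of the basis.
    coords, *_ = np.linalg.lstsq(basis_img, deviation, rcond=None)
    # The same coordinates, applied to the location part, give the hidden data.
    return mean_pts + basis_pts @ coords
```

In this sketch the image vector passed to `estimate_control_points` is assumed to have already been located and aligned, for example by the appearance-only feature model. When the image values and the control point locations both vary linearly with a small number of shared factors, and `k` matches that number, the estimate is exact; with real images it is an approximation whose quality depends on the training set and on `k`.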
In a low-resolution implementation of the invention, certain assumptions are made regarding the correspondence between the visible image data and the control-point locations. These assumptions can be used to reduce the amount of computation that is required to derive the model from the training data, as well as that which is required to locate the control points in the labeling process. The low-resolution approach may be desirable in those applications where a high degree of precision is not required, such as in a low-resolution video morphing or compositing system. In a second implementation of the invention, additional computations are carried out during both the training and labeling steps, to provide a higher degree of precision in the location of the control points. This higher-resolution implementation provides a greater degree of control for processes such as high-resolution video morphing or compositing and the like.