Some types of data analysis and data manipulation operations require that "hidden" data first be derived from observable data. In the field of speech analysis, for example, one form of observable data is pitch-synchronous frames of speech samples. To perform linear predictive coding on a speech signal, the pitch-synchronous frames are labelled to identify vocal-tract positions. The pitch-synchronous data is observable in the sense that it is intrinsic to the data and can be easily derived using known signal processing techniques simply by the correct alignment between the speech sample and a frame window. In contrast, the vocal tract positions must be estimated either using some extrinsic assumptions (such as an acoustic waveguide having uniform length sections with each section of constant width) or using a general modeling framework with parameter values derived from an example database (e.g. linear manifold model with labelled data). Therefore, the vocal tract positions are known as "hidden" data.
In image processing applications, the observable data of an image includes attributes such as color or grayscale values of individual pixels, range data, and the like. In some types of image analysis, it is necessary to identify specific points in an image that serve as the basis for identifying object configurations or motions. For example, in gesture recognition, it is useful to identify the locations and motions of each of the fingers. Another type of image processing application relates to image manipulation. For example, in image morphing, where one image transforms into another image, it is necessary to identify points of correspondence in each of the two images. If an image of a face is to morph into an image of a different face, for example, it may be appropriate to identify points in each of the two images that designate the outline and tip of the nose, the outlines of the eyes and the irises, the inner and outer boundaries of the mouth, the tops and bottoms of the upper and lower teeth, the hairline, etc. After the corresponding points in the two images have been identified, they serve as constraints for controlling the manipulation of pixels during the transform from one image to the other.
In a similar manner, control points are useful in video compositing operations, where a portion of an image is incorporated into a video frame. Again, corresponding points in the two images must be designated, so that the incorporated image will be properly aligned and scaled with the features of the video frame into which it is being incorporated. These control points are one form of hidden data in an image.
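By way of illustration only, the following Python sketch shows one way corresponding control points could be used to align and scale an incorporated image. It is not taken from any of the cited references. It assumes that rotation can be ignored and fits a single uniform scale and a translation by least squares; the function name fit_scale_translation and its arguments are hypothetical.

```python
def fit_scale_translation(src, dst):
    """Least-squares uniform scale s and translation (tx, ty) such that
    s * src[i] + (tx, ty) approximates dst[i]. Rotation is assumed absent."""
    n = len(src)
    if n != len(dst) or n < 2:
        raise ValueError("need at least two corresponding point pairs")

    # Centroids of the two control-point sets.
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n

    # Scale is the ratio of the cross term to the source spread,
    # both measured about the centroids.
    num = 0.0
    den = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - sx, py - sy
        bx, by = qx - dx, qy - dy
        num += ax * bx + ay * by
        den += ax * ax + ay * ay
    if den == 0.0:
        raise ValueError("source control points must not all coincide")

    s = num / den
    # Translation maps the scaled source centroid onto the target centroid.
    return s, (dx - s * sx, dy - s * sy)
```

With such a fit, each pixel location of the incorporated image would be multiplied by the scale and shifted by the translation before being written into the video frame. A practical system would typically also account for rotation or more general warps.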
In the past, the identification of hidden data, such as control points in an image, was typically carried out on a manual basis. In most morphing processes, for example, a user was required to manually specify all of the corresponding control points in the beginning and ending images. If only two images are involved, this requirement is somewhat tedious, but manageable. However, in situations involving databases that contain a large number of images, the need to manually identify the control points in each image can become quite burdensome. For example, U.S. Pat. No. 5,880,788 discloses a video manipulation system in which images of different mouth positions are selected from a database and incorporated into a video stream, in synchrony with a soundtrack. For optimum results, control points which identify various fiducial points on the image of a person's mouth are designated for each frame in the video, as well as each mouth image stored in the database. These control points serve as the basis for aligning the image of the mouth with the image of a person's face in the video frame. It can be appreciated that manual designation of the control points for all of the various images in such an application can become quite cumbersome.
Most previous efforts at automatically recognizing salient components of an image have concentrated on features within the image. For example, two articles entitled "View-Based and Modular Eigenspaces for Face Recognition," Pentland et al, Proc. IEEE ICCVPR '94, 1994, and "Probabilistic Visual Learning for Object Detection," Moghaddam et al, Proc. IEEE CVPR, 1995, disclose a technique in which various features of a face, such as the nose, eyes, and mouth, can be automatically recognized. Once these features have been identified, an alignment point is designated for each feature, and the variations of the newly aligned features from the expected appearances of the features can be used for recognition of a face.
While this technique is useful for data alignment in applications such as face recognition, it does not by itself provide a sufficient number of data points for image manipulation techniques, such as morphing and image compositing, or other types of image processing which rely upon the location of a large number of specific points, such as general gesture or expression recognition.
Other prior art techniques for determining data points from an image employ active contour models or shape-plus-texture models. Active contour models, also known as "snakes", are described in M. Kass, A. Witkin, D. Terzopoulos, "Snakes: Active Contour Models," IEEE International Conference on Computer Vision, 1987, and C. Bregler and S. Omohundro, "Surface Learning with Applications to Lipreading," Neural Information Processing Systems, 1994. The approaches described in these references use a relaxation technique to find a local minimum of an "energy function", where the energy function is the sum of an external energy term, determined from the grayscale values of the image, and an internal energy term, determined from the configuration of the snake or contour itself. The external energy term typically measures the local image gradient or the local image difference from some expected value. The internal energy term typically measures local "shape" (e.g. curvature, length). The Bregler and Omohundro reference discloses the use of a measure of distance between the overall shape of the snake and the expected shapes for the contours being sought as an internal energy term.
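The structure of such an energy function can be sketched as follows. This Python sketch is illustrative only and does not reproduce the formulation of either reference: it assumes a closed contour of integer pixel positions, uses the negative squared image gradient as the external term and squared stretch and bend measures as the internal term, and substitutes a simple greedy neighborhood search for the relaxation technique. All function names and the weights alpha, beta and gamma are hypothetical.

```python
def gradient_sq(image, r, c):
    """Squared gradient magnitude at (r, c), using differences clamped at the border."""
    rows, cols = len(image), len(image[0])
    r0, r1 = max(r - 1, 0), min(r + 1, rows - 1)
    c0, c1 = max(c - 1, 0), min(c + 1, cols - 1)
    gy = image[r1][c] - image[r0][c]
    gx = image[r][c1] - image[r][c0]
    return gx * gx + gy * gy


def snake_energy(image, contour, alpha=1.0, beta=1.0, gamma=1.0):
    """Total energy = internal (stretch + bend) + external (negative gradient)."""
    n = len(contour)
    e_int = 0.0
    e_ext = 0.0
    for i in range(n):
        pr, pc = contour[i - 1]          # previous point (wraps around)
        r, c = contour[i]
        nr, nc = contour[(i + 1) % n]    # next point (wraps around)
        stretch = (nr - r) ** 2 + (nc - c) ** 2
        bend = (pr - 2 * r + nr) ** 2 + (pc - 2 * c + nc) ** 2
        e_int += alpha * stretch + beta * bend
        e_ext -= gamma * gradient_sq(image, r, c)
    return e_int + e_ext


def relax(image, contour, max_iters=100):
    """Greedy descent: move one point at a time to a 3x3 neighbor whenever
    that lowers the total energy; stop at a local minimum."""
    contour = list(contour)
    rows, cols = len(image), len(image[0])
    best = snake_energy(image, contour)
    for _ in range(max_iters):
        moved = False
        for i in range(len(contour)):
            r, c = contour[i]
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if not (0 <= rr < rows and 0 <= cc < cols):
                        continue
                    trial = contour[:i] + [(rr, cc)] + contour[i + 1:]
                    e = snake_energy(image, trial)
                    if e < best:
                        best, contour, moved = e, trial, True
        if not moved:
            break
    return contour, best
```

Because every accepted move strictly lowers the energy, the search ends at a local minimum whose location depends on the starting contour and on the chosen weights, which is consistent with the sensitivity to the choice of energy function noted for these methods.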
Snakes can easily be thought of as providing control point locations, and the extension to snakes taught by the Bregler et al reference allows one to take advantage of example-based learning to constrain the estimated locations of these control points. However, there is no direct link between the image appearance and the shape constraints. This makes the discovery of a "correct" energy functional an error-prone process, which relies heavily on the experience of the user and on the user's familiarity with the problem at hand. The complete energy functional is not easily and automatically derived from data-analysis of an example training set.
Shape-plus-texture models are described in A. Lanitis, C. J. Taylor, T. F. Cootes, "A Unified Approach to Coding and Interpreting Face Images," International Conference on Computer Vision, 1995, and D. Beymer, "Vectorizing Face Images by Interleaving Shape and Texture Computations," A. I. Memo 1537. Shape-plus-texture models describe the appearance of an object in an image using shape descriptions (e.g. contour locations or multiple point locations) plus a texture description, such as the expected grayscale values at specified offsets relative to the shape-description points. The Beymer reference discloses that the model for texture is example-based, using an affine manifold model description derived from the principal component analysis of a database of shape-free images (i.e. the images are pre-warped to align their shape descriptions). The shape model is unconstrained (which the reference refers to as "data-driven"), and, in labelling, is allowed to vary arbitrarily based on a pixel-level mapping derived from optical flow. In the Lanitis et al. reference, both the shape and the texture models are derived separately from examples, using affine manifold model descriptions derived from principal component analyses of a database. For the shape model, the shape description locations (the control point (x,y) locations) are analyzed directly (independent of the grayscale image data) to get the shape manifold. For the texture model, as in the Beymer reference, the example grayscale images are pre-warped to provide "shape-free texture" and these shape-free images are analyzed to get the texture manifold model.
In both references, the locations for control points on a new (unlabelled) image are estimated using an iterative technique. First, a shape description for a new image is estimated (i.e. x,y control point locations are estimated), only allowing shape descriptions which are consistent with the shape model. In the Beymer reference, this could be any shape description. Then, a "shape-free texture" image is computed by warping the new image data according to the estimated shape model. The distance between this shape-free texture image and the texture model is used to determine a new estimate of shape. In the case of the Beymer reference, the new estimated shape is determined by unconstrained optical flow between the shape-free unlabelled image and the closest point in the texture manifold. The Lanitis reference uses a similar update mechanism with the added constraint that the new shape model must lie on the shape manifold. After iterating until some unspecified criterion is met, the last shape description can be used to describe control point locations on the input image.
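Two operations recur in the iteration described above: measuring the distance between a vector and a manifold model, and forcing an estimate to lie on a manifold. Both can be sketched in Python as a projection onto an affine manifold. The sketch is illustrative only and is not the procedure of either reference. It assumes the manifold is given by a mean vector and a set of orthonormal basis vectors, as a principal component analysis would typically supply; the name project_onto_manifold and its arguments are hypothetical.

```python
import math


def project_onto_manifold(x, mean, basis):
    """Project vector x onto the affine manifold mean + span(basis).

    Assumes the basis vectors are orthonormal. Returns the closest point
    on the manifold, its coefficients, and the distance from x to it.
    """
    centered = [xi - mi for xi, mi in zip(x, mean)]

    # Coefficient along each basis direction is a dot product.
    coeffs = [sum(bj * cj for bj, cj in zip(b, centered)) for b in basis]

    # Rebuild the closest manifold point from the mean and coefficients.
    closest = list(mean)
    for a, b in zip(coeffs, basis):
        for j, bj in enumerate(b):
            closest[j] += a * bj

    distance = math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, closest)))
    return closest, coeffs, distance
```

Applied to a shape-free texture vector, the returned distance plays the role of the distance to the texture model. Applied to a vector of control point coordinates, the returned closest point is a shape estimate constrained to the shape manifold.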
Shape-plus-texture methods give estimates for many control-point locations. They also provide well-defined example-based training methods and error criteria derived from that example-based training. However, the models which are derived for these approaches rely on estimates of unknown parameters: they need an estimate of shape in order to process the image data. Thus, they are forced to rely on iterative solutions. Furthermore, the shape and texture models do not explicitly take advantage of the coupling between shape and the image data. The models of admissible shapes are derived without regard to the image values, and the models of admissible textures are derived only after "normalizing out" the shape model.
When deriving models to allow estimates for unknown parameters, the coupling between observable parameters, such as image grayscale values, and the unknown parameters in the description should preferably be captured, rather than the independent descriptions of the unknown parameters and of the "normalized" known parameters. This is similar to the difference between "reconstructive" models (models that allow data to be reconstructed with minimum error) and "discriminative" models (models that allow unknown classification data to be estimated with minimum error).