1. Field of the Invention
This invention relates to computer vision and more specifically to the efficient computation of the low-dimensional linear subspaces that optimally contain the set of images that are generated by varying the illumination impinging on the surface of a three-dimensional object, such as a human head, for many different relative positions of that object and the viewing camera.
2. Prior Art
In a typical system for object recognition under varying viewpoint and illumination conditions, information about properties, such as the three-dimensional shape and the reflectance, of a plurality of objects 100, such as human faces and/or heads, is stored in a database 101. When such a system is in use, typically a plurality of queries 102, in the form of images of objects taken from non-fixed viewpoint and illumination conditions, is matched against said database 101. For each query image 103, matching 104 is performed against all objects 100 in said database 101.
In order to match an individual object 105 against an individual query 103, several steps are typical. First, the viewpoint of the query is estimated 106. Second, a viewpoint-dependent illumination subspace 107 is generated, as outlined below, from: the three-dimensional data about each said object 100, and the determined viewpoint 106. Further, either the illumination condition is estimated, or a similarity score is generated, or both 108.
The scores from the matching of said query 103 to said plurality of objects 100 are finally compared against each other to determine the best match 109 resulting in object 105 being recognized.
The set of images of a given object under all possible illuminations, but a fixed viewpoint, will be called the illumination subspace of the object for that viewpoint. It has been both observed in practice and argued in theory that said illumination subspace is mostly enclosed in a subspace of low dimensionality M. See P. Hallinan, “A Low-Dimensional Representation of Human Faces for Arbitrary Lighting Conditions”, IEEE Conference on Computer Vision and Pattern Recognition, pp. 995-999 (1994); and R. Basri et al., “Lambertian Reflectance and Linear Subspaces”, to appear in the International Conference on Computer Vision (July 2001).
There are several object, and more specifically face, recognition algorithms that are based implicitly or explicitly on this fact. See Georghiades et al., “From Few to Many: Generative Models for Recognition Under Variable Pose and Illumination”, Proceedings of the 4th International Conference on Automatic Face & Gesture Recognition, pp. 264-270, (March 2000); R. Ishiyama et al., “A New Face-Recognition System with Robustness Against Illumination Changes”, IAPR Workshop on Machine Vision Applications, pp. 127-131 (November 2000); and Basri, et al. supra.
Estimation of a Gaussian Probability Density by the Karhunen-Loève Transform
The standard method for finding a hierarchy of low-dimensional subspaces that optimally contain a given set of images, or more generally, snapshots, is the Karhunen-Loève Transform (KLT), which is known under a variety of names, including Principal Component Analysis, Empirical Eigenfunctions, and Singular Value Decomposition (SVD), and which is closely related to Linear Regression and Linear Factor Analysis, among others. See M. Tipping et al., “Mixtures of Probabilistic Principal Component Analysers”, Neural Computation, Vol. 11, No. 2, pp. 443-482 (February 1999).
The basic facts about the KLT are summarized in this section. Of particular importance for the disclosure that follows are the implementation choices, the optimality properties, and the requirements for computational time and storage space. Frequent references to the equations herein will also be made below.
A snapshot, such as an image, will be represented by the intensity values φ^t(x), where {x} is a pixel grid that contains V pixels. An ensemble of T snapshots will be denoted by {φ^t(x)}_{t∈T}. Briefly (see I. Joliffe, “Principal Component Analysis” (1986); and P. Penev, “Local Feature Analysis: A Statistical Theory for Information Representation and Transmission”, PhD Thesis, The Rockefeller University (1998)), its KLT representation is

φ^t(x) = Σ_{r=1}^{M} a_r^t σ_r ψ_r(x)  (1)

where M=min(T,V) is the rank of the ensemble, {σ_r²} is the (non-increasing) eigenspectrum of the “spatial”

R(x,y) ≜ (1/T) Σ_t φ^t(x) φ^t(y) = Σ_{r=1}^{M} ψ_r(x) σ_r² ψ_r(y)  (2)

and the “temporal”

C_{tt′} ≜ (1/V) Σ_x φ^t(x) φ^{t′}(x) = Σ_{r=1}^{M} a_r^t σ_r² a_r^{t′}  (3)

covariance matrices, and {ψ_r(x)} and {a_r^t} are their respective orthonormal eigenvectors. When M=T&lt;V, the diagonalization of C (eqn. 3) is the easier of the two.
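For illustration, the decomposition of eqns. 1-3 may be sketched numerically. The following is a minimal sketch assuming NumPy; the function and variable names are illustrative, and NumPy's SVD returns unnormalized singular values rather than the σ_r of the 1/T- and 1/V-normalized covariances above.

```python
import numpy as np

def klt(snapshots):
    """KLT of an ensemble of T snapshots with V pixels each (cf. eqns. 1-3).

    snapshots: array of shape (T, V); row t is the vectorized image phi^t(x).
    Returns (sigma, psi, a): singular values, spatial eigenmodes psi_r(x) of
    shape (M, V), and temporal coefficients a_r^t of shape (M, T).
    """
    # Rather than forming and diagonalizing the smaller of C (eqn. 3) or
    # R (eqn. 2) explicitly, the SVD performs the equivalent factorization
    # phi = A diag(sigma) Psi directly (eqn. 1, up to normalization).
    A, sigma, Psi = np.linalg.svd(snapshots, full_matrices=False)
    return sigma, Psi, A.T

# Example: a rank-deficient ensemble is captured exactly by rank-many modes.
rng = np.random.default_rng(0)
basis = rng.standard_normal((3, 100))         # 3 hidden spatial modes
coeffs = rng.standard_normal((20, 3))         # T = 20 snapshots
phi = coeffs @ basis                          # (T, V) ensemble of rank 3
sigma, psi, a = klt(phi)
recon = (a * sigma[:, None]).T @ psi          # eqn. 1 with all M modes
assert np.allclose(recon, phi)                # exact full-rank reconstruction
assert np.allclose(sigma[3:], 0, atol=1e-10)  # only 3 nonzero singular values
# Parseval: total signal power equals the sum of squared singular values,
# the unnormalized counterpart of eqn. 4 below.
assert np.allclose((sigma ** 2).sum(), (phi ** 2).sum())
```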
Notably, the storage of C requires O(T²) storage elements, and that of R, O(V²); the dependence of the storage requirements on the size of the problem is quadratic. Analogously, the time to compute the eigenvalues and eigenvectors (eqn. 2) of C is O(T³), and of R, O(V³); the dependence of the computational time on the size of the problem is cubic. In practical terms, this means that solving a system that is ten times as large requires a hundred times the space and a thousand times the computational power.
The average signal power of the ensemble is

(1/(T V)) Σ_{x,t} |φ^t(x)|² = tr R ≡ tr R_M ≜ Σ_{r=1}^{M} σ_r²  (4)
KLT is optimal in the sense that, among all N-dimensional subspaces (N&lt;M), the subset of eigenmodes {ψ_r}_{r=1}^{N} (eqn. 2) spans the subspace which captures the most signal power, tr R_N. See M. Loeve, “Probability Theory” (1955); and I. Joliffe, supra. For a given dimensionality N, the reconstruction of the snapshot φ^t(x) is

φ_N^t(x) ≜ Σ_{r=1}^{N} a_r^t σ_r ψ_r(x)  (5)
With the standard multidimensional Gaussian model for the probability density P[φ], the information content of the reconstruction (eqn. 5) is

−log P[φ_N^t] ∝ Σ_{r=1}^{N} |a_r^t|²  (6)
Notably, this model is spherical: the KLT coefficients (eqn. 1) are of unit variance (eqn. 3), ⟨(a_r^t)²⟩ ≡ 1, and each of the N dimensions contributes equally to the information that is derived from the measurement, although only the leading dimensions contribute significantly to the signal power.
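The truncated reconstruction of eqn. 5 and its captured power can be sketched as follows (a minimal illustration assuming NumPy; names are illustrative, and the SVD coefficients differ from the unit-variance a_r^t of eqn. 6 only by a normalization convention):

```python
import numpy as np

rng = np.random.default_rng(1)
phi = rng.standard_normal((50, 64))      # T = 50 snapshots, V = 64 pixels

# KLT via SVD: phi = A diag(sigma) Psi, rows of Psi being the eigenmodes.
A, sigma, Psi = np.linalg.svd(phi, full_matrices=False)

N = 10                                   # application-dependent cutoff
# eqn. 5: rank-N reconstruction from the N leading eigenmodes.
phi_N = A[:, :N] @ np.diag(sigma[:N]) @ Psi[:N]

# By KLT optimality, the captured power tr R_N is the sum of the N leading
# eigenvalues, and no other N-dimensional subspace captures more (eqn. 4).
assert np.allclose((phi_N ** 2).sum(), (sigma[:N] ** 2).sum())

# eqn. 6: under the spherical Gaussian model, the information content of
# each reconstruction is proportional to its summed squared coefficients.
info = (A[:, :N] ** 2).sum(axis=1)       # one value per snapshot t
assert info.shape == (50,)
```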
This is a manifestation of the fact that, in principle, even weak signals can be important if they are sufficiently rare and unexpected. In practice, however, signals are embedded in noise, which typically has constant power; and weak signals, even though important, may not be reliably detected.
The situation is complicated further by the fact that in practice every estimation is done from a finite sample (T&lt;∞, V&lt;∞). Nevertheless, the shape of the eigenspectrum of sample-noise covariance matrices is known; it is determined by the ratio V/T (see J. Silverstein, “Eigenvalues and Eigenvectors of Large-Dimensional Sample Covariance Matrices”, Contemporary Mathematics, Vol. 50, pp. 153-159 (1986); and A. Sengupta et al., “Distributions of Singular Values for Some Random Matrices”, Physical Review E, Vol. 60, No. 3, pp. 3389-3392 (September 1999)), and this knowledge can be used to recover the true spectrum of the signal through Bayesian estimation. See R. Everson et al., “Inferring the Eigenvalues of Covariance Matrices from Limited, Noisy Data”, IEEE Transactions on Signal Processing, Vol. 48, No. 7, pp. 2083-2091 (2000). Although this can serve as a basis for a principled choice of the dimensionality N, also called model selection, in the context of face recognition this choice is typically guided by heuristic arguments. See P. Penev et al., “The Global Dimensionality of Face Space”, Proceedings of the 4th International Conference on Automatic Face & Gesture Recognition, pp. 264-270 (March 2000).
Viewpoint-Dependent Subspaces for Face Recognition
At the heart of most high-level algorithms for computer vision is the question: How does an object look from a specific viewpoint and a specific illumination condition?
Here we describe the standard method for finding the illumination subspace 107: a small number of basis images 110, {ψ_r(x)} (eqn. 2), which can be linearly admixed (eqn. 5) to approximate substantially all images of a given object under a fixed viewpoint, but any illumination condition.
An illumination condition at a given point, x, on the surface of an object is defined by specifying the intensity of the light, L(x,n), that comes to the point from the direction n∈S², where n is a unit vector on the unit sphere S² centered at that point. Typically, the assumption is made that the light sources are sufficiently far away from the object (the distance to them is much larger than the size of the object) and therefore all points on the surface see the same illumination, L(n).
In order to calculate the illumination subspace, it is customary to first generate a finite set of T illumination conditions {L_t(n)}_{t∈T}, and from them, a corresponding set of images {I_t(x)}_{t∈T} of the object under a fixed viewpoint. When I_t(x) is identified with φ^t from (eqn. 2), the illumination subspace hierarchy is determined by (eqn. 3), and an application-dependent cutoff, N (cf. eqn. 5), is chosen.
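The procedure just described can be sketched as follows (a minimal illustration assuming NumPy; the function name, the power-fraction rule for choosing the cutoff N, and the stand-in images are illustrative assumptions, not prescribed by the method):

```python
import numpy as np

def illumination_subspace(images, power=0.99):
    """Basis images 110 from T sampled illumination conditions.

    images: (T, V) array; row t is I_t(x), identified with phi^t (eqn. 2).
    power:  fraction of the signal power (eqn. 4) the cutoff must capture
            (one possible application-dependent rule for N, cf. eqn. 5).
    Returns (Psi_N, N): the N leading eigenmodes psi_r and the cutoff.
    """
    _, sigma, Psi = np.linalg.svd(images, full_matrices=False)
    cum = np.cumsum(sigma ** 2) / np.sum(sigma ** 2)
    N = int(np.searchsorted(cum, power) + 1)   # smallest N capturing `power`
    return Psi[:N], N

# Stand-in for T = 30 rendered images of V = 400 pixels whose illumination
# variability is dominated by a few modes (geometrically decaying spectrum).
rng = np.random.default_rng(3)
modes = rng.standard_normal((30, 400))
coeffs = rng.standard_normal((30, 30)) * (0.5 ** np.arange(30))
images = coeffs @ modes
basis, N = illumination_subspace(images)
assert 1 <= N < 30                      # a few modes capture 99% of the power
assert np.allclose(basis @ basis.T, np.eye(N), atol=1e-8)   # orthonormal basis
```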
There are two general ways to generate an image It(x) for a given illumination condition: to make an actual photograph of the physical object under the given illumination condition, or to use computer graphics techniques to render a 3D model of the object. See A. Georghiades, supra.; and R. Ishiyama et al., supra.
Although the first method is more accurate for any given picture, it is very labor-intensive, and the number of pictures that can be taken is necessarily small. Thus, the practical way to attempt the sampling of the space of illumination conditions is to use synthetic images obtained by computer rendering of a 3D model of the object.
When doing computer rendering, a very popular assumption is that the reflectance of the surface obeys the Lambertian reflectance model (see J. Lambert, “Photometria Sive de Mensura et Gradibus Luminus, Colorum et Umbrae” (1760)): the intensity of the reflected light I(x) at the image point x from a point light source L, which illuminates x without obstruction, is assumed to be

I(x) = α(x) L•p(x)  (7)

where α(x) is the intrinsic reflectance, or albedo, p(x) is the surface normal at the grid point x, and ‘•’ denotes the dot product of the two vectors.
Since the Lambertian model (eqn. 7) is linear, the sampling of the space of illumination conditions can be chosen to be simple: each I_t is rendered with a single point-light source illuminating the object from a pre-specified direction n_t:

L_t(n) = l_t δ(n−n_t)  (8)

where l_t is the intensity of the light that comes from this direction. Moreover, the typical assumption is that light from any direction is equally probable, and l_t is taken to be a constant, typically unity.
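A rendering step per eqns. 7-8 can be sketched as follows (a minimal illustration assuming NumPy; the function name and toy surface are illustrative, and the clamping of negative intensities to zero for points facing away from the light is a common refinement beyond the bare linear model of eqn. 7):

```python
import numpy as np

def render_lambertian(albedo, normals, light_dir, intensity=1.0):
    """Render one image under a single distant point source (eqns. 7-8).

    albedo:    alpha(x), shape (V,)
    normals:   surface normals p(x), shape (V, 3), unit length
    light_dir: source direction n_t, shape (3,), unit length
    Returns I(x) = alpha(x) * l_t * (n_t . p(x)), clamped at zero where the
    surface faces away from the light (attached shadow).
    """
    return albedo * np.maximum(intensity * (normals @ light_dir), 0.0)

# Toy surface: V = 100 points with random albedos and unit normals.
rng = np.random.default_rng(4)
albedo = rng.uniform(0.2, 1.0, 100)
normals = rng.standard_normal((100, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)

img = render_lambertian(albedo, normals, np.array([0.0, 0.0, 1.0]))
assert img.shape == (100,)
assert np.all(img >= 0)
# Points with p(x) . n_t <= 0 receive no light from a single source (eqn. 8).
assert np.all(img[normals[:, 2] <= 0] == 0)
```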
Sufficient sampling of the illumination space is expensive. Currently, the space of images of a given object under varying illumination conditions but a fixed pose (camera geometry) is sampled by sampling the illumination conditions. These samples are used to estimate the full space. The current approach suffers from the following problems:
sampling is expensive, whether by varying the light physically and taking pictures or by rendering a 3-dimensional model;
therefore, either A) dense sampling is not performed, leading to an inaccurate model, or B) dense sampling is performed, which makes the system unacceptably slow.
Therefore, currently the subspace approach cannot be practically applied to certain face recognition tasks.
Dense sampling of the illumination space consumes impractically large amounts of time and memory. In order to make the model of the subspace more accurate, denser and denser sampling of the illumination conditions is necessary. This entails quadratic growth of the necessary storage space and cubic growth in the time of finding the enclosing linear subspace. As the accuracy of the model increases, the complexity becomes limited by the number of pixels in the image; this complexity is impractically large.
Direct computation of the covariance matrix is expensive. In the regime where the complexity of the model is limited by the number of pixels in the images, an amount of computation quadratic in the number of pixels is necessary to calculate the elements of the covariance matrix. This is only practical if the individual computation is fast.
The computations for a given 3D object need to be done for arbitrary viewpoint. In typical situations, many 3D objects need to be matched to a given query, for multiple queries. The viewpoints for which the respective enclosing linear subspaces need to be calculated are not known in advance, and change with every query/model pair. When the time for finding an illumination subspace for a given viewpoint is large, practical application of the method is impossible.