In the early, preattentive stage of visual perception, an image may be divided into two components: (1) a structural part with distinguishable elements such as image borders and other noticeable features; and (2) a textural part without distinguishable elements. For example, in the image illustrated in FIG. 1, the structural parts are objects at near distance, such as tree trunks and branches, whose positions and shapes can be clearly perceived. In contrast, the textural parts are objects at far distance whose structures become indistinguishable and simply yield different texture impressions.
Currently, there are two prevailing mathematical theories for image modeling. One is a generative theory such as the wavelet/sparse coding theory that seeks to explain an image in terms of image bases. The other is a descriptive theory such as the Markov random field (MRF) theory that seeks to explain an image in terms of filters.
Specifically, the wavelet/sparse coding theory represents images by elements selected from a dictionary of image bases that may include wavelets and ridgelets, as described by Candes et al., “Ridgelets: a Key to Higher-Dimensional Intermittency?” Phil. Trans. R. Soc. Lond. A., 357:2495-509, 1999, and Chen et al., “Atomic Decomposition by Basis Pursuit,” SIAM Journal on Scientific Computing, 20:33-61, 1999, the contents of which are incorporated herein by reference.
For example, if I is an image defined on a lattice Λ, the wavelet/sparse coding theory assumes that I is the weighted sum of a number of image bases Bk indexed by k for its position, scale, orientation, and the like. Thus, a “generative model” may be obtained as defined by:
$$I = \sum_{k=1}^{K} c_k B_k + \epsilon, \qquad B_k \in \Delta_B, \qquad C = \{c_k\} \sim p(C) \qquad (1)$$

where Bk is selected from a dictionary ΔB, ck are the coefficients, p(C) is the super-Gaussian distribution of coefficients C, ˜p(C) means that C follows the distribution p(C), and ε is the residual error modeled by a Gaussian white noise. Thus, the wavelet/sparse coding theory assumes a linear additive model where the image I is the result of linear superposition of Bk, k=1, . . . K, plus the Gaussian white noise ε.
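By way of illustration only, the linear additive model of equation (1) may be sketched in Python as follows. The sketch selects bases by greedy matching pursuit, which is a simpler substitute for, and not the same as, the basis pursuit method of Chen et al. The one-dimensional eight-sample signal, the four-element orthonormal dictionary, and all function names are assumptions chosen for brevity, and the noise term ε is omitted.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def matching_pursuit(signal, dictionary, max_bases, tol=1e-9):
    """Greedily pick up to max_bases pairs (k, c_k) so that
    signal is approximately sum_k c_k * dictionary[k]."""
    residual = list(signal)
    selected = []
    for _ in range(max_bases):
        # Response of every base to the part of the signal not yet explained.
        responses = [dot(residual, base) for base in dictionary]
        k = max(range(len(dictionary)), key=lambda i: abs(responses[i]))
        c_k = responses[k]
        if abs(c_k) < tol:
            break  # nothing left that any base can explain
        selected.append((k, c_k))
        residual = [r - c_k * b for r, b in zip(residual, dictionary[k])]
    return selected, residual


def reconstruct(selected, dictionary, length):
    """Linear superposition of the selected bases, as in equation (1)."""
    out = [0.0] * length
    for k, c_k in selected:
        out = [x + c_k * b for x, b in zip(out, dictionary[k])]
    return out


# Hypothetical toy dictionary: four mutually orthogonal, unit-norm bases.
dictionary = [normalize(v) for v in (
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, 1, -1, -1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, -1, -1],
)]

# A signal built from K=2 bases with coefficients 3 and -2.
signal = [3.0 * a - 2.0 * b for a, b in zip(dictionary[0], dictionary[2])]

selected, residual = matching_pursuit(signal, dictionary, max_bases=2)
approximation = reconstruct(selected, dictionary, len(signal))
```

Because the toy bases are mutually orthogonal, two greedy steps recover the two generating bases and leave a negligible residual. A practical dictionary such as the one described for sparse coding is typically overcomplete, in which case greedy selection yields only an approximation.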
FIG. 2A illustrates a typical dictionary ΔB used for sparse coding that includes Gabor and Laplacian of Gaussian (LoG) bases which are well known in the art. FIG. 2B illustrates an input image to which sparse coding is applied. FIG. 2C illustrates the image reconstructed via sparse coding with K=300 bases selected from the dictionary ΔB illustrated in FIG. 2A. FIG. 2D is a symbolic representation of the input image where each base Bk is represented by a bar at the same location, with the same elongation and orientation. Isotropic LoG bases are represented by circles.
The example of FIGS. 2A-2D illustrates that the bases generally capture the image structures where the intensity contrasts are high. However, there are three obvious shortcomings of using a generative model such as wavelet/sparse coding alone. First, the object boundaries are blurred because each pixel in the image is explained by multiple bases instead of a single base.
Second, the textures are not well represented. One may continue to add more bases to code texture, but this will result in a non-sparse representation with a large K.
Third, the bases, as illustrated in FIG. 2D, do not line up very well in terms of spatial organization. A stronger model for regulating the spatial organization is needed to render more meaningful sketches.
The second theory for image modeling is the descriptive model that includes a FRAME (Filters, Random fields, And Maximum Entropy) model based on the MRF theory, described in further detail by Zhu et al., “Minimax Entropy Principle and Its Applications in Texture Modeling,” Neural Computation, 9(8), 1627-1660, 1997, the content of which is incorporated herein by reference. The MRF theory represents a visual pattern by pooling the responses of a bank of filters over space. The statistics of the responses define a so-called Julesz ensemble, which is described in further detail by Wu et al., “Equivalence of Julesz and Gibbs Ensembles,” Proc. of ICCV, Corfu, Greece, 1999, the content of which is incorporated herein by reference. The Julesz ensemble may be deemed to be a perceptual equivalence class where all images in the equivalence class share identical statistics and thus, render the same textural impression.
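By way of illustration only, the pooling of filter responses over space may be sketched in Python as follows. The sketch computes only marginal histograms of two difference filters and does not implement the FRAME model itself. The two-tap filters, the wrap-around border handling, the 4x4 integer images, and all function names are assumptions chosen for brevity.

```python
from collections import Counter


def filter_responses(image, dx, dy):
    """Responses of a two-tap difference filter at every pixel,
    with wrap-around borders (an assumption of this sketch)."""
    h, w = len(image), len(image[0])
    return [image[(y + dy) % h][(x + dx) % w] - image[y][x]
            for y in range(h) for x in range(w)]


def pooled_statistics(image):
    """One histogram of responses per filter, pooled over all positions."""
    filters = {"horizontal": (1, 0), "vertical": (0, 1)}
    return {name: Counter(filter_responses(image, dx, dy))
            for name, (dx, dy) in filters.items()}


def cyclic_shift(image, sx, sy):
    """Translate an image by (sx, sy) pixels with wrap-around."""
    h, w = len(image), len(image[0])
    return [[image[(y - sy) % h][(x - sx) % w] for x in range(w)]
            for y in range(h)]


# Hypothetical toy images.
texture = [
    [0, 1, 1, 0],
    [1, 0, 2, 1],
    [0, 2, 0, 1],
    [1, 1, 0, 2],
]
shifted = cyclic_shift(texture, 1, 2)                       # same pattern, moved
stripes = [[x % 2 for x in range(4)] for _ in range(4)]     # different pattern

same_class = pooled_statistics(texture) == pooled_statistics(shifted)
different_class = pooled_statistics(texture) != pooled_statistics(stripes)
```

In this sketch the translated image differs from the original pixel by pixel yet has identical pooled statistics, so the two fall in the same equivalence class, whereas the striped image has different statistics and does not.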
One problem with using the descriptive model alone is that it is ineffective in synthesizing sharp features, such as shapes and geometry (i.e., the sketchable features). Another drawback is its computational complexity when filters of large window sizes are selected for sketchable features.
Accordingly, what is desired is a generic image model that combines the advantages of the generative and descriptive models to effectively and efficiently represent both structures and texture in a seamless manner.