Image Based Regression (IBR) is an emerging challenge in vision processing. The problem of IBR is defined as follows: Given an image x, it is desired to infer an entity y(x) that is associated with the image x. The meaning of y(x) varies significantly in different applications. For example, it could be a feature characterizing the image (e.g., estimating human age), a parameter relating to the image (e.g., the position and anisotropic spread of a tumor), or other meaningful quantity (e.g., the location of an endocardial wall).
One known vision processing method uses support vector regression to infer a shape deformation vector. Another vision processing method uses relevance vector regression to estimate a three dimensional (3D) human pose from silhouettes. However, in both of these methods, the inputs to the regressors are not the images themselves, but rather pre-processed entities, e.g., landmark locations and shape context descriptors.
Many machine learning approaches have been proposed to address regression problems in general. Data-driven approaches in particular have gained prevalence. Examples of such approaches include nonparametric kernel regression (NPR), linear methods and their nonlinear kernel variants such as kernel ridge regression (KRR), and support vector regression (SVR). However, these methods are often difficult or inefficient to apply directly to vision problems due to a number of challenges. One challenge is referred to as the curse of dimensionality. The input (i.e., image data) is high-dimensional. Ideally, in order to represent the sample space well, the number of image samples should be exponential in the dimensionality of the input space. However, in practice, the training samples are often extremely sparse relative to the size of the input space. Overfitting is likely to happen without careful treatment.
Another challenge is varying appearance present in the image. First, there are a lot of factors that affect the appearance of the foreground object of interest. Apart from the intrinsic differences among the objects, extrinsic factors include camera system, imaging geometry, lighting conditions, makeup, etc. Second, the variation arises from the presence of background whose appearance varies too. A third variation is caused by alignment. The regression technique must either tolerate the alignment error or regress out the alignment parameter in order to work effectively.
Multiple outputs are also a challenge because the output variable is also high-dimensional. Most regression approaches, such as SVR, can deal with the single-output regression problem very robustly. Extending them to the multiple-output setting is non-trivial, as in the case of SVR. A naïve practice of decoupling a multiple-output problem into several isolated single-output tasks ignores the statistical dependence among different dimensions of the output variable.
Storage and computation are also issues to consider. Regression techniques such as nonparametric kernel regression (NPR), kernel ridge regression (KRR) and support vector regression (SVR) are data-driven, which has two main disadvantages. First, the techniques require storing a large amount of training data. In NPR and KRR, all training data are stored. In SVR, the support vectors are stored. Because the training data are high-dimensional images, storing them can take a lot of memory. Second, evaluating the data-driven regression function is slow because comparing the input image with the stored training images is time-consuming.
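By way of a rough, non-limiting illustration, the storage and computation burden described above can be estimated with simple arithmetic. The sketch below is not part of any described method; the function name `data_driven_footprint`, the image size of 64×64 pixels, the count of 10,000 stored images and the 4 bytes per pixel are all hypothetical values chosen only to make the scaling visible.

```python
def data_driven_footprint(num_stored, height, width, bytes_per_pixel=4):
    """Estimate memory and per-query work for a regressor that keeps images.

    num_stored is the number of stored training images (all samples for
    NPR and KRR, only the support vectors for SVR). It is assumed, for
    illustration, that one kernel evaluation touches every pixel once.
    """
    pixels = height * width
    storage_bytes = num_stored * pixels * bytes_per_pixel
    pixel_ops_per_query = num_stored * pixels
    return storage_bytes, pixel_ops_per_query


# Hypothetical setting: 10,000 stored images of 64 x 64 pixels, 4 bytes each.
storage_bytes, pixel_ops = data_driven_footprint(
    num_stored=10_000, height=64, width=64
)
print(f"storage: {storage_bytes / 2**20:.2f} MiB")       # storage: 156.25 MiB
print(f"pixel comparisons per query: {pixel_ops:,}")     # 40,960,000
```

Both quantities grow linearly with the number of stored images and with the number of pixels per image, which is why the cost becomes significant for large training sets of high-resolution images.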
In general, regression finds the solution to the following minimization problem:
$$\hat{g}(x) = \arg\min_{g \in \mathcal{G}} \mathcal{E}_{p(x,y)}\{L(y(x), g(x))\}, \qquad (1)$$
where $\mathcal{G}$ is the set of allowed output functions, $\mathcal{E}_{p(x,y)}$ takes the expectation under the generating distribution p(x,y), and the loss function $L(\circ,\circ)$ penalizes the deviation of the regressor output g(x) from the true output y(x).
In practice, it is impossible to compute the expectation since the distribution p(x,y) is unknown. Given a set of training examples $\{(x_n, y(x_n))\}_{n=1}^{N}$, the cost function $\mathcal{E}_{p(x,y)}\{L(y(x), g(x))\}$ is approximated by the training error
$$J(g) = \sum_{n=1}^{N} L(y(x_n), g(x_n)) / N.$$
As the number of samples N tends to infinity, the above approximation becomes exact by the law of large numbers. Unfortunately, a practical value of N is never large enough, especially when dealing with image data and high-dimensional output parameters. A more severe problem is overfitting: given a limited number of training examples, it is easy to construct a function g(x) that yields zero training error. To combat overfitting, additional regularization constraints are often used, which results in a combined cost function (ignoring the scaling factor $N^{-1}$)
$$J(g) = \sum_{n=1}^{N} L(y(x_n), g(x_n)) + \lambda R(g), \qquad (2)$$
where λ>0 is the regularization coefficient that controls the degree of regularization and R(g) is the regularization term. Regularization often imposes certain smoothness on the output function or reflects some prior belief about the output.
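As a minimal sketch of how the training error and the regularized cost of equation (2) might be evaluated, the following example assumes a squared L2 loss for L and a squared-norm (ridge) penalty on a weight vector for R(g). Both choices, as well as the function names `training_error` and `regularized_cost` and the toy data, are assumptions made for illustration and are not prescribed by the text.

```python
import numpy as np


def training_error(Y, G):
    """Training error J(g) = sum_n L(y(x_n), g(x_n)) / N with a squared L2 loss.

    Y and G have shape (N, q): true outputs and regressor outputs.
    """
    return float(np.mean(np.sum((Y - G) ** 2, axis=1)))


def regularized_cost(Y, G, weights, lam):
    """Combined cost of equation (2), with the scaling factor 1/N dropped.

    Assumes R(g) is the squared norm of the model weights (illustrative).
    """
    data_term = float(np.sum((Y - G) ** 2))
    return data_term + lam * float(np.sum(weights ** 2))


# Toy data: N = 3 samples, q = 2 output dimensions (hypothetical values).
Y = np.array([[1.0, 2.0], [0.0, 1.0], [2.0, 2.0]])
G = np.array([[1.0, 1.0], [0.0, 1.0], [1.0, 2.0]])
weights = np.array([0.5, -0.5])

J_train = training_error(Y, G)                     # (1 + 0 + 1) / 3 = 2/3
J_reg = regularized_cost(Y, G, weights, lam=2.0)   # 2.0 + 2.0 * 0.5 = 3.0
print(J_train, J_reg)
```

A larger λ makes the penalty term dominate, so that a function with small weights may be preferred even when its data term is not the smallest achievable.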
NPR is a smoothed version of k-nearest neighbor (kNN) regression. The kNN regressor approximates the conditional mean, an optimal estimate in the L2 sense. NPR takes the following form:
$$g(x) = \frac{\sum_{n=1}^{N} h_\sigma(x; x_n)\, y(x_n)}{\sum_{n=1}^{N} h_\sigma(x; x_n)}, \qquad (3)$$
where $h_\sigma(\circ; x_n)$ is a kernel function. The most widely used kernel function is the RBF kernel
$$h_\sigma(x; x_n) = \mathrm{rbf}_\sigma(x; x_n) = \exp\left(-\frac{\|x - x_n\|^2}{2\sigma^2}\right). \qquad (4)$$
The RBF kernel has a noncompact support. Other kernel functions with compact supports such as the Epanechnikov kernel can be used too. In general, when confronted with the scenario of image based regression, NPR, albeit smooth, tends to overfit the data, i.e., yielding a low bias and a high variance.
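The NPR estimate of equation (3) with the RBF kernel of equation (4) may be sketched as follows. The function names `rbf_kernel` and `npr_predict`, the one-dimensional toy inputs and the bandwidth values are hypothetical; in image based regression each row of `X_train` would instead be a vectorized image.

```python
import numpy as np


def rbf_kernel(x, X_train, sigma):
    """RBF kernel of equation (4) between one input x (d,) and each row of X_train (N, d)."""
    sq_dist = np.sum((X_train - x) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def npr_predict(x, X_train, Y_train, sigma):
    """NPR of equation (3): kernel-weighted average of the training outputs."""
    h = rbf_kernel(x, X_train, sigma)      # shape (N,)
    return h @ Y_train / np.sum(h)         # shape (q,)


# Toy data: N = 3 scalar inputs, q = 2 output dimensions (hypothetical values).
X_train = np.array([[0.0], [1.0], [2.0]])
Y_train = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]])

# At x = 1 the two outer samples receive equal weight, so the estimate is [1, 2].
y_mid = npr_predict(np.array([1.0]), X_train, Y_train, sigma=0.5)

# With a very narrow kernel the estimate at a training input collapses onto
# that sample's own output, which illustrates the low-bias, high-variance behavior.
y_narrow = npr_predict(np.array([0.0]), X_train, Y_train, sigma=0.05)
print(y_mid, y_narrow)
```

Note that every prediction touches all N stored samples, and that for an input far from all training samples the denominator may underflow to zero; a practical implementation would need to guard against that case.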
KRR assumes that the multiple output regression function takes a linear form:
$$g(x) = \sum_{n=1}^{N} \alpha_n k(x; x_n), \qquad (5)$$
where $k(x; x_n)$ is a reproducing kernel function and $\alpha_n$ is a q×1 vector that weights the kernel function. The choices for the reproducing kernel include the RBF kernel, the polynomial kernel and so on. The solution to the multiple output KRR derived from the training data is
$$g(x) = Y(K + \lambda I)^{-1}\kappa(x), \qquad (6)$$
where $Y_{q \times N} = [y(x_1), y(x_2), \ldots, y(x_N)]$ is the training output matrix, $K_{N \times N} = [k(x_i; x_j)]$ is the Gram matrix for the training data, and $\kappa(x)_{N \times 1} = [k(x; x_1), k(x; x_2), \ldots, k(x; x_N)]^T$.
In general, when a linear kernel is used, KRR tends to underfit the data, i.e., it yields a high bias and a low variance, because it uses a simple linear form. Using a nonlinear kernel function often gives enhanced performance. One computational difficulty of KRR lies in inverting the N×N matrix K+λI.
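The closed-form multiple output KRR solution of equation (6) may be sketched as below, assuming an RBF kernel. The names `rbf_gram`, `krr_fit` and `krr_predict` and the toy data are hypothetical. As one possible design choice, the sketch solves a linear system rather than forming the inverse of K+λI explicitly; the cost of this step still grows roughly cubically with N.

```python
import numpy as np


def rbf_gram(A, B, sigma):
    """RBF kernel values k(a_i; b_j) for rows of A (n, d) and B (m, d)."""
    sq = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))


def krr_fit(X_train, Y_train, lam, sigma):
    """Return Y (K + lam I)^{-1} of shape (q, N); column n is alpha_n in equation (5).

    Y_train is stored row-wise with shape (N, q), i.e., the transpose of Y.
    """
    K = rbf_gram(X_train, X_train, sigma)
    N = K.shape[0]
    # K + lam I is symmetric, so solving (K + lam I) Z = Y_train gives
    # Z^T = Y (K + lam I)^{-1}.
    Z = np.linalg.solve(K + lam * np.eye(N), Y_train)
    return Z.T


def krr_predict(x, X_train, alpha, sigma):
    """Evaluate g(x) = Y (K + lam I)^{-1} kappa(x) for one input x of shape (d,)."""
    kappa = rbf_gram(X_train, x[None, :], sigma)[:, 0]   # shape (N,)
    return alpha @ kappa                                 # shape (q,)


# Toy data: N = 3 scalar inputs, q = 2 output dimensions (hypothetical values).
X_train = np.array([[0.0], [1.0], [2.0]])
Y_train = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]])

alpha = krr_fit(X_train, Y_train, lam=1e-10, sigma=0.5)
# With a nearly vanishing lam the fit reproduces the training output at x = 1.
y_hat = krr_predict(np.array([1.0]), X_train, alpha, sigma=0.5)
print(y_hat)
```

All q output dimensions share the same factorization of K+λI, so the multiple output case costs little more than the single output case once the system is solved.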
SVR is a robust regression method. Its current formulation works for single output data, i.e., q=1. SVR minimizes the following cost function
$$\frac{1}{2}\|w\|^2 + C\sum_{n=1}^{N} |y(x_n) - g(x_n)|_\varepsilon, \qquad (7)$$
where $|\circ|_\varepsilon$ is an ε-insensitive function,
$$g(x) = \sum_{n=1}^{N} w_n k(x; x_n),$$
with $k(x; x_n)$ being a reproducing kernel function and $w_n$ being its weight, and $w = [w_1, w_2, \ldots, w_N]^T$. Some of the coefficients $w_n$, which can be found through a quadratic programming procedure, are zero-valued; the samples $x_n$ associated with nonzero weights are called support vectors.
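To make the ε-insensitive function and the cost of equation (7) concrete, the sketch below evaluates them for a given weight vector. It does not perform the quadratic programming step that finds the weights. The names `eps_insensitive` and `svr_cost`, the identity Gram matrix and the toy values are hypothetical, and the regularizer is taken as the squared norm of the coefficient vector w, following the definition of w given above.

```python
import numpy as np


def eps_insensitive(r, eps):
    """|r|_eps = max(0, |r| - eps): residuals inside the eps-tube cost nothing."""
    return np.maximum(0.0, np.abs(r) - eps)


def svr_cost(w, K, y, C, eps):
    """Cost of equation (7) for the single-output g(x_n) = sum_m w_m k(x_n; x_m).

    K is the (N, N) Gram matrix, w the (N,) weight vector, y the (N,) targets.
    """
    g = K @ w
    return 0.5 * float(w @ w) + C * float(np.sum(eps_insensitive(y - g, eps)))


# Toy setting (hypothetical): an identity Gram matrix, so that g(x_n) = w_n.
K = np.eye(3)
w = np.array([1.0, 0.0, -1.0])
y = np.array([1.05, 0.5, -1.0])

# Residuals are about 0.05, 0.5 and 0.0; only the second one exceeds eps = 0.1.
cost = svr_cost(w, K, y, C=2.0, eps=0.1)   # 0.5 * 2 + 2 * 0.4 = 1.8
# Samples with nonzero weight are the support vectors.
support = np.flatnonzero(w)                # indices 0 and 2
print(cost, support)
```

Because only the samples indexed by `support` contribute to g(x), only those samples need to be retained at prediction time.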
SVR strikes a good balance between bias and variance and is hence very robust. Unfortunately, directly applying SVR to the multiple output regression problem is difficult. There is a need for a regressor, learned using boosting, that is able to target a multiple output setting.