Sparse representation is a widely used technique for a broad range of signal and image processing applications. For instance, sparse representations may be used to label objects in images, identifying and classifying the objects present. Given a signal s and a dictionary matrix D, sparse coding is the inverse problem of finding the sparse representation x, with only a few non-zero entries, such that Dx≈s. Arriving at such representations is non-trivial and computationally demanding.
Different approaches to sparse coding have been proposed. However, most sparse coding algorithms optimize a functional consisting of a data fidelity term and a sparsity-inducing penalty of the form
$$\arg\min_{x} \; \frac{1}{2} \left\| D x - s \right\|_2^2 + \lambda \, R(x) \qquad (1)$$
or constrained forms such as
$$\arg\min_{x} \; R(x) \quad \text{such that} \quad \left\| D x - s \right\|_2 \leq \varepsilon \qquad (2)$$

or

$$\arg\min_{x} \; \left\| D x - s \right\|_2 \quad \text{such that} \quad R(x) \leq \tau \qquad (3)$$
where D is a dictionary matrix, x is the sparse representation, λ is a regularization parameter, ε is a reconstruction error threshold, τ is a sparsity threshold, and R(•) denotes a sparsity-inducing function such as the l1 norm or the l0 "norm." While the l0 norm does not conform to all of the requirements of a real norm, it is convenient to write it using norm notation. The regularization parameter λ controls the relative importance of the data fidelity term, which penalizes a solution that does not match the data, and the regularization term, which penalizes a solution that is not sparse (i.e., one that has too many non-zero entries). When applied to images, this decomposition is usually applied independently to a set of overlapping image patches covering the image. This approach is convenient, but often necessitates somewhat ad hoc subsequent handling of the overlap between patches, and the resulting representation over the whole image is suboptimal.
The two leading families of sparse coding methods are: (1) a wide variety of convex optimization algorithms (e.g., the Alternating Direction Method of Multipliers (ADMM)) for solving Eq. (1) when R(x)=∥x∥1; and (2) a family of greedy algorithms (e.g., Matching Pursuit (MP) and Orthogonal Matching Pursuit (OMP)) for providing an approximate solution for Eq. (2) or Eq. (3) when R(x)=∥x∥0.
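To make the greedy family concrete, the following is a minimal sketch of Orthogonal Matching Pursuit, which alternates a greedy atom selection with a least-squares re-fit on the selected support; it is an illustration of the general technique, not any specific referenced implementation:

```python
import numpy as np

def omp(D, s, k):
    """Greedy Orthogonal Matching Pursuit: approximately solve Eq. (2)/(3)
    with R(x) = ||x||_0 by selecting at most k dictionary atoms."""
    residual = s.astype(float).copy()
    support = []
    x = np.zeros(D.shape[1])
    for _ in range(k):
        # Greedy step: pick the atom most correlated with the residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        # Orthogonal step: least-squares re-fit on the current support.
        coef, *_ = np.linalg.lstsq(D[:, support], s, rcond=None)
        x[:] = 0.0
        x[support] = coef
        residual = s - D @ x
    return x
```

With an orthonormal dictionary (e.g., the identity), OMP recovers a k-sparse signal exactly in k iterations; for general dictionaries the solution is only approximate.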
If the dictionary D is analytically defined and corresponds to a linear operator with a fast transform (e.g., the Discrete Wavelet Transform), a representation for an entire signal or image can readily be computed. More recently, however, it has been realized that improved performance can be obtained by learning the dictionary from a set of training data relevant to a specific problem. This inverse problem is known as “dictionary learning.” In this case, computing a sparse representation for an entire signal is not feasible, with the usual approach being to apply the decomposition independently to a set of overlapping blocks covering the signal, as illustrated in independent sparse representation 100 of FIG. 1. This approach is relatively straightforward to implement, but results in a representation that is multi-valued and suboptimal over the signal as a whole, often necessitating somewhat ad hoc handling of the overlap between blocks.
An optimal representation for a given block structure would involve solving a single optimization problem for a block matrix dictionary D̃ with blocks consisting of appropriately shifted versions of the single-block dictionary D, as illustrated in jointly optimized sparse representation 200 of FIG. 2. If the block structure is modified such that the step between blocks is a single sample, as in equivalence 300 of FIG. 3, the representation on the resulting block dictionary D̃ is both optimal for the entire signal and shift-invariant. Furthermore, in this case, the linear representation on the block dictionary D̃ can be expressed as a sum of convolutions of columns dm of D with a set of spatial coefficient maps corresponding to the subsets of x associated with different shifts of each column of D. In other words, by a straightforward change of indexing of coefficients, the representation D̃x ≈ s is equivalent to Σm dm * xm ≈ s. This form of sparse representation is referred to here as a "convolutional" sparse representation.
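The equivalence between the block-dictionary product and a sum of convolutions can be checked numerically for a single filter: stacking all circular shifts of a zero-padded filter as the columns of D̃ makes D̃x equal to the circular convolution of the filter with the coefficient map. This is a small sketch under a circular boundary assumption, with an arbitrary illustrative filter:

```python
import numpy as np

N = 8
d = np.array([1.0, -2.0, 0.5])           # a single dictionary filter
dpad = np.zeros(N); dpad[:d.size] = d    # zero-pad the filter to signal length

# Block dictionary for one filter: columns are all N circular shifts of d.
Dtilde = np.column_stack([np.roll(dpad, k) for k in range(N)])

x = np.random.default_rng(0).standard_normal(N)  # coefficient map
lhs = Dtilde @ x                                  # matrix representation
# Circular convolution d * x, evaluated via the DFT convolution theorem.
rhs = np.real(np.fft.ifft(np.fft.fft(dpad) * np.fft.fft(x)))
```

Here `lhs` and `rhs` agree to numerical precision; with M filters the same identity holds for each term of the sum Σm dm * xm.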
Convolutional BPDN
While convolutional forms can be constructed for each of Eq. (1)-(3), the focus herein is on the convolutional form of Eq. (1) with R(x)=∥x∥1, i.e., the Basis Pursuit DeNoising (BPDN) problem
$$\arg\min_{x} \; \frac{1}{2} \left\| D x - s \right\|_2^2 + \lambda \left\| x \right\|_1 \qquad (4)$$
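One simple way to solve Eq. (4) is the Iterative Shrinkage-Thresholding Algorithm (ISTA), a proximal gradient method that alternates a gradient step on the data fidelity term with soft thresholding, the proximal operator of the l1 norm. This is a minimal illustrative sketch, not the specific algorithm advocated herein:

```python
import numpy as np

def ista(D, s, lam, n_iter=500):
    """ISTA sketch for Eq. (4): argmin_x 0.5*||Dx - s||_2^2 + lam*||x||_1."""
    L = np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ x - s)         # gradient of the data fidelity term
        z = x - grad / L
        # Soft thresholding: proximal operator of lam*||.||_1 with step 1/L.
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return x
```

When D is orthonormal the minimizer is simply the soft-thresholded version of DTs, which provides a convenient sanity check.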
Recently, these techniques have been applied to computer vision problems such as face recognition and image classification. In this context, convolutional sparse representations were introduced, replacing Eq. (4) with
$$\arg\min_{\{x_m\}} \; \frac{1}{2} \left\| \sum_m d_m * x_m - s \right\|_2^2 + \lambda \sum_m \left\| x_m \right\|_1 \qquad (5)$$
where {dm} is a set of M dictionary filters, * denotes convolution, and {xm} is a set of coefficient maps, each of which is the same size as s. Here, s is a full image, and each dictionary filter dm typically has support much smaller than s. For notational simplicity, s and xm are considered to be N-dimensional vectors, where N is the number of pixels in the image, and the notation {xm} is adopted to denote all M of the xm stacked as a single column vector. The derivations presented here are for a single image with a single color band, but the extension to multiple color bands, for both images and filters, and to simultaneous sparse coding of multiple images is also possible. The extension to color and other multi-band images is mathematically straightforward, at the price of some additional notation.
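The reconstruction term of Eq. (5) can be sketched directly: each small filter dm is zero-padded to the image size and convolved (circularly, via the 2D FFT) with its image-sized coefficient map xm, and the results are summed. The filter and map values below are arbitrary illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 16                                 # image size (N = H*W pixels)
M = 3                                      # number of dictionary filters
filt = rng.standard_normal((M, 4, 4))      # small filters d_m
xmaps = rng.standard_normal((M, H, W))     # coefficient maps x_m, image-sized

def reconstruct(filt, xmaps, shape):
    """Evaluate sum_m d_m * x_m with circular 2D convolutions via the FFT."""
    acc = np.zeros(shape)
    for dm, xm in zip(filt, xmaps):
        dpad = np.zeros(shape)
        dpad[:dm.shape[0], :dm.shape[1]] = dm   # zero-pad d_m to image size
        acc += np.real(np.fft.ifft2(np.fft.fft2(dpad) * np.fft.fft2(xm)))
    return acc

recon = reconstruct(filt, xmaps, (H, W))
```

A useful sanity check: if one coefficient map is a unit impulse at the origin and the rest are zero, the reconstruction is just that filter, zero-padded.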
The original algorithm proposed for convolutional sparse coding adopted a splitting technique with alternating minimization of two sub-problems, the first consisting of the solution of a large linear system via an iterative method and the second consisting of a simple shrinkage. The resulting alternating minimization algorithm is similar to one that would be obtained within an ADMM framework, but requires continuation on the auxiliary parameter that controls the strength of the additional term penalizing violation of the constraint inherent in the splitting. All computation was performed in the spatial domain, since it was expected that computation in the discrete Fourier transform (DFT) domain would result in undesirable boundary artifacts. The DFT converts a finite list of equally spaced samples of a function into a list of coefficients of a finite combination of complex sinusoids, ordered by their frequencies. DFTs convert the sampled function from its original domain, such as time or a position along a line, to the frequency domain, as is known. Other algorithms that have been proposed for this problem include coordinate descent and a proximal gradient method, both operating in the spatial domain.
Recently, an ADMM algorithm operating in the DFT domain has been proposed for dictionary learning for convolutional sparse representations. The use of fast Fourier transforms (FFTs) in solving the relevant linear systems has been shown to give substantially better asymptotic performance than the original spatial-domain method. FFTs are algorithms that compute the DFT and its inverse, as is known. Evidence has also been presented to support the claim that the resulting boundary effects are not significant.
However, the computation time for solving the linear systems in this algorithm, which dominates its overall computational cost, is O(M³N). This significant cost renders the algorithm inefficient and impractical for large values of M (i.e., dictionaries with a large number of filters), and is likely the reason that this form of sparse representation has received little attention for image processing applications. Accordingly, an improved, more efficient approach may be beneficial.