Sparse Representation of Signals
In recent years there has been a growing interest in the study of sparse representation of signals. Using an overcomplete dictionary that contains prototype signal-atoms, signals are described as sparse linear combinations of these atoms. Applications that use sparse representation are many and include compression, regularization in inverse problems, feature extraction, and more. Recent activity in this field concentrated mainly on the study of pursuit algorithms that decompose signals with respect to a given dictionary. Designing dictionaries to better fit the above model can be done by either selecting one from a pre-specified set of linear transforms, or by adapting the dictionary to a set of training signals. Both these techniques have been considered in recent years, but this topic is largely still open.
Using an overcomplete dictionary matrix Dεn×K that contains K prototype signal-atoms for columns, {dj}j=1K, it is assumed that a signal yεn can be represented as a sparse linear combination of these atoms. The representation of y may either be exact y=Dx, or approximate, y≈Dx, satisfying |y−Dx′|p≦ε. The vector xεK displays the representation coefficients of the signal y. In approximation methods, typical norms used for measuring the deviation are the lp-norms for p=1, 2 and ∞.
If n<K and D is a full-rank matrix, an infinite number of solutions are available for the representation problem, hence constraints on the solution must be set. The solution with the fewest number of nonzero coefficients is certainly an appealing representation. This, sparsest representation, is the solution of either
                                                        (                              P                0                            )                        ⁢                                                  ⁢                                          min                x                            ⁢                                                                                        x                                                        0                                ⁢                                                                  ⁢                subject                ⁢                                                                  ⁢                to                ⁢                                                                  ⁢                y                                              =          Dx                ,                                  ⁢        or                            (        1        )                                                                    (                              P                                  0                  ,                  ε                                            )                        ⁢                                          min                x                            ⁢                                                                                        x                                                        0                                ⁢                                                                  ⁢                subject                ⁢                                                                  ⁢                to                ⁢                                                                  ⁢                                                                                                y                      -                      Dx                                                                            2                                                              ≤          ε                ,                            (        2        )            where ∥•∥0 is the l0 norm, counting the non zero entries of a vector.
Applications that can benefit from the sparsity and overcompleteness concepts (together or separately) include compression, regularization in inverse problems, feature extraction, and more. Indeed, the success of the JPEG2000 coding standard can be attributed to the sparsity of the wavelet coefficients of natural images. In denoising (removal of noise from noisy data so to obtain the unknown or original signal), wavelet methods and shift-invariant variations that exploit overcomplete representation, are among the most effective known algorithms for this task. Sparsity and overcompleteness have been successfully used for dynamic range compression in images, separation of texture and cartoon content in images, inpainting (changing an image so that the change is not noticeable by an observer) and restoration, and more.
In order to use overcomplete and sparse representations in applications, one needs to fix a dictionary D, and then find efficient ways to solve (1) or (2). Recent activity in this field has been concentrated mostly on the study of so called pursuit algorithms that represent signals with respect to a known dictionary, and approximate the solutions of (1) and (2). Exact determination of sparsest representations proves to be an NP-hard problem. Thus, approximate solutions are considered instead, and several efficient pursuit algorithms have been proposed in the last decade. The simplest ones are the Matching Pursuit (MP) and the Orthogonal Matching Pursuit (OMP) algorithms. Both are greedy algorithms that select the dictionary atoms sequentially. These methods are very simple, involving the computation of inner products between the signal and dictionary columns' and possibly deploying some least squares solvers. Both (1) and (2) are easily addressed by changing the stopping rule of the algorithm.
A second well known pursuit approach is the Basis Pursuit (BP). It suggests a convexisation of the problems posed in (1) and (2), by replacing the l0-norm with an l1-norm. The Focal Under-determined System Solver (FOCUSS) is very similar, using the lp-norm with p≦1, as a replacement to the l0-norm. Here, for p<1 the similarity to the true sparsity measure is better, but the overall problem becomes non-convex, giving rise to local minima that may divert the optimization. Lagrange multipliers are used to convert the constraint into a penalty term, and an iterative method is derived based on the idea of iterated reweighed least-squares that handles the lp-norm as an l2 weighted one.
Both the BP and FOCUSS can be motivated based on Maximum A Posteriori (MAP) estimation and indeed several works used this reasoning directly. The MAP can be used to estimate the coefficients as random variables by maximizing the posterior P(x|y,D)αP(y|D,x)P(x). The prior distribution on the coefficient vector x is assumed to be a super-Gaussian Independent Identically-Distributed (iid) distribution that favors sparsity. For the Laplace distribution this approach is equivalent to BP.
Extensive study of these algorithms in recent years has established that if the sought solution, x, is sparse enough, these techniques recover it well in the exact case. Further work considered the approximated versions and has shown stability in recovery of x. The recent front of activity revisits those questions within a probabilistic setting, obtaining more realistic assessments on pursuit algorithms performance and success. The properties of the dictionary D set the limits that may be assumed on the sparsity that consequently ensure successful approximation. Interestingly, in all the works mentioned so far, there is a preliminary assumption that the dictionary is known and fixed. There is a great need to address the issue of designing the proper dictionary in order to better fit the sparsity model imposed.
The Choice of the Dictionary
An overcomplete dictionary D that leads to sparse representations can either be chosen as a pre-specified set of functions, or designed by adapting its content to fit a given set of signal examples.
Choosing a pre-specified transform matrix is appealing because it is simpler. Also, in many cases it leads to simple and fast algorithms for the evaluation of the sparse representation. This is indeed the case for overcomplete wavelets, curvelets, contourlets, steerable wavelet filters, short-time-Fourier transforms, and more. Preference is typically given to tight frames that can easily be pseudo-inverted. The success of such dictionaries in applications depends on how suitable they are to sparsely describe the signals in question. Multiscale analysis with oriented basis functions and a shift-invariant property are guidelines in such constructions.
There is need to develop a different route for designing dictionaries D based on learning, and find the dictionary D that yields sparse representations for the training signals. Such dictionaries have the potential to outperform commonly used pre-determined dictionaries. With ever-growing computational capabilities, computational cost may become secondary in importance to the improved performance achievable by methods which adapt dictionaries for special classes of signals.
Sparse coding is the process of computing the representation coefficients, x, based on the given signal y and the dictionary D. This process, commonly referred to as “atom decomposition”, requires solving (1) or (2), and this is typically done by a “pursuit algorithm” that finds an approximate solution. Three popular pursuit algorithms are the Orthogonal Matching Pursuit (OMP), Basis Pursuit (BP) and the Focal Under-determined System Solver (FOCUSS).
Orthogonal Matching Pursuit is a greedy step-wise regression algorithm. At each stage this method selects the dictionary element having the maximal projection onto the residual signal. After each selection, the representation coefficients with regarding to the so far chosen atoms are found via least-squares. Formally, given a signal yεn, and a dictionary D with K l2-normalized columns {dk}k=1K, one starts by setting r0=y, k=1, and performing the following steps:
1) Select the index of the next dictionary element ik=argmaxw|<rk−1,dw>|;
2) Update the current approximation yk=argminyk∥y−yk∥22 such that ykεspan{di1, di2, . . . , dik}; and
3) Update the residual rk=y−yk.
The algorithm can be stopped after a predetermined number of steps, hence after having selected a fixed number of atoms. Alternatively, the stopping rule can be based on norm of the residual, or on the maximal inner product computed in the next atom selection stage.
OMP is an appealing and very simple to implement algorithm. Unlike other methods, it can be easily programmed to supply a representation with an a priori fixed number of non-zero entries—a desired outcome in the training of dictionaries. There are several variants of the OMP that suggest (i) skipping the least-squares and using inner product itself as a coefficient, or (ii) applying least-squares per every candidate atom, rather than just using inner-products at the selection stage, or (iii) doing faster and less precise search, where instead of searching for the maximal inner product, a nearly maximal one is selected, thereby speeding up the search.
Theoretic study has shown that the OMP solution is also the sparsest available one (solving (1)) if some conditions on the dictionary and on the exact solution prevail. More recent work has shown that the above is also true for the approximation version (2). These results and some later ones that apply to the basis pursuit and FOCUSS involve a key feature of the dictionary D called the mutual incoherence and defined as:
                    μ        =                              max                          i              ≠              j                                ⁢                                                                                    d                  i                  T                                ⁢                                  d                  j                                                                    .                                              (        3        )            
This measure quantifies how similar two columns of the dictionary can be. Given μ, if the sparse representation to be found has fewer than (1/μ) non-zeros, the OMP and its variants are guaranteed to succeed in recovering it.
Basis Pursuit (BP) algorithm proposes the replacement of the l0-norm in (1) and (2) with an l1-norm. Hence solutions of:
                                                        (                              P                1                            )                        ⁢                                                  ⁢                                          min                x                            ⁢                                                                                        x                                                        1                                ⁢                                                                  ⁢                subject                ⁢                                                                  ⁢                to                ⁢                                                                  ⁢                y                                              =          Dx                ,                            (        4        )            in the exact representation case, and
                                                        (                              P                                  1                  ,                  ε                                            )                        ⁢                                          min                x                            ⁢                                                                                        x                                                        1                                ⁢                                                                  ⁢                subject                ⁢                                                                  ⁢                to                ⁢                                                                  ⁢                                                                        y                    -                    Dx                                                                                                  ≤          ε                ,                            (        5        )            in the approximate one, lead to the BP representations. Solution of (4) amounts to linear programming, and thus there exists efficient solvers for such problems.
Recent research addressed the connection between the (P0) and (P1). The essential claims are quite similar to the ones of OMP, namely, if the signal representation to be found has fewer than (1/μ) non-zeros, the BP is guaranteed to succeed in recovering it. Similar results exist for the approximated case, proving that recovered representations are very close to the original sparse one in case of high sparsity.
Focal Under-determined System Solver (FOCUSS) is an approximating algorithm for finding the solutions of either (1) or (2), by replacing the l0-norm with an lp one for p≦1.
For the exact case problem, (P0), this method requires solving
                                          (                          P              p                        )                    ⁢                                          ⁢                                    min              x                        ⁢                                                                              x                                                  p                            ⁢                                                          ⁢              subject              ⁢                                                          ⁢              to              ⁢                                                          ⁢              y                                      =                  Dx          .                                    (        6        )            
The use of a Lagrange multiplier vector λε here yields the Lagrangian function(x,λ)=∥x∥p+λT(y−Dx).  (7)
Hence necessary conditions for a pair x, λ to be a solution of 6 arex(x,λ)=pπ(x)x−DTλ=0 and λ(x,λ)=Dx−y=0,  (8)where π(x) is defined to be a diagonal matrix with |xi|p−2| as its (i, i)th entry. The split of the lp-norm derivative into a linear term multiplied by a weight matrix is the core of the FOCUSS method, and this follows the well-known idea of iterated reweighed least-squares. Several simple steps of algebra leads to the solution:x=π(x)−1DT(Dπ(x)−1DT)−1y.  (9)
While it is impossible to get a closed form solution for x from the above result, an iterative replacement procedure can be proposed, where the right hand side is computed based on the currently known xk−1, and this leads to the updating process,xk=π(xk−1)−1DT(Dπ(xk−1)−1DT)−1y.  (10)
A regularization can, and should, be introduced to avoid near-zero entries in the weight matrix π(x).
For the treatment of (P0,ε) via the (Pp,ε) parallel expressions can be derived quite similarly, although in this case the determination of the Lagrange multiplier is more difficult and must be searched within the algorithm.
Recent work analyzed the (Pp) problem and showed its equivalence to the (P0), under conditions similar in flavor to the sparsity conditions mentioned above. Hence, this approach too enjoys the support of some theoretical justification, like BP and OMP. However, the analysis does not say anything about local minima traps and prospects in hitting those in the FOCUSS-algorithm.
Design of Dictionaries: Prior Art
There has been some work in the field regarding the training of dictionaries based on a set of examples. Given such set Y={yi}i=1N, we assume that there exists a dictionary D that gave rise to the given signal examples via sparse combinations, i.e., we assume that there exists D, so that solving (P0) for each example yk gives a sparse representation xk. It is in this setting that the question is raised what the proper dictionary D is.
A. Generalizing the K-Means
There is an intriguing relation between sparse representation and clustering (i.e., vector quantization). In clustering, a set of descriptive vectors {dk}k=1K is learned, and each sample is represented by one of those vectors (the one closest to it, usually in the l2 distance measure). One can think of this as an extreme sparse representation, where only one atom is allowed in the signal decomposition, and furthermore, the coefficient multiplying it must be 1. There is a variant of the vector quantization (VQ) coding method, called Gain-Shape VQ, where this coefficient is allowed to vary. In contrast, in sparse representations relevant to the invention, each example is represented as a linear combination of several vectors {dk}k=1K. Thus, sparse representations can be referred to as a generalization of the clustering problem.
Since the K-Means algorithm (also known as generalized Lloyd algorithm—GLA) is the most commonly used procedure for training in the vector quantization setting, it is natural to consider generalizations of this algorithm when turning to the problem of dictionary training. The K-Means process applies two steps per each iteration: (i) given {dk}k=1K, assign the training examples to their nearest neighbor; and (ii) given that assignment, update {dk}k=1K to better fit the examples.
The approaches to dictionary design that have been tried so far are very much in line with the two-step process described above. The first step finds the coefficients given the dictionary—a step we shall refer to as “sparse coding”. Then, the dictionary is updated assuming known and fixed coefficients. The differences between the various algorithms that have been proposed are in the method used for the calculation of coefficients, and in the procedure used for modifying the dictionary.
B. Maximum Likelihood Methods
Maximum likelihood methods use probabilistic reasoning in the construction of D. The proposed model suggests that for every example y the relationy=Dx+v,  (11)holds true with a sparse representation x and Gaussian white residual vector v with variance σ2. Given the examples Y={yi}i=1N these works consider the likelihood function P (Y|D) and seek the dictionary that maximizes it. Two assumptions are required in order to proceed—the first is that the measurements are drawn independently, readily providing
                              P          ⁡                      (                          Y              |              D                        )                          =                              ∏                          i              =              1                        N                    ⁢                                    P              ⁡                              (                                                      y                    i                                    |                  D                                )                                      .                                              (        12        )            
The second assumption is critical and refers to the “hidden variable” x. The ingredients of the likelihood function are computed using the relationP(yi|D)=∫P(yi,x|D)dx=∫P(yi|x·D)·P(x)dx.  (13)
Returning to the initial assumption in (11), we have
                              P          ⁡                      (                                                            y                  i                                |                x                            ,              D                        )                          =                              Const            ·            exp                    ⁢                                    {                                                1                                      2                    ⁢                                          σ                      2                                                                      ⁢                                                                                                Dx                      -                                              y                        i                                                                                                  2                                            }                        .                                              (        14        )            
The prior distribution of the representation vector x is assumed to be such that the entries of x are zero-mean iid, with Cauchy or Laplace distributions. Assuming for example a Laplace distribution we get
                                                                        P                ⁡                                  (                                                            y                      i                                        |                    D                                    )                                            =                              ∫                                                                            P                      ⁡                                              (                                                                                                            y                              i                                                        |                            x                                                    ,                          D                                                )                                                              ·                                          P                      ⁡                                              (                        x                        )                                                                              ⁢                                      ⅆ                    x                                                                                                                          =                              Const                ·                                  ∫                                      exp                    ⁢                                                                  {                                                                              1                                                          2                              ⁢                                                              σ                                2                                                                                                              ⁢                                                                                                                                                  Dx                                -                                                                  y                                  i                                                                                                                                                    2                                                                          }                                            ·                      exp                                        ⁢                                          {                                              λ                        ⁢                                                                                                          x                                                                                1                                                                    }                                        ⁢                                          ⅆ                      x                                                                                                                              (        15        )            
This integration over x is difficult to evaluate, and indeed, it has been handled by replacing it with the extremal value of P(yi,x|D). The overall problem turns into
                                                        D              =                              arg                ⁢                                                                  ⁢                                                      max                    D                                    ⁢                                                            ∑                                              i                        =                        1                                            N                                        ⁢                                                                  max                                                  x                          i                                                                    ⁢                                              {                                                  P                          ⁡                                                      (                                                                                          y                                i                                                            ,                                                                                                x                                  i                                                                |                                D                                                                                      )                                                                          }                                                                                                                                                                    =                              arg                ⁢                                                                  ⁢                                                      min                    D                                    ⁢                                                            ∑                                              i                        =                        1                                            N                                        ⁢                                                                  min                                                  x                          i                                                                    ⁢                                              {                                                                                                                                                                                                            Dx                                  i                                                                -                                                                  y                                  i                                                                                                                                                    2                                                    +                                                      λ                            ⁢                                                                                                                                                            x                                  i                                                                                                                            1                                                                                                      }                                                                                                                                                    (        16        )            
This problem does not penalize the entries of D as it does for the ones of xi. Thus, the solution will tend to increase the dictionary entries' values, in order to allow the coefficients to become closer to zero. This difficulty has been handled by constraining the l2-norm of each basis element, so that the output variance of the coefficients is kept at an appropriate level.
An iterative method was suggested for solving (16). It includes two main steps in each iteration: (i) calculate the coefficients xi using a simple gradient descent procedure; and then (ii) update the dictionary using
                              D                      (                          n              +              1                        )                          =                              D                          (              n              )                                -                      η            ⁢                                          ∑                                  i                  =                  1                                N                            ⁢                                                          ⁢                                                (                                                                                    D                                                  (                          n                          )                                                                    ⁢                                              x                        i                                                              -                                          y                      i                                                        )                                ⁢                                  x                  i                  T                                                                                        (        17        )            
This idea of iterative refinement, mentioned before as a generalization of the K-Means algorithm, was later used again by other researchers, with some variations.
A different approach to handle the integration in (15) has been suggested. It consisted in approximating the posterior as a Gaussian, enabling an analytic solution of the integration. This allows an objective comparison of different image models (basis or priors). It also removes the need for the additional re-scaling that enforces the norm constraint. However, this model may be too limited in describing the true behaviors expected. This technique and closely related ones have been referred to as approximated ML techniques.
There is an interesting relation between the maximum likelihood method and the Independent Component Analysis (ICA) algorithm. The latter handles the case of a complete dictionary (the number of elements equals the dimensionality) without assuming additive noise. The maximum likelihood method is then similar to ICA in that the algorithm can be interpreted as trying to maximize the mutual information between the inputs (samples) and the outputs (the coefficients).
C. The Method of Optimal Directions
The Method of Optimal Directions (MOD), a dictionary-training algorithm, follows more closely the K-Means outline, with a sparse coding stage that uses either the OMP or FOCUSS, followed by an update of the dictionary. The main contribution of the MOD method is its simple way of updating the dictionary.
Assuming that the sparse coding for each example is known, we define the errors ei=yi−Dxi. The overall representation mean square error is given by∥E∥F2=∥[e1,e2, . . . ,eN]∥F2=∥Y−DX∥F2.  (18)Here we have concatenated all the examples yi as columns of the matrix Y, and similarly gathered the representations coefficient vectors xi to build the matrix X. The notation ∥A∥F stands for the Frobenius Norm, defined as ∥A∥F=√{square root over (ΣijAij2)}.
Assuming that X is fixed, we can seek an update to D such that the above error is minimized. Taking the derivative of (10) with respect to D we obtain the relation (Y−DX)XT=0, leading toD(n+1)=YX(n)T·(X(n)X(n)T)−1  (19)
In updating the dictionary, the update relation given in (19) is the best that can be achieved for fixed X. The iterative steepest descent update in (17) is far slower. Interestingly, in both stages of the algorithm, the difference is in deploying a second order (Newtonian) update instead of a first-order one. Looking closely at the update relation in (17), it could be written as
                                                                        D                                  (                                      n                    +                    1                                    )                                            =                            ⁢                                                D                                      (                    n                    )                                                  +                                  η                  ⁢                                                                          ⁢                                      EX                                                                  (                        n                        )                                            T                                                                                                                                              =                            ⁢                                                D                                      (                    n                    )                                                  +                                                      η                    ⁡                                          (                                              Y                        -                                                                              D                                                          (                              n                              )                                                                                ⁢                                                      X                                                          (                              n                              )                                                                                                                          )                                                        ⁢                                      X                                                                  (                        n                        )                                            T                                                                                                                                              =                            ⁢                                                                    D                                          (                      n                      )                                                        ⁡                                      (                                          I                      -                                              η                        ⁢                                                                                                  ⁢                                                  X                                                      (                            n                            )                                                                          ⁢                                                  X                                                                                    (                              n                              )                                                        T                                                                                                                )                                                  +                                  η                  ⁢                                                                          ⁢                                                            YX                                                                        (                          n                          )                                                T                                                              .                                                                                                          (        20        )            
Using infinitely many iterations of this sort, and using small enough η, this leads to a steady state outcome, which is exactly the MOD update matrix (19). Thus, while the MOD method assumes known coefficients at each iteration, and derives the best possible dictionary, the ML method by Olshausen and Field only gets closer to this best current solution, and then turns to calculate the coefficients. Note, however, that in both methods a normalization of the dictionary columns is required and done.
D. Maximum A-Posteriori Probability Approach
The same researchers that conceived the MOD method also suggested a maximum a-posteriori probability (MAP) setting for the training of dictionaries, attempting to merge the efficiency of the MOD with a natural way to take into account preferences in the recovered dictionary. This probabilistic point of view is very similar to the ML methods discussed above. However, rather than working with the likelihood function P(Y|D), the posterior P(D|Y) is used. Using Bayes rule, we have P(D|Y)αP(Y|D)P(D), and thus we can use the likelihood expression as before, and add a prior on the dictionary as a new ingredient.
Research currents considered several priors P(D) and per each proposed an update formula for the dictionary. The efficiency of the MOD in these methods is manifested in the efficient sparse coding, which is carried out with FOCUSS. The proposed algorithms in this family deliberately avoid a direct minimization with respect to D as in MOD, due to the prohibitive n×n matrix inversion required. Instead, iterative gradient descent is used.
When no prior is chosen, the update formula is the very one used in (17). A prior that constrains D to have a unit Frobenius norm leads to the update formulaD(n+1)−D(n)+ηEXT+η·tr(XETD(n))D(n).  (21)
As can be seen, the first two terms are the same ones as in (17). The last term compensates for deviations from the constraint. This case allows different columns in D to have different norm values. As a consequence, columns with small norm values tend to be under-used, as the coefficients they need are larger and as such more penalized.
This led to the second prior choice, constraining the columns of D to have a unit l2-norm. The new update equation formed is given bydi(n+1)=di(n)+η(I−di(n)di(n)T)E·iT,  (22)where xTi is the i-th column in the matrix XT.
Compared to the MOD, this line of work provides slower training algorithms.
E. Unions of Orthogonal Bases
Recent work considered a dictionary composed as a union of orthonormal basesD=[D1;D2, . . . ,DL],where Djεn×n, j=1, 2, . . . , L are orthonormal matrices. Such a dictionary structure is quite restrictive, but its updating may potentially be made more efficient.
The coefficients of the sparse representations X can be decomposed to L pieces, each referring to a different ortho-basis. Thus,X=[X1,X2, . . . ,XL]T,where Xi is the matrix containing the coefficients of the orthonormal dictionary Di.
One of the major advantages of the union of ortho-bases is the relative simplicity of the pursuit algorithm needed for the sparse coding stage. The coefficients are found using the Block Coordinate Relaxation (BCR) algorithm. This is an appealing way to solve (P1,ε) as a sequence of simple shrinkage steps, such that at each stage Xi is computed, while keeping all the other pieces of X fixed. Thus, this evaluation amounts to a simple shrinkage.
Assuming known coefficients, the proposed algorithm updates each orthonormal basis Dj sequentially. The update of Dj is done by first computing the residual matrix
      E    j    =            [                        e          1                ,                  e          2                ,        …        ⁢                                  ,                  e          N                    ]        =          Y      -                        ∑                      i            ≠            j                                                          ⁢                                  ⁢                              D            i                    ⁢                                    X              i                        .                              
Then, by computing the singular value decomposition of the matrix EjXTj=UΛVT, the update of the j-th ortho-basis is done by Dj=UVT. This update rule is obtained by solving a constrained least squares problem with ∥Ej−DjXj∥F2 as the penalty term, assuming fixed coefficients Xj and error Ej. The constraint is over the feasible matrices Dj, which are forced to be orthonormal.
This way the proposed algorithm improves each matrix Dj separately, by replacing the role of the data matrix Y in the residual matrix Ej, as the latter should be represented by this updated basis.
Grinbonval suggested a slightly different method. Apart from the fact that here the dictionary is structured, handling a union of orthonormal bases, it updates each orthonormal bases sequentially, and thus reminds the sequential updates done in the K-means. Experimental results show weak performance compared to previous methods. This could partly be explained by the fact that the update of Dj depends strongly on the coefficients Xj.
K-Means Algorithm for Vector Quantization (VQ)
In VQ, a codebook C that includes K codewords (representatives) is used to represent a wide family of vectors (signals) Y={yi}i=1N (NK) by a nearest neighbor assignment. This leads to an efficient compression or description of those signals, as clusters in  surrounding the chosen codewords. Based on the expectation maximization procedure, the K-Means can be extended to suggest a fuzzy assignment and a covariance matrix per each cluster, so that the data is modeled as a mixture of Gaussians.
The dictionary of VQ codewords is typically trained using the K-Means algorithm. We denote the codebook matrix by C=[c1, c2, . . . , cK], the codewords being the columns. When C is given, each signal is represented as its closest codeword (under l2-norm distance). We can write yi=Cxi, where xi=ej is a vector from the trivial basis, with all zero entries except a one in the j-th position. The index j is selected such that∀k≠j∥yi−Cej∥22≦∥yi−Cek∥22.
This is considered as an extreme case of sparse coding in the sense that only one atom is allowed to participate in the construction of yi, and the coefficient is forced to be 1. The representation MSE per yi is defined asei2=∥yi−Cxi∥22.  (23)and the overall MSE is
                    E        =                                            ∑                              i                =                1                            K                        ⁢                          e              i              2                                =                                                                                      Y                  -                  CX                                                            F              2                        .                                              (        24        )            The VQ training problem is to find a codebook C that minimizes the error E, subject to the limited structure of X, whose columns must be taken from the trivial basis,
                                          min                          C              ,              X                                ⁢                                    {                                                                                      Y                    -                    CX                                                                    F                2                            }                        ⁢                                                  ⁢            subject            ⁢                                                  ⁢            to            ⁢                                                  ⁢                          ∀              i                                      ,                              x                          i              ⁢                                                                            =                                    e              k                        ⁢                                                  ⁢            for            ⁢                                                  ⁢            some            ⁢                                                  ⁢                          k              .                                                          (        25        )            
The K-Means algorithm is an iterative method used for designing the optimal codebook for VQ. In each iteration there are two stages—one for sparse coding that essentially evaluates X, and one for updating the codebook.
The sparse coding stage assumes a known codebook C(J−1), and computes a feasible X that minimizes the value of (25). Similarly, the dictionary update stage fixes X as known, and seeks an update of C so as to minimize (25). Clearly, at each iteration either a reduction or no change in the MSE is ensured. Furthermore, at each such stage, the minimization step is optimal under the assumptions. As the MSE is bounded from below by zero, and the algorithm ensures a monotonic decrease of the MSE, convergence to at least a local minimum solution is guaranteed. Stopping rules for the above-described algorithm can vary a lot but are quite easy to handle.