Technical Field
The present disclosure relates to hierarchical sparse dictionary learning (“HiSDL”). More particularly, the present disclosure is related to a hierarchical sparse system and method to characterize given data, including high-dimensional time series, for outputting an interpretable dictionary which is adaptive to given data and generalizable by a priori dictionaries.
Description of the Related Art
Sparse coding plays a key role in high dimensional data analysis. In particular, sparse coding is a class of unsupervised methods for learning sets of over-complete bases to represent data efficiently. There is an advantage of having an over-complete bases in sparse coding to capture structures and patterns inherent in the input data, with the additional criterion of sparsity to resolve the degeneracy introduced by over-completeness.
Similarly, sparse representation has been proven to be very powerful in analyzing high dimensional signals, where typically each signal is represented as a linear combination of a few atoms in a given over-completed dictionary. For example, the sparse representation problem may be formulated as:
                                          w            ^                    =                                    arg              ⁢                                                min                  w                                ⁢                                                                                                  w                                                              0                                    ⁢                                                                          ⁢                                      s                    .                    t                    .                                                                                  ⁢                                                                                        Dw                        -                        x                                                                                                                                        ≤            σ                          ,                            (        1        )            where x is a signal vector such that xεRd, Rd is a vector of d real numbers, σ is a threshold value which may control the difference between Dw and x (e.g., a small positive number), D is a dictionary, ŵ is a the optimal estimation of w, and w is a pursued sparse code. The pursued sparse code w may be considered a robust representation of x, and therefore can be used for clustering, classification, and denoising. It is noted that variables defined herein have the same meaning throughout unless otherwise indicated.
Generally, there are two major approaches to construct an over-completed dictionary that is suitable for sparse representation, namely an analytic-based approach and a learning-based approach. In an analytic-based approach, the dictionary is carefully designed a priori, e.g., with atoms such as wavelets, curvelets, and shealets. These handcrafted dictionaries are then applied to different signals. One of the advantages of the analytic-based approach is that the dictionary can be designed to be well-conditioned for stable representation, for instance, to have a better incoherence condition or restricted isometric property.
In a learning-based approach, the dictionary is learned from the given signals. Compared to the analytic approach, the learned dictionaries are usually more adaptive to the given signals, and therefore lead to a sparser and more robust representation. The learning-based approach outperforms analytic-based approaches in many tasks, such as denoising, classification, etc. However, the dictionary learning problem is non-convex, which is usually formulated as follows:
                                          {                                          D                ^                            ,                              W                ^                                      }                    =                                    arg              ⁢                                                min                                                            D                      ∈                      C                                        ,                    W                                                  ⁢                                                                                                                          X                        -                        DW                                                                                    F                    2                                    ⁢                                                                          ⁢                                      s                    .                    t                    .                                                                                  ⁢                                                                                          W                                                                    0                                                                                            ≤            k                          ,                            (        2        )            where X is the data set, {circumflex over (D)} is the optimal estimation of dictionary D, Ŵ is the optimal estimation of W, W is the data representation over the dictionary D, that is, after the dictionary D is learned, each data point can be represented as a combination of the dictionary atoms, and W represents the combination (e.g., the coding), k is the number of non-zero values in a matrix, and C is the constraint such that DεC. Therefore, under the learning-based approach, it is very difficult to find the global optimal solution (e.g., selection from a given domain which provides the highest or lowest value) or local optimum solution (e.g., selection for which neighboring selections yield values that are not greater or smaller) close enough when the function is applied.
Additional common problems associated with the prior approaches includes overfitting, which may occur when a statistical model describes random error or noise instead of the underlying relationship. Overfitting may lead to poor predictive performance and generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. Prior approaches, therefore, do not provide a dictionary that is both adaptive to the given data and regularized by a priori hand-crafted over-completed dictionaries. Moreover, prior approaches fail to provide methods that can properly handle high dimensional and heterogeneous time series to derive meaningful information from them.