Large dynamic multidimensional datasets (“big data”) are common in a variety of fields. Examples of such fields include finance, communication networking (e.g. protocols such as TCP/IP, UDP, HTTP, HTTPS, SCADA and cellular) and streaming, social networking, imaging, databases, e-mails, governmental databases and critical infrastructures. In these fields, multidimensional data points (MDPs) are accumulated constantly. A main goal in processing big data is to understand it and to extract intelligence from it. Big data can be described by hundreds or thousands of parameters (features). Consequently, in its original form in a source metric space, big data is difficult to comprehend, to process, to analyze and to draw conclusions from.
Dimensionality reduction methods that embed data from a metric space (where only the mutual distances or “affinities” between MDPs are given) into a lower-dimension (vector) space are known. One such method involves diffusion maps (“DM”), see R. R. Coifman and S. Lafon, “Diffusion Maps”, Applied and Computational Harmonic Analysis, 21:5-30, 2006. A kernel method such as DM assigns distances between MDPs. These distances quantify the affinities between the MDPs. In the DM method, a diffusion operator is first formed on the MDPs. Spectral decomposition of the operator then produces from the data a family of maps in a Euclidean space. The result is an “embedded” MDP matrix. The Euclidean distances between the embedded MDPs approximate the diffusion distances between the MDPs in the source metric space, where the diffusion distance reflects the transition probabilities in t time steps from one MDP to another. In the MDP matrix, each row contains one MDP. A spectral decomposition of the MDP matrix, whose dimensions are proportional to the size of the data, has high computational costs. One problem is to determine how a new, ‘unseen’ sample (newly arrived MDP) can be mapped into a previously learnt or established embedded lower-dimension space. The DM procedure in particular cannot be repeated constantly for each newly arrived MDP.
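The DM steps described above (form a diffusion operator, decompose it spectrally, read off Euclidean coordinates) can be illustrated with a minimal sketch. It is not the cited authors' implementation: it assumes the MDPs are available as rows of a NumPy array, that affinities come from a Gaussian kernel with a hypothetical scale parameter `eps`, and that a plain row-stochastic normalization is acceptable. The function name `diffusion_map` and its arguments are illustrative only.

```python
import numpy as np

def diffusion_map(X, eps, n_coords=2, t=1):
    """Basic diffusion-map embedding of the rows of X (one MDP per row)."""
    # Pairwise squared Euclidean distances between MDPs.
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    # Gaussian affinities (assumed kernel; eps is a hypothetical scale).
    K = np.exp(-d2 / eps)
    q = K.sum(axis=1)  # row sums (degrees)
    # Symmetric conjugate A = D^{-1/2} K D^{-1/2} of the Markov matrix P = D^{-1} K.
    A = K / np.sqrt(np.outer(q, q))
    w, V = np.linalg.eigh(A)           # eigenvalues in ascending order
    order = np.argsort(w)[::-1]        # reorder to descending
    w, V = w[order], V[:, order]
    # Right eigenvectors of P are recovered as D^{-1/2} v.
    psi = V / np.sqrt(q)[:, None]
    # Skip the trivial constant eigenvector; scale by eigenvalue^t.
    coords = psi[:, 1:n_coords + 1] * (w[1:n_coords + 1] ** t)
    return coords, w, psi
```

The cost of the eigendecomposition grows with the number of MDPs, which is why rerunning such a routine for every newly arrived MDP is impractical.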
Consider as an example a simple classification problem involving a set of training samples and a separate set of test samples, the latter used to check the validity of the classification. If one wishes to reduce the dimensionality of these datasets so that one can perform the classification in a lower-dimension space, one option is to combine the training and test sets into one “combined” dataset and to perform the coordinate computation on this combined dataset before splitting into two sets again in the low-dimension space. Another option is to run the algorithm on the training set only, then apply what has been learnt from this process to map the test set into the lower-dimension space. The advantages of the latter approach are that it is potentially less computationally expensive and that new samples can be continually added to the lower-dimension embedding without the need to re-compute the lower-dimension space. This approach is commonly referred to as the “out-of-sample extension” or “OOSE”.
In an OOSE problem, a new MDP needs to be mapped into a space that can be low-dimensional without affecting this space and without requiring a re-learning or change in the space parameterization for future learning. When the mapping is into a lower-dimension space it is also called “sampling” or “sub-sampling”. One way to perform OOSE is by using a geometric harmonics methodology (see e.g. R. R. Coifman and S. Lafon, “Geometric Harmonics: A novel tool for multi-scale out-of-sample extension of empirical functions”, Applied and Computational Harmonic Analysis, 21(1), 31-52, 2006, referred to hereinafter as “GH”). Another way to perform OOSE is by using the Nyström method (see C. T. H. Baker, “The numerical treatment of integral equations”, Oxford: Clarendon Press, 1977 and W. H. Press et al., “Numerical Recipes in C”, Cambridge University Press, 2nd Edition, 1992, pages 791-802, hereinafter “Press”).
The OOSE is performed on data in which the only known entities are the affinities between MDPs, as well as on empirical functions. The goal is to sub-sample big data and then to find the coordinates of a newly arrived MDP where only affinities between the source MDPs are given. The empirical functions (which may be, for example, either functions or mappings from one space to another, such as an embedding) are defined on MDPs and are employed for embedding newly arrived MDPs. The embedding occurs in a Euclidean space, determined once by a finite set of MDPs in a training process. The affinities between MDPs in an original source space (which form a training dataset) are converted into coordinates of locations in the embedded (Euclidean) space. The conversion of affinities of newly arrived MDPs into coordinates of locations in the embedded space is then done reliably and quickly without the need to repeat the entire computation done in the training phase. To clarify, the training process is not repeated for each newly arrived MDP.
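A Nyström-style extension of the kind mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of the cited references: it assumes a symmetric Gaussian kernel with a hypothetical scale `eps`, takes the embedding coordinates to be eigenvectors of the training kernel matrix, and extends each eigenvector to a new MDP by a kernel-weighted average divided by the eigenvalue. The helper names (`gaussian_kernel`, `fit_kernel_eigs`, `nystrom_extend`) are illustrative.

```python
import numpy as np

def gaussian_kernel(A, B, eps):
    """Gaussian affinities between the rows of A and the rows of B."""
    d2 = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(d2, 0.0) / eps)

def fit_kernel_eigs(X_train, eps, n_coords):
    """Training phase: leading eigenpairs of the training kernel matrix."""
    G = gaussian_kernel(X_train, X_train, eps)
    w, V = np.linalg.eigh(G)                  # ascending eigenvalues
    idx = np.argsort(w)[::-1][:n_coords]      # keep the largest n_coords
    return w[idx], V[:, idx]

def nystrom_extend(X_new, X_train, eps, w, V):
    """Extend eigenvectors to new MDPs: phi_j(x) = sum_i k(x, x_i) V[i, j] / w[j]."""
    k = gaussian_kernel(X_new, X_train, eps)  # shape (m, n)
    return (k @ V) / w
```

Only affinities between the new MDP and the training MDPs are needed at extension time; the eigendecomposition from the training phase is reused unchanged. Dividing by small eigenvalues amplifies noise, so a sketch like this is only sensible for the leading eigenpairs.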
A numerical rank of a matrix is the number of numerically independent columns of the matrix. Suppose that l(s) is the numerical rank of an n×n Gaussian kernel matrix G(s) (EQ. 5) for a fixed scale s. To sub-sample the data points correctly, one needs to identify the l(s) columns in G(s) that constitute a well-conditioned basis for its numerical range. In other words, one needs to look for an n×l(s) matrix B(s) whose columns constitute a subset of the columns of G(s) and for an l(s)×n matrix P(s), such that l(s) of its columns make up an identity matrix and B(s)P(s)≈G(s). Such a matrix factorization is called interpolative decomposition (“ID”). The MDPs D3={x_{s_1}, x_{s_2}, . . . , x_{s_{l(s)}}} associated with the columns of B(s) constitute the sampled dataset at scale s.
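The factorization B(s)P(s)≈G(s) can be illustrated with a small sketch based on a column-pivoted QR decomposition. This is one common way to obtain an interpolative decomposition and is not necessarily the algorithm used in the cited works; the function name `interpolative_decomposition` and the relative tolerance `tol` used to estimate the numerical rank are assumptions made for illustration.

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

def interpolative_decomposition(G, tol=1e-8):
    """Return B (selected columns of G), P, and the selected column indices."""
    n = G.shape[1]
    # Column-pivoted QR: G[:, perm] = Q @ R with non-increasing |diag(R)|.
    Q, R, perm = qr(G, mode="economic", pivoting=True)
    diag = np.abs(np.diag(R))
    # Numerical rank estimate: pivots above a relative tolerance (assumed rule).
    k = int(np.sum(diag > tol * diag[0]))
    cols = perm[:k]
    P = np.zeros((k, n))
    P[:, cols] = np.eye(k)        # the selected columns map to an identity block
    if k < n:
        # Express the remaining columns in terms of the selected ones.
        P[:, perm[k:]] = solve_triangular(R[:k, :k], R[:k, k:])
    B = G[:, cols]
    return B, P, cols
```

The indices in `cols` play the role of the sampled MDPs: each one points to a column of G(s), and therefore to one MDP of the dataset, that is kept at scale s.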
FIG. 1A is a flow chart illustrating a known method for determining a well-conditioned basis of an input MDP matrix. The MDP matrix serves as an input to a sampling phase in step 110. The MDP matrix is assumed to have Gaussian entries, in particular as computed by EQ. 3 below. Deterministic single scale MDP sampling using a deterministic interpolative decomposition (DID) algorithm is performed on the input data in step 120. The DID algorithm may be based exemplarily on H. Cheng et al., “On the compression of low rank matrices”, SIAM J. Scientific Computing, Vol. 26, 1389-1404, 2005. The output of step 120 (a basis of the input data) is used as an input to a randomized interpolative decomposition (RID) algorithm in step 130. The RID algorithm may be based exemplarily on P. G. Martinsson et al., “A randomized algorithm for the decomposition of matrices”, Applied and Computational Harmonic Analysis, Vol. 30 (1), 47-68, 2011. The output of the RID algorithm is a well-conditioned basis. Note that the use of DID and RID algorithms to generate a well-conditioned basis is exemplary, and the well-conditioned basis may be obtained in other ways. Details of the DID and RID algorithms are as follows: