The data sizes associated with analytical applications continuously increase, and many data scientists are switching from customized micro-solutions to scalable alternatives (e.g., statistical and scientific databases). However, many data mining algorithms may be expressed in terms of linear algebra, which is barely supported by major database vendors and big data solutions. Moreover, conventional linear algebra algorithms and legacy matrix representations are often not suitable for very large matrices.
Recently, big data matrices have appeared in many analytical applications in science and business domains. These include solving linear systems, principal component analysis, clustering and similarity-based applications, such as non-negative matrix factorization in gene clustering, as well as algorithms on large graphs (e.g., multi-source breadth-first search). A very common, and often substantially expensive, operation involved in all of the aforementioned applications is matrix multiplication, where at least one matrix is usually large and sparse. As an example, consider a similarity query associated with text mining: a term-document matrix A, whose entry (A)_ij contains the frequency of term j in document i, is multiplied with its transpose to obtain the cosine similarity matrix of documents, D = AA^T (with the rows of A normalized to unit length). Similarly, in gene clustering, the core computation contains iterative multiplications of the large, sparse gene expression matrix V with a dense matrix H^T.
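The document-similarity example can be illustrated with a minimal sketch. It assumes a row-wise sparse layout in which each document is a dictionary from term index to frequency; the helper names normalize_rows and sparse_aat and the toy data are hypothetical and chosen only for illustration. Rows are scaled to unit length first so that the entries of D = AA^T are cosine similarities.

```python
from collections import defaultdict
from math import sqrt

def normalize_rows(rows):
    """Scale each sparse row (dict: term index -> frequency) to unit Euclidean length."""
    normalized = []
    for row in rows:
        norm = sqrt(sum(v * v for v in row.values()))
        normalized.append({j: v / norm for j, v in row.items()} if norm else {})
    return normalized

def sparse_aat(rows):
    """Compute D = A A^T for a row-wise sparse matrix, touching only nonzero entries."""
    # Build the column-wise view (the rows of A^T): term j -> [(document i, value)].
    columns = defaultdict(list)
    for i, row in enumerate(rows):
        for j, v in row.items():
            columns[j].append((i, v))
    # Every term shared by documents i and k contributes A[i][j] * A[k][j] to D[i][k].
    d = defaultdict(float)
    for entries in columns.values():
        for i, vi in entries:
            for k, vk in entries:
                d[(i, k)] += vi * vk
    return dict(d)

# Three documents over a five-term vocabulary; absent terms are implicit zeros.
A = [
    {0: 2.0, 1: 1.0},
    {1: 1.0, 3: 3.0},
    {4: 1.0},
]
D = sparse_aat(normalize_rows(A))
```

Document pairs that share no term never appear in D, so both the work and the size of the result follow the sparsity of A rather than the full number of document pairs.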
Such applications used to be custom-coded by data analysts in small-scale solutions, often using numerical frameworks like R or Matlab. However, the growth of data volume and the increasing complexity of modern hardware architectures have driven scientists to shift from handcrafted implementations to scalable alternatives, such as massively parallel frameworks and database systems. While frameworks like R or Matlab may provide a suitable language environment to develop mathematical algorithms, such implementations are not scalable out of the box. As a consequence, a scalable system that provides a basic set of efficient linear algebra primitives may be desirable. Some systems may provide an R (or R-like) interface and deep integrations of linear algebra primitives, such as sparse matrix-matrix and matrix-vector multiplications. Further note that with the decrease in Random Access Memory (“RAM”) prices, and the corresponding scale-up in main memory systems, it has become feasible to run basic linear algebra algorithms directly on big data sets that reside in an in-memory column store. However, most frameworks and libraries require the user to predefine the final data structure of a matrix. Furthermore, matrices may be stored as a whole in either a sparse or a dense static format, resulting in poor memory utilization and processing performance when the data representation is not chosen wisely.
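To make the cost of a poorly chosen static format concrete, the following rough estimate compares the memory footprint of a dense array with that of a sparse layout. It assumes compressed sparse row (CSR) as a representative sparse format, 8-byte values, and 4-byte indices; these choices and the function names are assumptions made for illustration and are not prescribed by the text above.

```python
def dense_bytes(rows, cols, value_bytes=8):
    """Bytes needed to store every element, zeros included."""
    return rows * cols * value_bytes

def csr_bytes(rows, nnz, value_bytes=8, index_bytes=4):
    """Bytes for CSR: one value and one column index per nonzero, plus rows + 1 row pointers."""
    return nnz * (value_bytes + index_bytes) + (rows + 1) * index_bytes

def cheaper_format(rows, cols, nnz):
    """Return the name of the layout with the smaller estimated footprint."""
    return "sparse" if csr_bytes(rows, nnz) < dense_bytes(rows, cols) else "dense"

# A 1000 x 1000 matrix at 1% density versus the same shape at 90% density.
sparse_case = cheaper_format(1000, 1000, 10_000)
dense_case = cheaper_format(1000, 1000, 900_000)
```

Under these assumptions the 1% dense matrix needs about 124 KB in CSR against 8 MB as a dense array, whereas at 90% density the sparse layout is the larger of the two. A single format fixed in advance for the whole matrix can therefore waste memory whenever the density differs from what was expected.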