Many critical data applications include irregular computations, such as computations over sparse data or computations in which what particular operations are to be performed may be determined at run time, based on the values of one or more data elements. As an example, many real-world data analysis applications use multi-linear algebraic (or tensor) computations, such as tensor decompositions, for performing analysis of multi-dimensional real-world data. The multi-dimensional arrays representing real data are usually very large and can be sparse making the tensor computations performed over them irregular. These large, sparse multi-dimensional arrays (generally called sparse tensors) pose significant challenges to optimizing computations performed with them. The corresponding irregular sequence of operations (also called an irregular code) is, in general, harder to parallelize than a regular sequence of operations at least in part because the amount of parallelism may be dependent on the input data that can be known only at run time. Further, the irregular codes are usually memory-bound and the processor often spend substantial time in accessing the required memory locations. The fact that data locality optimizations may depend on the input data, which may become available only at run time, can further complicate the design of locality, parallelization, and/or other optimizations that may be used to improve the performance of irregular codes. Additionally, the volume of data processed in critical data applications are growing in size, adding to the complexity.
Multi-core architectures, which can include several cores within a single chip, and several cores across different chips in one or more packages, and multi-processor architectures are becoming increasingly popular. In this disclosure, multi-core systems generally include multi-processor systems. These system can facilitate parallel execution of computational tasks, and may direct current and future system designs. Optimizing irregular computations on modern multi-core systems is therefore an important problem to address. Memory issues may highly affect high-performance software design and these issues are usually complicated by the multi-core trend. For performance reasons, some modern multi-core systems have a complex hierarchy of memory units exhibiting a Non-Uniform Memory Access (NUMA) model in which there is a great diversity in latency across different memory units, which may include off-chip DRAM and several levels of on-chip and/or off-chip caches. Typically in a multi-core system, there are significant distances between cores and each core has one or more local caches that are closer to the core and have lower latency than the remote caches. Further, cores may be placed in distinct CPU sockets and that may introduce additional disparity in latency. These factors can affect the mapping of a software application to the available hardware system using several cores and having a hierarchical memory subsystem for maximizing the performance of the software system, e.g., maximizing the speed of execution, minimizing energy and/or power consumption, etc.
Various researchers have studied memory issues with NUMA. Also, there has been significant amount of work on optimizing sparse matrix computations—most of the optimizations target performance improvements at various memory levels in memory hierarchy including an optimized data structure for storing the sparse matrix, exploiting block structures in sparse matrix, and blocking for reuse at the level of cache, TLB, and registers. These efforts, however, do not address NUMA issues arising in sparse “multi-dimensional array” (e.g., having more than two dimensions) or sparse tensor computations wherein the challenges are significantly different from those arising from two-dimensional sparse matrix computations.
NUMA-aware graph mining techniques have been presented for graph problems such as Betweenness Centrality detection. Tensor computations may be considered to be “hypergraph” or “multi-link graph” computations, similar to graph computations, but the techniques applied for optimizing graph computations generally do not handle multiple access patterns arising from the same input data for different “mode” computations, which are commonly involved in Tensor computations. Some hypergraph partitioning techniques are well suited for partitioning sparse matrix computations. However, they would likely introduce costly overhead for sparse tensor computations. Another technique described in a co-pending application (U.S. patent application Ser. No. 13/898,159, entitled “Efficient and Scalable Computations with Sparse Tensors”), which is incorporated herein by reference in its entirety, describes new sparse tensor storage formats for improving the performance of sparse tensor computations, which may be complimentary to the embodiments described herein. Those systems and methods can improve single-node performance. Techniques to handle computations with very large sparse tensors on a distributed memory environment have also been presented, but these techniques do not take into account modern distributed memory clusters that include clusters of multi-core and/or multi-processor systems.
Irregular computations are often associated with sparse tensors. A tensor is a multi-dimensional array and the order of a tensor is the number of dimensions of the array, also called modes of the tensor. Two popular and prominent tensor decompositions are CANDECOMP/PARAFAC (CP) and Tucker decompositions. CP decomposition and two widely used algorithms to determine CP tensor decomposition are described below.
CP Decomposition: The CP tensor factorization decomposes a tensor into a sum of component rank-one tensors (A N-way e.g., N-mode tensor is called a rank-one tensor if it can be expressed as an outer product of N vectors). The CP decomposition that factorizes an input tensor χ of size I1× . . . ×IN into R components (with factor matrices A(1) . . . A(N) and weight vector λ) is of the form:
  x  =            ∑              r        =        1            R        ⁢                  λ        r            ⁢                        a          r                      (            1            )                          ∘        …        ∘                  a          r                      (            N            )                              where ar(n) represents the rth column of the factor matrix A(n) of size In×R.
CP-ALS Algorithm: The widely used workhorse algorithm for CP decomposition is the alternating least squares (ALS) method. The CP-ALS algorithm is presented in Algorithm 1 depicted in FIG. 1.
CP-APR Algorithm: An Alternate Poisson Regression (APR) fitting algorithm for the non-negative CP tensor decomposition which is based on a majorization-minimization approach has also been developed. One embodiment of the CP-APR algorithm is presented in Algorithm 2, depicted in FIG. 2.
The main computations involved in the above algorithms include:                Tensor mode-n matricization or representing the N-dimensional tensor as a two dimensional matrix, χ→X(n)         Matrix transpose        Inverse of a diagonal matrix        Pseudo-inverse of a matrix, V†        Matrix matrix multiplication, C=A B        Matrix Khatri-Rao product, C=A⊙B        Matrix element-wise division, C=AØB        Matrix element-wise product (Hadamard product), C=A*B        
When the input tensor is dense, the above-mentioned computations are all dense and regular. The R-Stream™ compiler has been well established as an effective source-to-source high level optimizing compiler to parallelize and optimize such dense regular computations. However when the input tensor is sparse, the computations involving (and affected by the) sparsity of the input tensor are also generally sparse and irregular. Examples of sparse computations include but are not limited to:                ‘Sparse’ matricized tensor times Khatri-Rao product (mttkrp) computation in CP-ALS algorithm, represented as:X(n)(A(N)⊙ . . . ⊙A(n+1)⊙A(n−1)⊙ . . . A(1))        Π computation in CP-APR algorithm, represented as:Π=(A(N)⊙ . . . ⊙A(n+1)⊙A(n−1)⊙ . . . A(1))T         Φ computation in CP-APR algorithm, represented as:Φ=(X(n)Ø(BΠ))ΠT         
In these computations, it can be assumed that the N-mode input tensor is stored in co-ordinate format, in which each non-zero element is represented by a vector of N indices and the non-zero value. A value is designated to be zero is the magnitude of the value is actually zero and/or if the magnitude of the value is less than a specified threshold. The threshold can be provided as a percentage of a maximum value. The values that are not designated as zero values are non-zero values. Let nnz be the number of non-zeroes in the input tensor. Let IND be a matrix of size nnz×N, representing the indices of all non-zero elements. Let VAL be a nnz×1 vector of non-zero values of the input tensor.
Computation of Π in CP-APR: In the case of a dense input tensor, Π would be a dense matrix of size I1 . . . In−1 In+1 . . . IN×R. However, in the sparse tensor case, since only the elements that correspond to the non-zero indices of the input tensor are used further, the size of Π needs to be at most nnz×R. The sparse computation of Π for a mode n is depicted in FIG. 3A. It is important to note that the computation of Π is generally different for each mode of the tensor. The computation may involve different arrays and/or different ‘rows’ of arrays depending on the non-zero indices of the sparse tensor.
Computation of Φ in CP-APR: The sparse computation of Φ for a mode n is depicted in FIG. 3B. Similar to the computation of Π, the computation of Φ can also be different for each mode of the tensor. This computation may involve different output arrays and/or a different pattern of accessing the non-zeroes of the input sparse tensor.
Computation of sparse mttkrp in CP-ALS}: The sparse computation of mttkrp for a mode n is depicted in FIG. 4. Like the previous computations, the computation of mttkrp can also be different for each mode of the tensor and may involve different output arrays, different input arrays, and/or a different pattern of accessing the non-zeroes of the input sparse tensor.
Improving the parallel performance of sparse tensor computations often encounters the following challenges. First, the amount of parallelism is usually dependent on the non-zero pattern of the input sparse tensor data that may become known only at run time. Second, tensor computations generally have “mode-specific operations,” i.e., operations that access data with respect to the orientation of a mode or dimension of the multi-dimensional data. The presence of mode-specific operations can make optimizations harder as the parallelism and data locality characteristics for computations along one mode may differ from those of another mode for the same input data. Third, iterative repetition of mode-specific operations in sparse tensor codes can make the process of applying optimizations specific to a particular mode ineffective, because the liveness of the optimizations tends to become limited.
Some known techniques for sparse computations are often not effective and are not directly applicable for sparse tensor computations. For example, data reordering is a known technique for improving cache locality in irregular codes with sparse computations. While this technique is still useful for sparse tensor computations, reordering the input data once is not sufficient as computations along each mode require a different effective reordering and the reordering is complicated by the fact, as mentioned above, that there is usually an iterative repetition of mode-specific operations.