1. Field
The present disclosure relates to the calculation of non-associative operations. In particular, it relates to methods and devices to perform the calculation in parallel for floating-point arithmetic, for example IEEE floating-point arithmetic.
2. Related Art
Scientific computing applications rely upon floating-point arithmetic for numerical calculations. For portability, almost all applications use the industry-standard floating-point representation IEEE-754 [1], which provides uniform semantics for operations across a wide range of machine implementations. Each IEEE floating-point number has a finite-precision mantissa and a finite-range exponent specified by the standard, and the standard defines correct behavior for all operations and necessary rounding. The finite precision and range of floating-point numbers in the IEEE format require rounding in the intermediate stages of long arithmetic calculations. Because each intermediate result is rounded, the final result depends on the order in which the operations are grouped, which makes floating-point arithmetic non-associative.
As an illustrative example, FIGS. 1A and 1B show a case where associativity does not hold. If associativity held, one could perform the calculation either sequentially (FIG. 1A) or using a balanced reduce tree (FIG. 1B) and obtain the same result.
However, as FIGS. 1A and 1B show, the two different associations yield different results. For portability and proper adherence to the IEEE floating-point standard, if the program specifies the sequential order, the highly parallel, balanced “reduce” tree implementation would be noncompliant; it would produce incorrect results for some sets of floating-point values.
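The effect can be reproduced with a short sketch. The input values below are chosen for illustration and are not taken from FIGS. 1A and 1B; the function names `sequential_sum` and `tree_sum` are likewise hypothetical. The sketch assumes IEEE double precision with the default round-to-nearest mode, which is what Python floats provide on common platforms.

```python
def sequential_sum(values):
    """Left-to-right accumulation: ((v0 + v1) + v2) + v3 ..."""
    total = values[0]
    for v in values[1:]:
        total = total + v
    return total


def tree_sum(values):
    """Balanced reduce tree: sum each half independently, then combine."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return tree_sum(values[:mid]) + tree_sum(values[mid:])


# Illustrative inputs (an assumption, not the values of the figures).
# At magnitude 1e17 the spacing between adjacent doubles is 16, so adding
# 1.0 to 1e17 or to -1e17 is rounded away.
values = [1e17, 1.0, -1e17, 1.0]

seq = sequential_sum(values)   # ((1e17 + 1.0) + -1e17) + 1.0
tree = tree_sum(values)        # (1e17 + 1.0) + (-1e17 + 1.0)

print("sequential:", seq)
print("tree:      ", tree)
```

In the sequential order the large terms cancel before the final 1.0 is added, so that 1.0 survives. In the tree order both 1.0 terms are absorbed by the large terms before the cancellation, so neither survives.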
Consequently, portable floating-point computations must always be performed strictly in the order specified by the sequential evaluation semantics of the programming language. This makes it impossible to parallelize most floating-point operations without violating the standard IEEE floating-point semantics. This restriction is particularly troublesome because the pipeline depths of high-performance floating-point arithmetic units are tens of cycles. Common operations such as floating-point accumulation therefore cannot take advantage of the pipelining and end up limited by the latency of the floating-point pipeline rather than by its throughput.
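The gap between latency-bound and throughput-bound accumulation can be estimated with a simplified cycle-count model. The model is an assumption made for illustration and is not taken from the disclosure: it supposes a single adder with a pipeline depth of `depth` cycles that can accept one new addition per cycle, and a tree schedule that finishes each level before starting the next. The function names and the chosen depth of 10 cycles are hypothetical.

```python
def sequential_cycles(n, depth):
    """Cycles to add n values in strict sequential order.

    Each of the n - 1 additions depends on the previous result, so every
    addition waits for the full pipeline latency.
    """
    return (n - 1) * depth


def tree_cycles(n, depth):
    """Cycles to add n values with a balanced tree on one pipelined adder.

    Additions within a level are independent and issue one per cycle;
    the level completes when its last addition leaves the pipeline.
    """
    cycles = 0
    remaining = n
    while remaining > 1:
        adds = remaining // 2            # independent additions in this level
        cycles += (adds - 1) + depth     # issue back to back, then drain
        remaining -= adds                # each addition merges two values
    return cycles


n, depth = 1024, 10                      # assumed sizes for illustration
print("sequential:", sequential_cycles(n, depth), "cycles")
print("tree:      ", tree_cycles(n, depth), "cycles")
```

Under these assumptions the sequential order costs roughly `depth` cycles per value, while the tree order approaches one cycle per value. The tree order, however, is exactly the reassociation that changes the rounded result.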
For example, Conjugate Gradient (CG) is a scientific computing application whose parallelism can be severely limited by sequential accumulation. CG is a popular iterative numerical technique for solving a sparse, linear system of equations represented by A×x=b, where A is a square n×n matrix and x and b are vectors of length n. Sparse Matrix-Vector Multiply (SMVM) is the dominant computation kernel in CG. In SMVM, one computes dot products between the rows of A and x, which effectively requires one to sum the products of the non-zero matrix values with their corresponding vector entries in x. For sparse graphs, the number of non-zero entries per row can be unbalanced, with average rows requiring sums of only 50-100 products, and exceptional rows requiring much larger sums. If each dot product sum must be sequentialized, the size of the largest row can severely limit the parallelism in the algorithm and prevent good load balancing of the dot products. In addition to these dot-product sums, a typical CG iteration requires a few global summations with length equal to the size of the vectors, n. For large numerical calculations, n can easily be 10^4, 10^5 or larger; if these summations must be serialized, they can become a major performance bottleneck in the task, limiting the benefits of parallelism.
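The row-wise accumulation in SMVM can be sketched as follows. The compressed sparse row (CSR) layout, the function name `smvm_csr`, and the small 3×3 matrix are assumptions made for illustration; the disclosure does not specify a storage format. Each row's inner loop is a sequential chain of additions whose length equals the number of non-zero entries in that row.

```python
def smvm_csr(row_ptr, col_idx, vals, x):
    """Sparse matrix-vector multiply y = A * x with A stored in CSR form.

    row_ptr[r]:row_ptr[r + 1] delimits the non-zero entries of row r.
    The products of each row are accumulated strictly left to right.
    """
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc = acc + vals[k] * x[col_idx[k]]   # depends on previous acc
        y.append(acc)
    return y


# Illustrative matrix (an assumption):
#     A = [[2, 0, 1],
#          [0, 3, 0],
#          [4, 5, 6]]
row_ptr = [0, 2, 3, 6]
col_idx = [0, 2, 1, 0, 1, 2]
vals = [2.0, 1.0, 3.0, 4.0, 5.0, 6.0]
x = [1.0, 2.0, 3.0]

y = smvm_csr(row_ptr, col_idx, vals, x)

# Number of products summed per row; the longest row bounds the length of
# the longest sequential addition chain.
row_lengths = [row_ptr[r + 1] - row_ptr[r] for r in range(len(row_ptr) - 1)]

print("y =", y)
print("products per row =", row_lengths)
```

The rows can be processed independently of one another, but within a row the additions form a dependency chain. When row lengths are unbalanced, the longest chain determines the completion time regardless of how many rows are processed in parallel.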
In view of the above, there is a need for a method that would allow parallelizing most floating-point operations without violating the standard IEEE floating-point semantics.