1. Field of Invention
In general, the invention relates a method of, and apparatus for, processing a computation on a computing device. More particularly, the invention relates to a method of, and apparatus for, mitigating memory bandwidth limitations when performing numerical calculations.
2. Background of Technology
Computer systems are often used to perform complex numerical calculations. One important area in which they are used is in the modelling of real-world conditions and environments; for example, the modelling of: weather systems, fluid propagating through a medium, the motion of stars in a galaxy, an explosive device, or seismic wave propagation.
Often, computer programs are run which create in the computer memory a multi-dimensional “mesh” of the area or region to be modelled. Commonly, techniques such as finite element analysis (FEM) and finite difference modelling (FDM) are used to model real-world environments.
Many modelling systems require linear systems to be solved. Linear systems comprise a set of linear equations involving the same set of variables. For example, equation 1) below outlines two simple linear equations:x+y=22x−y=1  1)
The solution to which is x=1 and y=1.
Generally, such linear systems can be represented using a matrix and two vectors, with the matrix hold the coefficients of the equations, one vector holding the unknowns (in this case, x and y), and the other vector holding the right hand side of the equations. For example, the previous example in equation 1) would be represented as shown in equation 2) below:
                                          (                                                            1                                                  1                                                                              2                                                                      -                    1                                                                        )                    ⁢                      (                                                            x                                                                              y                                                      )                          =                  (                                                    2                                                                    1                                              )                                    2        )            
This is commonly described using the form shown below in equation 3):Ax=b  3)
Where A is the matrix, x is a vector representing the unknowns, and b is the desired result vector. This system can now be solved using a solver to determine the value of the vector x (the unknown to be calculated) using a variety of algorithms.
A common type of linear system is one where the matrix is sparse (known as a sparse linear system). In a sparse matrix, most of the elements in the matrix are zero. For example, equation 4) illustrates a typical small sparse matrix:
                    (                                            1                                      0                                      0                                                      -                1                                                                        0                                      2                                      0                                      0                                                                          -                2                                                    0                                      3                                      0                                                          0                                      0                                      0                                      4                                      )                            4        )            
Sparse linear systems often arise from the numerical solution of differential equations such as those which arise during FEM and/or FDM approaches.
The two main classes of solver are direct and iterative solvers. Direct (or exact) solvers generally involve manipulating a matrix into a more easily solvable form such as, for example, a diagonal matrix. In contrast, iterative solvers generally start with an initial guess for the solution of x and then repeatedly perform a process to refine that guess towards an acceptable solution.
Iterative solvers repeatedly refine a “guess” x* for the solution, then perform a matrix-vector multiplication Ax* to compute the result b*. The residual between b* and the desired result b indicates how close the current “guess” is to the true solution x. Once the residual b*-b reaches some required tolerance value, the process can terminate.
One approach to the iterative solving of linear systems is the conjugate gradient method. The conjugate gradient method is an algorithm for the numerical solution of linear systems whose matrix is symmetric and positive definite.
The conjugate gradient method is known for solving large sparse systems, as it is much more efficient than most other solvers for this type of system. Sparse systems often arise when numerically solving partial differential equations (arising from FEM, for example).
In order to determine an appropriate estimated error (which is done to provide an indication of how close the calculation is to the actual solved value of x), an initial guess for the actual solution x* is made. This is x0. If no good initial guess is readily available x0=0 may be used. Starting with x0 for the solution, in each iteration a metric is required to determine whether the iteration is approaching the (currently unknown) solution x*. This metric comes from the fact that the solution x* is also the unique minimizer of a quadratic function.
In the conjugate gradient method, the “pseudo code” (i.e. the computer program flow) can now be written as:
r0 = b − Ax0p0 = r0λ0 = r0· r0for i = 1:n ci = Api βi = pi · ci αi = λi−1 / βi5) xi = xi−1 + αipi ri = ri−1 − αici λi = ri−1· ri−1 pi = ri−1 + λi / λi−1pi−1 if err(ri−1) < to/ then exit loopend
where r is the residual vector. p and λ are functions of r.
The disadvantage for iterative solvers is that although each iteration requires relatively little computational resource, generally at least one matrix vector multiplication is required per iteration. For large systems, many iterations are typically required (typically hundreds to tens of thousands) to converge to a usable solution. Consequently, acceleration of this computation can lead to significant reductions in the time required to converge to an acceptable solution.
One approach to increase the speed of a computer system for specialist computing applications is to use additional or specialist hardware accelerators. These hardware accelerators increase the computing power available and concomitantly reduce the time required to perform the calculations.
A suitable system for performing iterative calculations is a stream processing accelerator having a dedicated local memory. The accelerator may be, for example, located on an add-in card which is connected to the computer via a bus such as Peripheral Component Interconnect Express (PCI-E).
The bulk of the numerical calculations can then be handled by the specialised accelerator. Stream processor accelerators can be implemented using, for example, Field-Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs) and/or structured ASICs. Stream processors implemented as FPGAs generally provide much more computational power than a CPU and so are able to perform calculations more quickly than a CPU. In certain cases, such arrangement may increase the performance of highly parallel applications by over an order of magnitude or more.
For some computational applications, the size of the matrix used may be very large and often high precision matrices and vectors are required to obtain reasonable accuracy in the final solution. In some examples, a matrix may comprise 2×107 columns and 2×107 rows, leading to up to 4×1014 high precision data values which need to be inputted from/outputted to the accelerator local memory. For sparse matrices, many values are zero and do not need to be stored explicitly. However large sparse matrices may still consist of many gigabytes of data.
As a result, the volume of data required to be processed during a step of a calculation may be very large. Therefore, for data-intensive calculations, memory bandwidth may become the bottleneck in a step of an iterative calculation. This is particularly true for stream processors, where the upper limit on the speed of a linear system solver algorithm may be imposed by the bandwidth limitations between the stream processor and the local memory, rather than the speed at which the stream processor can perform calculations.
One approach to alleviate the above issue is to compress the data. In general, compression refers to methods which reduce the physical size of the data set. A number of different compression schemes are available. One scheme may be to reduce the precision of data values in a data set from double precision (which gives around 15 decimal places of precision) to single precision (which gives about 7 decimal digits of precision). Other compression schemes may include block coding, transform based methods such as wavelet or discrete cosine and predictive methods. These may be optionally combined with run-length encoding, Huffman coding or other forms of entropy coding.
However, in many cases, compressed data must be decompressed before it can be used in a numerical calculation. Therefore, in certain circumstances and for some processor architectures (such as, for example, a CPU architecture), the use of compressed data may actually lead to an increased computation time because of the need to perform a decompression operation. In other words, the computational cost of decompressing the compressed data on the CPU would outweigh any benefits of the reduced data volume.
In contrast, the highly parallel, pipelined, wide bus nature of an FPGA/stream processor accelerator can be used to perform high speed decompression of data with little or no speed penalty. Consequently, compression schemes to reduce the data volume in an iterative calculation are practicable for stream processor architectures and can lead to significant improvements in the speed at which calculations can be performed.
However, scientific data is typically compressed with “lossy” compression schemes which result in the compressed data being merely an approximation to the original data set. Therefore, a technical problem with the utilisation of compressed data is that iteratively solving a linear system using a compressed set of input data will lead to a solution to the linear system which is close to, but not the same as, the solution of the original uncompressed matrix. Therefore, the solution obtained using the compressed input data may only be considered approximate and may not be sufficiently accurate for many applications.
As an example, for an iterative solver to converge to within a given error tolerance, say within an error of 1×10−7 of the true solution, both the vectors and the matrix need to have at least enough precision to represent such a solution. Generally, double precision (which gives around 15 decimal places of precision) would be used when solving for error tolerances in the range 1×10−7 to 1×10−10 in software. Therefore, known compression systems may be unsuitable for such calculations.
It is known to utilise compressed data in computations. For example, US2010/0030932 relates to a method of obviating a bottleneck caused by the relatively slow transfer of data between a host computer and a peripheral processing device, by using data compression and decompression. The peripheral device is configured to decompress the data, process it, and compress the output before returning the output of a given algorithm.
“Pipelined mixed precision algorithms on FPGAs for fast and accurate PDE solvers from low precision components”, Robert Strzodka and Dominik Göddeke, IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 2006), pages 259-268, April 2006 relates to the application of mixed precision methods for the conjugate gradient method. Different levels of precision are used for different parts of the algorithm, depending on where extra precision will lead to increased accuracy of the final solution.
The algorithm is split into a computationally intensive inner loop that runs in low precision, and an outer loop running in higher precision. The arrangement described in this document is directed towards reducing the silicon space required on an FPGA for computation.
An alternative technique for handling matrices is the multigrid method. Multigrid methods are a class of methods that use a sequence of related matrices with different sizes. Various different size representations of the same problem are used during the solve computation exploiting the fact that smaller size matrices can be solved faster, both because fewer operations are required per iteration and also because fewer iterations are required to converge to a solution.
The most common method is to perform a sequence of so called V-cycles, where one or more iterations are performed with the largest matrix, then the next largest and so on until the smallest matrix has been utilised. Then the process works back up again performing iterations on matrices getting gradually larger until the original matrix is used once again. Many V-cycles may be required to reach the solution. However, these methods require several different matrices which have to be computed and processed, and so may be inefficient to process.
To date, known numerical computation methods and associated hardware have suffered from the technical problem that memory bandwidth is the limiting factor in the acceleration of iterative linear solvers which are able to solve a linear system to a high degree of accuracy.