In many areas of science and engineering, complex problems, such as physical problems, are solved using mathematical models which are discretized over space. Various methods can be used to convert a continuous mathematical model into a discretized form, the most common being the Finite Difference, Finite Element and Finite Volume methods.
The finite volume method is commonly used in Computational Fluid Dynamics applications, and in related fields such as Oil Reservoir Simulation. This method splits a problem domain into grid blocks (cells) by superimposing a mesh or grid of some type. Fluid conservation equations are then constructed by considering the accumulation of fluids within a cell, and flow across the cell faces to neighbouring cells, or to external sinks or sources, such as oil wells. Mass accumulation and flow rates depend on fluid properties such as the pressure, temperature and composition in each grid block and its neighbours. For time varying problems, the model is advanced through time by finding the values of the fluid properties that simultaneously satisfy the conservation equations in every cell for each discrete time-step.
In general these equations are non-linear, and Newton's method (or similar) is used to reduce the problem to the repeated solution of simultaneous linear equations linking all the grid blocks.
Similar considerations apply to other space discretization methods: for example, a finite difference method with a seven-point stencil may be approached using solution techniques which are the same as those described here.
For small problems, direct solution methods, such as Gaussian elimination or Cholesky decomposition can be used to solve the linear equations. However, direct methods become prohibitively expensive in terms of both storage and computing time for larger problems, and iterative methods are the norm for this type of problem.
Many common methods for iterative solution of the linear equations depend critically on the use of a “preconditioner”—a fast procedure for obtaining a good approximate solution to the linear equations. Because the equations arise from a spatially discretized model, they have some distinctive features which can be exploited to allow specialized preconditioners, which are not available for general sparse matrices, to be used.
For example, the NF (Nested Factorization) algorithm is a preconditioner which is specifically designed to approximate the matrices which arise from space discretization on topologically rectangular meshes.
The method described here has some features in common with NF, but, unlike NF, is ideally suited for execution on massively parallel computer architectures, such as the Graphics Processing Unit (GPU) co-processing systems currently manufactured by companies such as NVIDIA and AMD. Like NF, the method can be used on topologically rectangular meshes, but, in addition, it is suitable for use in more complex cases, including prismatic grids, which are unstructured in 2 dimensions, but with a regular structure in the third.
One practical application for these techniques is in the area of oil reservoir simulation. Oil and gas are produced from porous underground rock formations containing both water and hydrocarbons. Fluids, including oil and gas, are extracted through wells, and wells are also used to inject fluids, such as water, steam, carbon dioxide or chemicals with a view to improving overall production—for example by pushing oil towards a well, or making sticky oil more mobile.
Engineers use oil reservoir simulators to understand what is happening in the reservoir and to explore alternative strategies for optimizing outcomes. The results of a simulation may determine when and where new wells are drilled, and how they are controlled.
Modelling of a reservoir typically proceeds through two phases—history matching (see FIG. 1) and prediction (see FIG. 2). In the history matching phase, the past production of a field and its wells is repeatedly modelled with variations to the geological model designed to improve the match between historical fact and simulation. Once an adequate match is obtained, prediction runs can be started. These runs explore the consequences of alternative operating plans, often extending for several decades into the future. After the chosen plan is put into operation, the model will be re-run from time to time to tune the match, and refine the predictions.
The operation of a typical reservoir simulator is summarized in FIG. 3, and described below:                1. The first step is to gather and/or read data describing the reservoir model defined by the reservoir engineer. Typically this comprises:                    Geological data from numerous sources, including seismic analysis, rock cores and well log analysis. The rock porosity and directional permeabilities are key variables, and often vary greatly across the reservoir. The location and characteristics of geological faults must also be specified.            Details of the computational grid; fine grids may give better and more detailed results, but increase computation time.            Fluid properties, such as viscosity, density and phase transition information. Relative permeabilities are used to characterize the mobility of different phases when multiple phases are present. Fluid properties also vary spatially.            Sufficient information to determine the initial state of the main solution variables; these variables will include pressure, and probably the saturations of oil, water and gas. Other variables may represent the hydrocarbon composition and temperature. There may be from two up to 20 or more independent variables in each grid block. The simulator will model the changes to these variables for every grid block through a series of discrete time-steps. This solution is an n-vector, referred to here as x. n is usually the number of grid blocks multiplied by the number of solution variables per grid block.            Additional data specifying detailed control and reporting options.                        2. Next the engineer's proposed schedule of operation is read in. This consists principally of the location and characteristics of all production and injection wells, and details of how they will be operated. For example, a well may be opened months or years after the start of the simulation, and its production or injection rates determined partly by conditions in the reservoir or constraints of surface facilities.        3. If the simulation through the schedule of operations has been completed, the simulator produces some final summary data and halts.        4. Select a time-step by which to advance. This may be anything from a few seconds to a year or more, depending on what is required by the schedule, and on numerical considerations.        5. Make an estimate (dx) of the change in the solution variables between the start and end of the time-step. dx, like x, is an n-vector.        6. Begin a second loop to solve the non-linear equations which describe mass conservation in each grid block.        7. Calculate the residuals—which reflect the degree to which material is not conserved for each grid block. For a given grid block, the residuals are found by calculating the amount of all fluids in the cell at the beginning and end of the time step. This difference must be made up by flows from neighbouring grid blocks and wells. Both quantities (fluid in place and flow in or out) depend on the solution variables (pressure, composition, temperature etc.) for the cell and its neighbours (x and x+dx).        8. Determine whether the residuals are sufficiently close to zero to indicate that local and global material balance has been achieved, and the non-linear iteration has converged.        9. If the iteration has converged, update the solution variables x=x+dx and return to step 3.        10. If the iteration has not converged, but the maximum permitted number of iterations has been reached, the current iteration is abandoned, and a solution is instead sought to a simpler problem. A shorter time step is selected, and processing returns to step 5.        11. If the non-linear equations have not converged, the simulator assembles a matrix (A) of derivatives using the current best estimate of the solution (x+dx). Each matrix coefficient represents the rate of variation of a residual (local material balance error) with respect to a solution variable. The resulting matrix equationA·ddx=r (where A is an n×n matrix, ddx is an n-vector of changes to the solution variables, and r is an n-vector of residuals) is a linearized version of the non-linear equations. The solution procedure is just an n-dimensional matrix form of Newton's method for finding the zeros of a one-dimensional function.        12. Solve the linear equations formed in step 11. This is often the most time consuming step in the procedure.        13. Update the current estimate of the change to the solution variables during the time stepdx=dx+ddx         and return to step 7.        
Oil reservoir simulators are extremely demanding applications which constantly push the boundaries of what computers can do. Engineers have always demanded more sophisticated physics (more variables per grid block) and finer meshes (more grid blocks) both of which trends increase computational requirements. In addition, convergence of the linear and non-linear equations to solution is not a forgone conclusion. In other similarly demanding fields, such as seismic analysis and graphical rendering, the introduction of “massively parallel” GPU style coprocessors has resulted in a step change in compute capability. However oil reservoir simulators do not generally use GPUs because the linear solvers they currently use (step 12 above), often dominate the overall computation time, and cannot easily be adapted to run reliably on massively parallel architectures.
Reservoir simulators often use an iterative linear solver based on GMRES, or a similar Krylov subspace method, with a preconditioner to accelerate convergence. See, for example, Saad, Yousef (2003) Iterative Methods for Sparse Linear Systems, Second Edition, published by SIAM, ISBN 0-89871-534-2, which contains an excellent summary of these methods.
The preconditioner (B) is an approximation to A chosen such that B−1·x can be calculated quickly for an arbitrary vector x. The choice of preconditioner is critical: with a poor choice, the iteration may be very slow, or may not converge at all. In extreme cases the simulator will fail, even with the best known solution methods.
In many cases, two stage solvers, based on CPR (Constrained Pressure Residual—see Wallis, J. R., Kendall, R. P., and Little, T. E. (1985) “Constrained Residual Acceleration of Conjugate Residual Methods”, SPE13536, presented at the SPE Symposium on Reservoir Simulation, Houston, Tex., Feb. 10-13, 1985) or a similar method have been found to be effective. The first stage is an outer iteration using the full set of equations, including several variables, such as pressure, temperature, composition etc. for every grid block. These outer iterations are relatively slow and the number required for convergence can be reduced by adding a second stage, in which we iterate towards the solution of a reduced set of equations, often with only one variable, such as pressure, for every grid block. Typically, an adequate solution for the overall system might require a dozen outer iterations, each of which includes a few inner iterations of the reduced system.
FIGS. 4-6 show how a preconditioner fits into the operation of a Krylov subspace iterative solver such as GMRES. The Figures also show how a second stage is added to the basic single stage method.
FIG. 4 illustrates a setup phase for solving linear equations. For example, the steps may be performed according to the following method:                1. The first step is to define the ordering of the matrix variables. The choice of preconditioner may affect the importance of this step.        2. Next, the preconditioner is factored into block upper and lower factors, such that, B=L·U. All the elements of L and U are calculated. In most practical preconditioners, L and U retain most or all of the sparsity of A, and most non-zeros are the same as the corresponding terms in A. In contrast, an exact factorization of A would have many non-zeros. Once this is done, the preconditioner can be applied easily to arbitrary vectors during the iteration phase.        3. An initial estimate of the solution (x0) is generated, or supplied by the calling program. Often x0 is set to 0. The initial residual, r0, corresponding to x0 is calculated.        4. For single stage solvers, we skip steps 5 to 8 and proceed directly to step 9.        5. Using CPR, or similar, compute the scaling factors required to form the reduced equations for the inner iteration.        6. Form the reduced coefficient matrix, C, by combing elements of A using the CPR scaling factors.        7. Define the ordering of the matrix variables for the reduced equations (c.f. Step 1).        8. Factorize the preconditioner, D, for the reduced equations, into upper and lower factors (c.f. step 2).        9. Proceed to the start of the iteration at step 1 in FIG. 5.        
FIG. 5 illustrates an outer iteration for solving linear equations. The steps include the following:                1. The iteration counter for the outer iteration, m, is set to 1.        2. If this is a two stage solver then skip to step 4.        3. Use preconditioner to find an approximate solution to the equation A·qm=rm-1 qm=B−1·rm-1         and then skip to step 7. If B were exactly equal to A, qm would take us directly to the solution.        4. If this is a two stage solver, form the residual, r′0, for the reduced equations by combining elements from the full residual, rm-1, using the CPR scale factors calculated at step 5 in FIG. 4.        5. Iterate to an approximate solution, x′, of the reduced equations using the procedure in FIG. 6.        6. Use the preconditioner for the full equations, together with x′, the approximate solution for the reduced equations, to obtain an approximate solution to the full equations.qm=B−1·(rm-1·A·x′)+x′        7. Use the matrix multiply routine to form the vector zm zm=A·qm         8. Find the scalar values α1 . . . αm which minimize the norm of rm where        
      r    m    =            r      0        +                  ∑                  i          =          1                m            ⁢                        α          i                ·                  z          i                                    9. If the norm of rm is below the convergence target then skip to step 12.        10. If the iteration limit has been reached, then skip to step 12.        11. Increment the iteration counter, m and return to step 2.        12. Exit with the solution:        
      x    m    =            x      0        +                  ∑                  i          =          1                m            ⁢                        α          i                ·                  q          i                    
FIG. 6 illustrates an inner iteration for solving linear equations. The inner iteration has logic which is very similar to that in the outer iteration. The steps include the following:                1. The iteration counter for the inner iteration, m, is set to 1. This is distinct from the iteration counter for the outer iteration. The initial residual r′0 is passed from the outer iteration.        2. Use a preconditioner to find an approximate solution to the equation C·q′m-1:q′m=D−1·r′m-1         If D were exactly equal to C, q′m would take us directly to the solution of the reduced equations.        3. Use the matrix multiply routine to form the vector zm:zm=C·q′m         4. Find the scalar values α1 . . . αm which minimize the norm of r′m where        
      r    m    ′    =            r      0      ′        +                  ∑                  i          =          1                m            ⁢                        α          i                ·                  z          i                                    5. If r′m is below the convergence target then skip to step 8.        6. If the iteration limit has been reached, then skip to step 8.        7. If the iteration limit has not been reached, increment the iteration counter, m and then return to step 2.        8. Exit with the solution:        
      x    ′    =            ∑              i        =        1            m        ⁢                  α        i            ·              q        i        ′            
The reduced equations have a reduced set of solution variables, typically including just the pressure for each grid-block. In mapping this to the full solution space, other variables are typically set to 0.
The above discussion does not specify what preconditioner is used, and we now consider a particular example. The Nested Factorization (NF) Algorithm is a known preconditioning method which may be used to accelerate the convergence of the conjugate gradient, GMRES, OrthoMin and other similar algorithms for iterative solution of large sparse sets of linear equations. The algorithm has been described. See, for example, Appleyard J. R., Cheshire I. M., and Pollard R. K. (1981) “Special techniques for fully-implicit simulators.” In Proc. European Symposium on Enhanced Oil Recovery Bournemouth, UK, pages 395-408. See also Appleyard J. R. and Cheshire I. M. “Nested factorization.” In Seventh SPE Symposium on Reservoir Simulation, pages 315-324, 1983. paper number 12264. Further, see P. I. Crumpton, P. A. Fjerstad and J. Berge: ‘Parallel Computing Using a Commercial Reservoir Simulator’, paper presented at ECMOR VIII, 3-6 Sep., 2002. The original NF algorithm described in these references assumes that the matrix is formed by considering connections between neighbouring blocks in a simple topologically rectangular grid.
For an nx by ny by nz grid, the banded coefficient matrix A isA=d+u+l+v+m+w+n where:
d is the diagonal,
u and l are the bands immediately above and below the diagonal, which connect adjacent cells within a line,
v and m are the bands nx elements away from the diagonal, which connect adjacent lines within a plane,
w and n are the bands nx*ny elements away from the diagonal, which connect adjacent planes.
FIG. 7 shows an example of a 4×4×2 rectangular grid with NF ordering. The matrix for the grid has the structure shown in FIG. 8, based on an ordering of cells as shown in FIG. 7. This matrix can be regarded as recursively block tridiagonal. At the outermost level each block element represents a plane, or the coefficients connecting adjacent planes. Within each plane, the blocks represent lines, and coefficients connecting adjacent lines. At the innermost level, there are individual grid blocks and the connections between individual grid blocks within lines.
NF exploits this recursive structure by defining the preconditioner, B, recursively:B=(I+n·P−1)·(P+w)(B is block tridiagonal—block elements are planes)where I is the unit matrixand P=(I+m·T−1)·(T+v) (P is block tridiagonal—block elements are lines)and T=d+l+u-approx(ε) (T is tridiagonal)and ε=m·T−1·v+n·P−1·w (the uncorrected error matrix)approx(ε) is a matrix which is computed to approximate ε. It is usually diagonal, but could be tridiagonal, or even more complex. Common approximations are:diag(ε) the diagonal elements of εrowsum(ε) the diagonal matrix formed by summing the elements of ε in rowscolsum(ε) The diagonal matrix formed by summing the elements of ε in columns.zero(ε) a zero matrix.
Any of these approximations could be modified by a relaxation factor as described in Kumar, P., Grigori, L., and Niu, Q. (2009) “Combinative preconditioning based on Relaxed Nested Factorization and Tangential Filtering preconditioner” INRIA Rapport de recherche, ISSN 0249-6399.
After expansion, the following is obtained:B=A+ε−approx(ε)
In summary, B is a partial L·U factorization, but unlike most other similar methods, the factors L and U are block (rather than strictly) lower and upper triangular. For the example 4×4×2 finite difference matrix, (I+n·P−1) and (P+w) are 2×2 matrices:
where Pa and Pb are themselves matrices which connect cells within the 2 planes, and n, w are diagonal matrices containing elements from A. To compute B−1·x, where x is a vector, first sweep forward through the planes to obtain (I+n·P−1)−1·x, and then back to apply (P+w)−1. The problem is thus reduced to solving P−1·x for each plane, and from three dimensions to two.
For a suitable choice of P (P=A−n−w−n−P−1·w), this factorization of A is exact. However an exact factorization is not practical for large matrices, and instead, each block element of P is approximated in the same way—with an incomplete block L·U factorization.Pa=(I+m·T−1)·(T+v)
For the example matrix, (I+m·T−1) and (T+v) are 4×4 matrices:
where Ta, Tb, Tc and Td are themselves 4×4 matrices which connect cells within a line, and m, v are diagonal matrices containing elements from A. The evaluation of Pa−1·x proceeds in the same way as that of B−1·x, and the problem is further reduced from 2 dimensions to 1. The solution of the one-dimensional problems, which are of the form Ta−1·x (where Ta is tridiagonal) is then a simple matter which terminates the recursion. There are three levels of recursion because the computational grid occupies a three-dimensional space.
In fact, the NF preconditioner can be shown to be equivalent to a strictly triangular factorization, but with L and U factors that include a good subset of the fill-in bands produced by a Cholesky factorization. However the computational work and storage required by NF is much less than would be required to calculate the fill-ins. This explains why NF performs better than ILU methods.
The NF algorithm was designed around rectangular grid geometries, and the innermost loops involve the solution of tridiagonal matrices representing lines of grid blocks. In practice, NF has been extended from this simple case in a number of ways, for example:                Extension to coupled “multi-phase” problems with several independent variables (e.g. pressure, temperature and composition) in each grid block. In this case individual matrix elements are small (e.g. 4×4) dense matrices rather than scalars, but the algorithm is otherwise unchanged.        Accommodation of “non-neighbour” connections that can be used to introduce some more complex geometries, such as local grid refinements.        Accommodation of “nine-point” finite difference terms (non-zero terms connecting diagonally adjacent grid blocks).        Extra levels of recursion to deal with some classes of problem which can be thought of as four or five dimensional (multiple porosity models).        Inclusion of higher order error correction terms on the/and u terms in the tridiagonal.        A method has been described for applying NF to 2D unstructured meshes and 3D prismatic meshes. See Kuznetsova, N. N. et al, (2007) “The Family of Nested Factorizations”, Russ J. Numer. Anal. Math. Modelling, Vol. 22, No. 4 pp 393-412. The method retains the assumption of tridiagonal inner blocks.        US Patent Application Publication 2009/0138245 filed by John Appleyard, which is incorporated herein in its entirety, extends the method by allowing for unstructured grids, and for additional options when dealing with structured grids.        
In addition, there have been several implementations of versions of NF which are adapted to run on parallel machines. These include:                Versions designed to work on “cluster” computers using domain decomposition (see e.g. P. I. Crumpton, P. A. Fjerstad and J. Berge: ‘Parallel Computing Using a Commercial Reservoir Simulator’, paper presented at ECMOR VIII, 3-6 Sep., 2002), with periodic exchange of data between the boundary regions of adjacent domains. Multi-color orderings may be used to allow communication and computation to be overlaid. These methods can work very well for small clusters—up to a few dozen processors—but may not scale well beyond that.        Twisted factorization may be used to allow the work in each level of recursion to be split into 2 parts which can be executed in parallel. For a 3D problem, with 3 levels of recursion, this means that the work can be shared between 8 processors. This method is described in H. van der Vorst's “Large tridiagonal and block tridiagonal linear systems on vector and parallel computers” published in Parallel Computing Vol. 5, 1987, pp 45-54 (hereafter “van der Vorst”).        
Neither of these techniques is appropriate for implementation on massively parallel architectures, such as those of modern GPUs, which require hundreds, or, preferably, thousands of threads to be active simultaneously.