An elusive goal of reservoir simulation has been the ability to efficiently utilize massively parallel processing. The recent advent of the heterogeneous processor with Central Processing Units (“CPU”) tightly coupled to Graphics Processing Units (“GPU”) has complicated this goal. An additional complication is that the next generation reservoir simulator must be able to easily handle unstructured grids such that complex geologies and well configurations can be accurately modeled.
Although existing simulators can handle unstructured grids, further advances are required to fully and efficiently utilize the most current computer hardware. One problem, for example, plaguing unstructured simulators is the inability to adequately distribute (balance) the computational load (data) among thousands of computational engines without totally disrupting the solution algorithm. Without a proper load-balancing strategy, efficient massively parallel processing on multiple GPUs, also referred to as a GPU cluster, becomes impossible.
Domain decomposition techniques are a natural way to efficiently use parallel computers with attached general purpose GPUs. Domain decomposition techniques divide a computation into a local part, which may be done without any inter-processor communication, and a part that involves communication between neighboring and distant processors. In addition to achieving computational load balance, decompositions based on the Fiedler Vector, like the decomposition illustrated in FIG. 1, have been shown to exhibit characteristics on parallel computers such as a GPU cluster that are well-behaved for linear solvers. Unfortunately, however, the computational load for reservoir simulation is dynamic, especially in situations involving geomechanical compaction and multi-component near-critical flash calculations. This issue is addressed by the dynamic load-balancing technique described in the thesis entitled “Dynamic Load Balancing Using a Transportation Algorithm” by Tongyuan Song and John Killough (“Song and Killough”). This load-balancing strategy can efficiently utilize the heterogeneous GPU cluster by allocating repetitive imbalanced calculations to GPU's while the CPU is performing other tasks.
Unfortunately, the very act of attempting to achieve load balance often destroys the ability of conventional unstructured solvers to converge efficiently. Even multi-colored unstructured solvers such as those based on incomplete LU factorization (“ILU”) or algebraic multigrid (“AMG”) cannot adequately cope with the poorly structured matrix, which results from a decomposition with thousands of domains. Although AMG may continue to converge well, the message-passing overhead associated with a large decomposition significantly reduced its parallel efficiency. To overcome this, an algorithm that takes advantage of the heterogeneous GPU cluster is required.
One particular algorithm that takes advantage of the heterogeneous GPU cluster is sometimes referred to as multi-grid semi-coarsening. The results of this algorithm are illustrated in comparing FIG. 2A (fine domains) and FIG. 2B (coarsened grid). A particular variant of this algorithm, which uses an intelligently coarsened grid to accelerate local Crout version ILU (“ILUC”) solves is described in the article entitled “Simulation of Compositional Reservoir Phenomena on a Distributed Memory Parallel Computer” by J. E. Killough and Rao Bhogeswara. For intelligent coarsening, one of the recent layer-lumping and/or single phase areal aggregation algorithms described in the article entitled “A New Efficient Averaging Technique for Scaleup of Multimillion-Cell Geologic Models” by D. Li and D. Beckner may be used. In this manner, the amount of overhead is drastically reduced while achieving a solution whose computational work for convergence does not depend on the number of domains. With the acceleration of the ILUC solves using the heterogeneous GPU cluster, overall GPU cluster performance is substantially reduced. This technique has shown to work well in Intel's Spike linear solver, for example. Solutions for complex wells such as Extreme Reservoir Contact (ERC) wells with many lateral branches are a natural consequence of this development. That is, the additional semi-coarsening step simply enhances the ability of the linear solver to properly account for the complications added to properly solve for the pressure and saturation distributions of such wells. The heterogeneous computational environment also adds to the ability for the simulator to tackle the surface network solution on the conventional computational engines while the subsurface is being primarily solved on the heterogeneous GPU cluster.
The algorithms discussed above by their very nature allow rapid software development on a GPU cluster because the computational load for each GPU is limited to specific small kernels, which can be easily coded with the existing languages available for efficient parallel GPU software. With the combination of the existing message-passing interface (“MPI”) based higher-level language for the cluster, the software implementation for the overall massively parallel simulator becomes readily achievable without having to wait for additional software tools to be developed. Previous implementations of these algorithms, however, required the use of the host CPU or processor and thus, reduced overall efficiency of the algorithms.