Many types of physical processes, including fluid flow in a petroleum reservoir, are governed by partial differential equations. These partial differential equations, which can be very complex, are often solved using finite difference, finite volume, or finite element methods. All of these methods divide the physical model into units called gridblocks, cells, or elements. In each of these physical units the solution is given by one or more solution variables or unknowns. Associated with each physical unit is a set of equations governing the behavior of these unknowns, with the number of equations being equal to the number of unknowns. These equations also contain unknowns from neighboring physical units.
Thus, there is a structure to the equations, with the equations for a given physical unit containing unknowns from that physical unit and from its neighbors. This is most conveniently depicted using a combination of nodes and connections, where a node is depicted by a small circle and a connection is depicted by a line between two nodes. The equations at a node contain the unknowns at that node and at the neighboring nodes to which it is connected.
The equations at all nodes are assembled into a single matrix equation. Often the critical task in obtaining the desired solution to the partial differential equations is solving this matrix equation. One of the most effective ways to do this is through the use of incomplete LU factorization or ILU, in which the original matrix is approximately decomposed to the product of two matrices L and U. The matrices L and U are lower triangular and upper triangular and have similar nonzero structures as the lower and upper parts of the original matrix, respectively. With this decomposition, the solution is obtained iteratively by forward and backward substitutions.
There is an ongoing need to obtain better solution accuracy. One way to do this is to divide the physical model into smaller physical units, or in other words to use more nodes, perhaps millions of them. Of course, the time needed to perform the computations increases as this is done. One way to avoid this time increase is to perform the computations in parallel on multiple processors.
There are two types of parallel computers, those using shared memory and those using distributed memory. Shared memory computers use only a handful of processors, which limits the potential reduction in run time. Distributed memory computers using tens of processors are common, while some exist that use thousands of processors. It is desired to use distributed memory parallel processing.
When using distributed memory, the computations are parallelized by dividing the physical model into domains, with the number of domains being equal to the number of processors to be used simultaneously. Each domain is assigned to a particular processor, which performs the computations associated with that domain. Each domain contains a specified set of nodes, and each node is placed in a domain.
The entire modeling process involves many computations, nearly all of which are performed node by node. Some of these computations at a node require only information local to the node. Information is local to a node when it is contained completely within the same domain as the node. These computations are sometimes called embarrassingly parallel, since they require no special treatment to perform in parallel. Other computations require information at the node and its neighbors. If the node is on the boundary of its domain with another domain, one or more of its neighbors will reside in the other domain. To perform computations that require neighbor information at boundary nodes, information about these neighbors must be obtained from the domains in which these neighbors reside. If the information needed is known in advance, it can be obtained easily by “message passing,” and the computations are easily parallelized. It is important that the information be known in advance because message passing takes time. In particular, there is a large, compared to normal computational times, latency; in other words, it takes a finite time for the first element of a message to reach its recipient. If the information is known in advance, the message containing it can be sent before it is needed in the other process. In this manner, it can arrive at the other process before it is needed.
Unfortunately, in factorization computations, the information needed is not known in advance. Instead, it is generated during the factorization. The computations are “inherently sequential.” The general flow of the computations is as follows:                1. Update the current node's equations based on computations performed at its neighbors that have already been factored.        2. Factor the resulting modified equations at the current node.        3. Provide information about the current node's factorization to its neighbors that have not yet been factored.“Neighbors” need not be immediate neighbors. They can be several nodes away.        
The sequential nature of these calculations is not a problem if there is one domain. The sequential nature of these calculations is a problem if there is more than one domain. Information must be sent from one process to another. If this information is not known until immediately before it is needed at the other process, there will be a delay while the message containing it is sent. These delays can be avoided if the computations are ordered such that any information to be sent to another process is known well before it is needed at the process.
To consider this further, assume two domains. Each domain has interior nodes that communicate only with nodes in the same domain and boundary nodes that communicate with nodes in both domains. The processing could be performed in the following order:                1. Process interior nodes in domain 1.        2. Process boundary nodes in domain 1.        3. Send boundary node information from domain 1 to domain 2.        4. Process boundary nodes in domain 2.        5. Process interior nodes in domain 2.        
If this order is used, domain 2 cannot begin its processing until domain 1 is completely finished. There is no parallelization of the computations at all.
A better processing order is as follows:                1. In parallel, process interior nodes in domain 1 and interior nodes in domain 2.        2. Process boundary nodes in domain 1.        3. Send boundary node information from domain 1 to domain 2.        4. Process boundary nodes in domain 2.        
Using this order, the calculations on interior nodes are performed in parallel. This is significant since there are more interior nodes than boundary nodes. However, the boundary nodes are still processed sequentially. Typically 20%-40% of total nodes are boundary nodes.
There are quite a few parallel algorithms for ILU factorization. In the book “Iterative Methods for Sparse Linear Systems (first edition)” written by Yousef Saad, Society for Industrial and Applied Mathematics, 1996 (“Saad”), two algorithms are introduced. One of them is the multi-elimination algorithm described on p.p. 368-372, which takes advantage of independent sets existing in the sparse linear system. However, this approach may be unsuitable for a distributed data structure. A second algorithm described on p.p. 374-375 factors interior nodes simultaneously on each processor then processes the boundary nodes in some order. The drawback of this second algorithm is that some processors remain idle while waiting for data coming from other processors. A third algorithm described in Parallel Threshold-based ILU Factorization by George Karypis and Vipin Kumar, 1998, Technical Report #96-061, is similar to the second algorithm, except that it colors the boundary nodes and then factors the nodes of each color. A few colors, however, may be required. As more colors are required, more messages must be passed between processors, usually impairing the overall performance of the solver.
There is therefore, a need for an improved parallel ILU factorization algorithm that is suitable for distributed sparse linear systems and reduces processing time.