A system of linear equations is defined as
                                                                                                                 a                    11                                    ⁢                                      x                    1                                                  +                            ...                        ⁢                                                  +                                          a                                  1                  ⁢                  n                                            ⁢                              x                n                                              =                      b            1                                              …                                                                                                                      a                                          m                      ⁢                                                                                          ⁢                      1                                                        ⁢                                      x                    1                                                  +                            ...                        ⁢                                                  +                                          a                mn                            ⁢                              x                n                                              =                                    b              m                        .                              where (letting j=1, . . . , n and i=1, . . . , m) the unknowns xj are solved in terms of the numbers aij and bj. These equations may be more compactly written as the matrix equation Ax=b upon defining
      A    =          [                                                  a              11                                            …                                              a                              1                ⁢                n                                                                          …                                …                                …                                                              a                              m                ⁢                                                                  ⁢                1                                                          …                                              a              mn                                          ]        ,      x    =          (                                                  x              1                                                            …                                                              x              n                                          )        ,      b    =                  (                                                            b                1                                                                        …                                                                          b                m                                                    )            .      
Note that the coefficient matrix A is m-by-n, x is an n-dimensional vector, and b is an m-dimensional vector. For purposes of illustration through the present application, the sizes of A, x and b will be fixed as they are here defined. No assumptions are made regarding the structure of the matrix; it may be dense, sparse, symmetric, etc.
In some cases one needs to solve Ax=b for multiple b using the same A. This is equivalent to solving the matrix equation AX=B, where X and B are now non-vector matrices. When there are p instances of b, they become the p columns in the m×p matrix B; X is n×p; and A is still m×n. Also, note that Ax=b may be thought of as a special case (p=1) of AX=B, and AX=B may be thought of as Ax=b repeated p times (with the same A).
The entries in A, b, and x may be real numbers, complex numbers, or a member of some other number field. In addition, some or all of the entries in these quantities may be left in variable form. Symbolic manipulation programs are able to compute with them in these cases. However, for purposes of clarity in the examples to follow it will be assumed that all numbers and variables belong to the complex number field.
An algorithm is also known as a method, a procedure, or a technique. It consists of a list of steps, conditions, manipulations, etc., that detail what should be done to take some input and produce the desired output. Following Sipser, the algorithms will be described at multiple levels (in this case, a “high” level and an “implementation” level). Also, it is pointed out that an algorithm can have seemingly unequivalent implementations, even though they represent the same computation.
Solving a system of linear equations is a relatively basic mathematical operation and so appears in applications across a multitude of quantitative fields. These fields include, but are not limited to: all branches of engineering (civil, mechanical, electrical, aeronautical, etc); applied areas of the sciences (physics, chemistry, biology, etc); applied and computational mathematics, control and estimation problems, medicine, economics, business, sociology, politics, and a number of technologies and industries, etc. Part of the reason for the ubiquitous presence of these linear algebra techniques is that they are used within other numerical techniques, which themselves have many applications. For example, they play a fundamental role in linear programming, Markov processes, the finite element method, control theory, etc. These applications will now be discussed.
Linear Programming
The field of linear programming (LP) is concerned with (linearly) optimizing allocations of limited resources to meet given objectives. It is used in the fields of operations research, microeconomics, transportation, business management, network flows, multi-commodity flows, resource management (in operations research), and other fields. Particular examples include food blending, inventory management, portfolio and finance management, resource allocation for human and machine resources, planning advertisement campaigns, etc. An LP problem may be formulated as finding x such that Ax=b, x≧0, and cTx is maximized (c is an n-dimensional vector). Constraints that are expressed as Ax≦b can be rewritten as Ax=b by appropriately increasing the dimension of x. An example application is an activity-analysis problem, which might involve a manufacturing company that converts a set of resources (such as raw material, labor, equipment) into a number of commodities; the company's goal is to produce that combination of commodities which will maximize its profit. Let A, b, and x be as defined earlier. In relation to the formulation given above: m is the number of resources; n is the number of commodities; aij is the amount of resource i required to produce one unit of commodity j; bi is the available amount of resource i; cj is the profit per unit of commodity j; and xj is the amount of the j-th commodity.
LP may be interpreted as a certain problem in game theory—a two person, zero-sum game with mixed strategies. In general, game theory relates to games of strategy, in which the participants have a conflict of interests; it is commonly applied to economic, social, political, and military situations. In this context, the participants are referred to as “players”; the first player here is named player-1. The probability player-1 will use strategy j is xj; A is now the payoff matrix. The expected payoff to player-1 if he uses strategy i is Σjaijxj, which is expected to be less than some number M. Also, player-1 tries to maximize M while the second player tries to minimize it.
Markov Chains
Markov chains (MCs) are a formulation of a system in which the states of the system and transitions between them are expressed probabilistically. MCs are formulated using the matrix equation Tx=x′, where x(x′) is the original (new) state vector and T the transition matrix. Thus xi is the probability to be in state i, and the probability to go from state j to i is the (i, j)-th entry in T. After many updates of x, the system may reach an equilibrium, which is characterized by Tx=x. This is equivalent to Ax=b upon defining A=T−I and b=0. In this case one is interested in the homogeneous solution xh. Also, one might in some cases be interested in the fundamental matrix, which involves the computation of an inverse (which is a by-product of some of the algorithms being introduced). Applications of this technique are widespread, including the fields of internet search engines, biological modeling, psychology and genetics, and a number of other fields. When applied to search engines, xi represents the probability the user (of the search engine) is at web site i. The probability that he will then go to web site j is represented by the (j, i)-th entry of T.
Finite Element Method
The finite element method (FEM) is used for finding numerically stable, approximate solutions of integral and partial differential equations. It is especially useful over complex domains. Common applications include structural engineering, the airflow around cars, the flow in oil pipelines, stresses in physical bodies, and magnetostatic configurations. After choosing a mesh and a parameterization (u) for the function of interest, one generically obtains a system of linear equations−Lu=Mf, and solves for u. For example, the vector u might refer to the nodes of a triangulated surface. The matrices L and M are referred to as the stiffness and mass matrices, respectively. The relation of this matrix equation to Ax=b is manifest.
Control Theory
Control theory deals with controlling a dynamic system and estimating its behavior, all in the presence of uncertainty. It has been applied to a great many fields, including engineering, science, economics, sociology, and politics. Some example applications include trajectory control for vehicles and robots, process control, decision making strategies for economic and social systems, certain stochastic decision systems in artificial intelligence, etc. A simplified model that nevertheless captures the spirit of some of these applications begins with the equation y=f(x), where the output vector y is a function (f) of the input vector x. After linearizing about a stable point, this takes the form y=Ax. The goal could be to find the input vector x such that a certain output vector y is followed. For example, given the target trajectory of a vehicle (y), determine the appropriate set of inputs (x). Mathematically, this involves the inversion of the matrix A, and if A is singular it involves the generalized inverse of A. This type of situation could be handled by some of the new algorithms introduced herein.
Other Applications
Other applications include those where sparse matrices arise. Specialized methods exist for solving these, but they at some point require the solution of Ax=b for dense A. Sparse matrices commonly arise in the solution of integral and differential equations, which is known to occur throughout engineering and applied sciences. Another application is within linear predictive coding, where it is used to help fit sampled speech to a model of the vocal tract, for example. Finally, other assorted applications include Leontief economic models, data fitting, least-squares analysis, tomography, interpolation, computer simulations, scientific software packages, etc.
What is common to many of these applications is that a solution is needed as quickly as possible for as large a system as possible. The reason for this is that many of these applications are models or simulations of some real physical system. To represent such a system more accurately, smaller grid-sizes must be used, which means a larger matrix equation to solve. For example, halving the grid-size in a 3-dimensional simulation leads to an eight-fold increase in the number of unknowns.
To meet this need, computers have been made faster mainly by increasing the frequency of the central processor and by changing the architecture. This approach has been successful, but has basically reached an end. The reason for this is that increasing the frequency also increases the power consumption, which increases the temperature of the processor and the likelihood for failure.
Parallel Computing
In discussing the efficacy of certain computational systems and methods, it is necessary first to classify the types of computers available. To begin with, the traditional computer operates through sequential execution of a single instruction stream. This results in data movement from memory to functional unit and back again. In more detail these steps are:                fetching and decoding a scalar instruction;        calculating addresses of the data operands to be used;        fetching operands from memory;        performing the calculation in the functional unit; and        writing the resultant operand back to memory        
In contrast to the sequential execution computer is a parallel execution computer. A parallel computer can be understood as a set of processes that can work together to solve a computational problem.
There are many varieties of parallel computers. Even with a single processor, there are many architectures that can be used to allow concurrent execution, including redundant functional units, pipelining, multiple pipelines, overlapping, RISC (superscalar processor), very long instruction word (VLIW) architectures, vector instructions, chaining, memory-to-memory and register-to-register organizations, register set (vector registers), stripmining, reconfigurable vector registers, various memory organizations (e.g., hierarchical), instruction-level parallelism, and others. Throughout this application, a single processor may be referred to as a “processor module,” or simply a “processor.”
When dealing with multiple processors/processor modules, two of the four Flynn architectures should be mentioned. First, there is SIMD (single instruction stream/multiple data stream), which has multiple processors and parallel memory modules managed by a single control unit. These are also referred to as array processors, of which vector processors are a subclass. In the MIMD (multiple instruction stream/multiple data stream) classification, processors are connected to memory modules via a network. MIMD machines can be further classified according to how the memory modules are organized (i.e., shared or distributed). By comparison, the traditional, sequential execution computer is classified as SISD (single instruction stream/single data stream).
Finally, there are a number of other computers which are hybrids according to the previous classifications. Moreover, memory can be local or global, or a combination of the two. Throughout this application, a single integrated memory unit may be referred to as a “memory module,” or simply as a “memory.” Note that computers can be named differently, while still being a “parallel computer”—examples of this are multi-core, many-core, supercomputers, mini-supercomputers, vector mainframes, novel parallel processors, multiprocessor with a vector processing chipset, etc.
Other aspects of parallel computers not discussed herein are treatments of the address space, communication and connection topologies.
Efficient Computing
During a typical computation on a computer, data flows into and out of registers, and from registers into and out of functional units (e.g. arithmetical logic units), which perform given instructions on the data. This leads to the situation where, especially on hierarchical memory machines, the performance of algorithms is as much determined by the cost of memory traffic as by the cost of arithmetical operations. As a consequence, the strategy has emerged of trying to make maximal use of the data while it is in the registers, before transferring it back to memory.
When a computational problem can be (naturally) cast as operations between matrices, reusing data becomes especially feasible. In that case one or more matrices are partitioned into blocks, and the operations on the original matrices are instead performed on the blocks. This allows parallel computation at two levels (the block level, and within each block). Thus, if a computational problem can be in large part cast as operations between matrices, this blocking technique can be used and the organization of memory can be exploited. This is best done with algorithms that are naturally formulated in terms of matrices (such as those discussed in this application)—otherwise, too much time is lost in rewriting the problem to appear as a matrix computation. For this reason, the new systems and methods introduced herein are ideally suited for more efficient computation on parallel computers.
The efficiency of computations is directly related to another significant issue: heat production. To illustrate the problem, a single query on the Google search engine can require computations that consume 1000 Joules of energy. This is equivalent to a production of 0.2 grams of the greenhouse gas CO2. On the processor level, excessive heat generation has been the root cause of the decline of the use of “frequency scaling” as a means of increasing a CPU's performance. It has been cited as a reason for using computers with a parallel architecture. On the system level, computers are now being manufactured with water-cooling to counter heat generation.
As stated earlier, certain methods are better suited for certain machines (e.g., matrix-based methods for parallel computers). Depending on the method that is used on a given machine, the machine may operate more or less efficiently—as a result it may require varying amounts of energy to operate, and produce different levels of greenhouse gases. Consequently, efficient methods (such as those that have been introduced herein) may be described as means for reducing energy consumption and environmental impact while solving a given task on a given computer.
Vector and Matrix Operations and Solutions
Computing numerical linear algebra on parallel computing devices is made easier by standards such as the Basic Linear Algebra Subprograms (BLAS), which has become the de facto Application Program Interface (API) for libraries that perform basic vector and matrix operations. BLAS is characterized by three levels: level-1 is for certain basic vector operations; level-2 for certain vector and matrix operations; and level-3 for certain basic matrix operations. Normally, higher levels of BLAS run more efficiently. This is due in part to a rewriting of matrix multiplication that can take advantage of a parallel execution.
Many vendors have produced highly optimized implementations of this standard on their machines, and the industry-wide emphasis has been to write programs in terms of BLAS calls. For example, the software library LAPACK uses BLAS, and is used on modern cache-based architectures. BLAS, which is for shared memory systems, has been extended to a parallel version PBLAS, which is meant for distributed memory systems. PBLAS is implemented in ScaLAPACK, for example. Given this situation, the level of description of the flowcharts for the new algorithm will be in terms of vector and matrix operations. Finally, it should be emphasized that researchers in the field of numerical linear algebra have lamented that there is an unfilled need for algorithms that show more parallelizability. To date, the algorithms in high use are basically the original serial algorithms adapted for parallel computing. Because the industry has reached a point where new, versatile algorithms for numerical linear algebra are clearly needed, it becomes necessary to develop all promising candidates.
A solution to Ax=b is defined as the determination of x (and/or related quantities) given A and b. Also, the problem can be cast into the form Ax=b, even though it may initially appear differently; examples include bT=xTA; Ax=x; and x=Ax+b. In these examples and in others, it is possible to rewrite the equations in the form Ax=b.
When the equations in Ax=b are consistent, the solution (x) exists and may be generally expressed as the sum of two terms: x=xp+xh, where xp is the particular solution (satisfying Axp=b), and xh is the homogeneous solution (satisfying Axh=0). Likewise, when AX=B is being solved, X, Xp, and Xh take the place of x, xp and xh. In the case when A is invertible, the solution is unique and equals x=A−1b. When A is not invertible, a generalized inverse G may be computed to play an analogous role to the inverse (i.e., the particular solution may be expressed as xp=Gb). In this case, the homogeneous solution may be expressed as xh=Py, where P is the null space projection operator and y is an arbitrary n-dimensional vector.
The above discussion points out that a solution is made up of the features xp, xh, G, and P, and possibly other related quantities as well. Thus, when a practitioner in the field is seeking a solution, he or she in fact may only be interested in some subset of all possible features. This subset will hereafter be referred to as the “feature set”, and in the next section it will be specified by the user near the beginning of an algorithm (when relevant) to define the objective of the algorithm.
Today, algorithms that solve linear systems of equations may be usefully classified as either iterative or direct (although there are some with features of both). The distinguishing characteristic of iterative algorithms is that they gradually converge on a solution after beginning from some starting condition. Direct methods, on the other hand, provide an answer in an estimable number of steps. Depending on the application, direct or iterative may be preferred. For example, when there is sufficient memory, and solutions for multiple b and dense A are sought, direct methods are generally preferred. However, when the matrix A is sparse, or too large for memory, then iterative methods are normally preferred. However, note that direct methods are often used at some point in the solution of a sparse system. Because certain embodiments introduce a new direct method for dense A, it is the existing set of direct methods (for dense A) that are of interest here.
Traditionally, software has been executed on uniprocessor machines, in which the instructions are executed sequentially. In that setting it was enough to judge them based on the number of flops (floating point operations) used, their stability, and their memory usage (and perhaps secondary criteria such as portability, simplicity, flexibility, etc). However, these metrics are no longer good predictors of performance on modern machines. As discussed above, today's high performance machines often employ vector processors, distributed and hierarchical memory, as well as a number of different connection topologies, and the individual steps of the algorithm must now be re-organized, as much as possible, to be done in parallel and with a minimum of data movement and communication. Thus algorithms are better evaluated in terms of their parallelizability. This may be estimated by metrics such as the speedup, efficiency, isoefficiency, and scalability, as well as “strong” and “weak” scaling. Finally, given that many parallel programs are written in terms of vector and matrix operations (e.g., using BLAS), it becomes reasonable to estimate the performance of an algorithm (in part) by the degree to which it can be naturally cast in terms of matrix and vector operations.
There exist a number of algorithms which provide a direct solution to Ax=b. Judging by their appearance in well-known software libraries (e.g., LAPACK, PLAPACK, and ScaLAPACK), the LU and QR methods are the most useful. There are a number of other methods, and they will now be reviewed. Of note is that many of them (including LU and QR) require a solution by back-substitution, which is not an intrinsically matrix/vector operation. It has however been cast as a BLAS call (because of its high use), although it has only limited parallelism.
The LU method is based on a matrix decomposition of the form A=PLU, which expresses A as a product of a permutation matrix P, a lower triangular (L), and an upper triangular (U) matrix. Once the decomposition is formed, it can be substituted into Ax=b, and x can be found by solving two sets of triangular equations (for y in Ly=PTb, and then for x in Ux=y). The LU method with partial pivoting (e.g., with the use of P as shown above) normally leads to a stable computation. However, this is not provably the case, and examples exist in which the method produces an incorrect result. The LU method seems to have remained in use because of its low flop count and traditional use. The LU algorithm is implemented in a variety of ways: sequentially, in parallel, as right- and left-looking algorithms, and with the Crout and Doolittle algorithms, for example. Algorithms closely related to LU are the Gaussian elimination (GE) and Gauss-Jordan (GJ) methods. The LU is preferred over GE since LU can reuse some of its computations when solving Ax=b for multiple b. Also, LU is preferred over GJ since LU has a lower floating point operation (“flop”) count.
Another oft-used algorithm is the QR method, which employs the factorization A=QR, where Q has orthonormal columns and R is upper triangular. The factorization is used to rewrite Ax=bas Rx=QTb, which is a triangular system of equations that can be solved by substitution. There are other versions of the QR algorithm (e.g., the RQ, QL, and LQ algorithms), which also lead to a triangular system of equations. The factorization is derived by applying a method such as the Gram-Schmidt orthogonalization procedure, Householder reflections, or Givens rotations. There are multiple versions of each of these as well as parallel implementations. Note that when a solution is sought for multiple b, then a system of triangular equations must be solved for each b. Also, when A is not of full rank, then column permutations are normally used. In this situation, other orthogonal factorizations may be used, such as the WZ, UTV or SVD methods; they are relatively slower.
Other methods include Cramer's rule, and the use of the matrix inverse. Both however use a determinant, which involves a much greater number of computations, as well as an increased likelihood for over/underflow. Neither is normally used for large computations. Other methods which are known but not widely used (for varying reasons) include: the Faddeeva algorithm, the quadrant interlocking factorization, the modified parallel Cramer's rule, the (parallelized) bi-directional methods, and a method based on *-semirings. There are numerous other less well-known algorithms that exist but for various reasons are not preferred.
The most straightforward computations of a generalized inverse proceed from a partitioned or factorized matrix (e.g., with QR, LU, or the singular value decomposition). There are a number of other approaches that are based on partitioned matrices. Generalized inverses are classified according to which of the four Penrose conditions they satisfy. The conditions involve the matrix A and its generalized inverse G:
(1)AGA = A(2)GAG = G(3)AG = (AG)*(4)GA = (GA)*where the superscript * indicates the transpose of the complex conjugate. Following convention, a generalized inverse that satisfies, for example, the 1st, 2nd, and 4th condition will be referred to as a {124}-inverse. Different generalized inverses have different properties, making some preferred over others, depending on the application. For example, a {14}-inverse will produce a solution with a minimal norm, and a {13}-inverse will lead to a least-squares solution. Note that {124}-inverses are a type of {14}-inverse, and {123}-inverses are a type of {13}-inverse. Finally, there does not appear to be a simple, direct means of computing a generalized inverse of type {124} or {123}.