Microprocessors, commonly referred to as processors, serve as the central processing unit of a computer and were originally developed with only one core. Multicore processors were developed in the early 2000s and may have two, four, six, eight, ten, or more cores. A multicore processor implements multiprocessing in a single physical package. Its cores may be tightly or loosely coupled: they may or may not share caches, they may communicate by message passing or by shared memory, and they are generally interconnected using bus, ring, two-dimensional mesh, or crossbar architectures. However, the performance improvement gained from a multicore processor depends on the software algorithms used and their implementation, with the possible gains limited by the fraction of the software that can be run in parallel on the multiple cores. In the best case, so-called embarrassingly parallel problems may realize speedup factors near the number of cores, or even more if the problem is split up enough to fit within each core's cache(s), avoiding use of much slower main system memory. Most applications, however, are not accelerated as much unless programmers invest a prohibitive amount of effort in re-factoring the whole problem.
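The limit described above, in which the achievable gain is bounded by the fraction of the software that can run in parallel, is commonly expressed as Amdahl's law. The following is a minimal illustrative sketch only; the function name `amdahl_speedup` and its parameters are assumptions introduced here, and the model deliberately ignores the cache effects that can yield greater-than-linear speedups.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup under Amdahl's law (illustrative model, not a measurement).

    parallel_fraction: share of the run time that can be parallelised (0..1).
    cores: number of cores the parallel part is spread over (>= 1).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is unchanged; only the parallel part shrinks with cores.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    for fraction in (0.5, 0.9, 1.0):
        print(fraction, amdahl_speedup(fraction, 8))
```

Under this model, a fully parallel workload on eight cores approaches a speedup of eight, whereas a workload that is only half parallel cannot exceed a speedup of two however many cores are added.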
Such an issue arises with Finite Element Method (FEM) software performing finite element analysis (FEA), wherein mesh generation techniques divide a complex problem into small elements which, together with the material properties and underlying physics, reduce the problem to solving a series of algebraic equations for steady state problems or ordinary differential equations for transient problems. However, conventional prior art FEM software relies upon performing global and sparse algebraic operations that severely limit its parallel performance. Within the prior art, re-factoring efforts have focused on improving the performance of conventional sparse computations, at the cost of requiring sophisticated programming techniques tailored to specific CPU hardware architectures, such as cache access optimizations, data-structures, and code transformations, such that code portability and reusability are limited. For example, implementations of Conjugate Gradient (CG) solvers for FEM problems require global sparse operations which perform at a low fraction of the peak CPU computational throughput. Further, accelerating CG solvers on parallel architectures is communication limited, which has given rise to a subset of prior art attempting to reduce the communication overhead of such sparse solvers through reformulation, namely communication reducing schemes, which typically suffer from numerical instability and limited support for pre-conditioners. These performance bottlenecks are even more pronounced in high accuracy FEM analysis, as the increased number of elements yields a large number of unknowns, on the order of millions or more, which prevents FEM software users from productively utilizing parallel multicore high performance computing platforms.
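To make the global sparse operations of a CG solver concrete, the following is a minimal, unpreconditioned sketch in plain Python. It is illustrative only and is not drawn from any particular prior art package; the helper names `csr_matvec`, `conjugate_gradient`, and `poisson_1d_csr`, and the one-dimensional test matrix, are assumptions introduced here. The comments mark the steps that involve every unknown in each iteration.

```python
def csr_matvec(data, indices, indptr, x):
    """Sparse matrix-vector product y = A x for A in compressed sparse row form."""
    y = []
    for row in range(len(indptr) - 1):
        acc = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            # Indirect, irregular reads of x through the column index array.
            acc += data[k] * x[indices[k]]
        y.append(acc)
    return y


def dot(u, v):
    """Inner product; on a parallel machine this is a global reduction."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(data, indices, indptr, b, tol=1e-10, max_iter=1000):
    """Unpreconditioned CG for a symmetric positive definite CSR matrix."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x for the zero initial guess
    p = list(r)          # initial search direction
    rs_old = dot(r, r)   # global reduction
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = csr_matvec(data, indices, indptr, p)   # global sparse operation
        alpha = rs_old / dot(p, Ap)                 # global reduction
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)                          # global reduction
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def poisson_1d_csr(n):
    """Assemble tridiag(-1, 2, -1) in CSR form as a small stand-in FEM matrix."""
    data, indices, indptr = [], [], [0]
    for i in range(n):
        for j, value in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
            if 0 <= j < n:
                data.append(value)
                indices.append(j)
        indptr.append(len(data))
    return data, indices, indptr


if __name__ == "__main__":
    data, indices, indptr = poisson_1d_csr(20)
    solution = conjugate_gradient(data, indices, indptr, [1.0] * 20)
    print(solution[:3])
```

Each iteration performs one sparse matrix-vector product and two inner products, and each of these touches the whole vector of unknowns. This is the global, sequential structure referred to above.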
Whilst prior art generic and optimized FEM libraries such as deal.II, GetFEM++, and Trilinos are useable for sparse FEM computations, obtaining sustained performance can be difficult due to the varying sparsity structure across different application areas. Further, such libraries do not help with the costly stage of assembling the sparse matrix from the generated elements. Whilst a matrix free (MF) approach to executing the sparse matrix-vector multiply (SMVM) kernel in the CG solver has been reported within the prior art and shows promising speedups, it does not depart from the sequential global algebraic setup of the CG solver and may only be efficient for high order elements.
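By way of illustration, a matrix free product can be formed by looping over the elements and applying each local element matrix directly, so that no global sparse matrix is ever assembled. The following minimal sketch assumes a uniform one-dimensional mesh of linear elements with local stiffness (1/h)[[1, -1], [-1, 1]] and no boundary conditions applied; the function name `matrix_free_matvec` is hypothetical and the example is not taken from any of the libraries named above.

```python
def matrix_free_matvec(x, h=1.0):
    """Compute y = K x element by element, without assembling K.

    Assumes 1D linear elements of equal length h, where element e joins
    node e to node e + 1 (an illustrative mesh, not a general one).
    """
    k = 1.0 / h
    y = [0.0] * len(x)
    for e in range(len(x) - 1):
        left, right = e, e + 1
        # Apply the local 2x2 stiffness (1/h) * [[1, -1], [-1, 1]] to the
        # element's two nodal values and scatter-add into the global result.
        y[left] += k * (x[left] - x[right])
        y[right] += k * (x[right] - x[left])
    return y


if __name__ == "__main__":
    print(matrix_free_matvec([0.0, 1.0, 4.0, 9.0, 16.0]))
```

The product itself is local to each element, yet in a CG solver it would still sit between the same global inner products, which is why the approach does not by itself remove the sequential algebraic structure.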
Accordingly, it would be beneficial to reformulate the FEM problem such that the message passing issue is addressed directly rather than seeking solutions that avoid message passing and communications. To this end, the inventors have established a novel distributed FEM reformulation using belief propagation (BP) that eliminates the dependency on any sparse data-structures or algebraic operations, hence attacking a root cause of the problem. Belief propagation, strictly the belief propagation algorithm, is a message passing algorithm based upon graphical models that efficiently computes the marginal distribution of each variable node by recursively sharing intermediate results. BP has demonstrated excellent empirical results in other applications, such as machine learning, channel decoding, and computer vision. A Gaussian BP algorithm proposed by Shental et al. in “Gaussian Belief Propagation Solver for Systems of Linear Equations” (IEEE Int. Symp. on Inform. Theory (ISIT), 2008, pp. 1863-1867) operates as a parallel solver for a linear system of equations by modeling it as a pairwise graphical model. Whilst showing promise for highly parallel computations on diagonally dominant matrices, the Gaussian BP does not scale for large FEM matrices and also fails to converge for high order FEM problems. Significantly, such a solver still requires assembling a large sparse data-structure.
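The pairwise Gaussian BP solver discussed above can be illustrated with the following minimal sketch, in which each unknown is a node and each nonzero off-diagonal entry of the matrix is an edge carrying a Gaussian message. This is a sketch of the prior art pairwise scheme only, not of the inventors' algorithms. The function name `gabp_solve`, the dense list-of-lists storage, the synchronous update schedule, and the information-form messages (a precision and a potential per edge) are assumptions made here for brevity, and the matrix is assumed symmetric with a nonzero diagonal.

```python
def gabp_solve(A, b, tol=1e-12, max_iter=500):
    """Approximate the solution of A x = b by pairwise Gaussian BP.

    Convergence is expected for diagonally dominant A but is not
    guaranteed in general.
    """
    n = len(b)
    # Graph edges follow the nonzero off-diagonal pattern of A.
    nbrs = [[j for j in range(n) if j != i and A[i][j] != 0.0]
            for i in range(n)]
    # Message i -> j: precision P[(i, j)] and potential h[(i, j)].
    P = {(i, j): 0.0 for i in range(n) for j in nbrs[i]}
    h = dict.fromkeys(P, 0.0)
    x = [b[i] / A[i][i] for i in range(n)]
    for _ in range(max_iter):
        new_P, new_h = {}, {}
        for i in range(n):
            # Node belief from the local term plus all incoming messages.
            P_i = A[i][i] + sum(P[(k, i)] for k in nbrs[i])
            h_i = b[i] + sum(h[(k, i)] for k in nbrs[i])
            for j in nbrs[i]:
                # Exclude what j itself sent to i before replying to j.
                P_excl = P_i - P[(j, i)]
                h_excl = h_i - h[(j, i)]
                new_P[(i, j)] = -A[i][j] * A[j][i] / P_excl
                new_h[(i, j)] = -A[i][j] * h_excl / P_excl
        P, h = new_P, new_h
        # Marginal mean of each node is the current estimate of x_i.
        new_x = [(b[i] + sum(h[(k, i)] for k in nbrs[i])) /
                 (A[i][i] + sum(P[(k, i)] for k in nbrs[i]))
                 for i in range(n)]
        change = max(abs(u - v) for u, v in zip(new_x, x))
        x = new_x
        if change < tol:
            break
    return x


if __name__ == "__main__":
    A = [[4.0, 1.0, 1.0], [1.0, 4.0, 1.0], [1.0, 1.0, 4.0]]
    print(gabp_solve(A, [1.0, 2.0, 3.0]))
```

For the strictly diagonally dominant example shown, the exact solution is x = (0, 1/3, 2/3), which the sketch is expected to approach. Note that the matrix A must already exist in assembled form before any message can be computed.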
In contrast, the Finite Element Gaussian Belief Propagation (FGaBP) algorithm and its multigrid variant, the FMGaBP algorithm, introduced by the inventors are distributed reformulations of the FEM that result in highly efficient parallel implementations. The algorithms according to embodiments of the invention beneficially provide a highly parallel approach to processing the FEM problem, element-by-element, based on distributed message communications and localized computations. This provides an algorithm amenable to different parallel computing architectures such as multicore CPUs, manycore GPUs, and clusters of both.
Other aspects and features of the present invention will become apparent to those ordinarily skilled in the art upon review of the following description of specific embodiments of the invention in conjunction with the accompanying figures.