The present invention relates generally to a batched Cholesky decomposition method on a graphics processing unit (GPU), and more particularly, but not by way of limitation, to a system, method, and recording medium for combining two symmetric and positive definite (SPD) matrices into one rectangular (or square) matrix to accelerate batched dense Cholesky decomposition on a GPU by solving both symmetric positive definite matrices (e.g., two problems) at the same time.
Rapid evolution of GPUs in performance, architecture, and programmability provides general and scientific computational potential far beyond their primary purpose, graphics processing. Conventionally, Cholesky decomposition has been considered as an algorithm for solving symmetric and positive definite linear systems using the GPU.
Cholesky decomposition is conventionally complex because the process requires three routines (e.g., square rooting, normalizing, and subtracting an inner product or updating a submatrix), the memory access pattern is sub-optimal, and thread divergence is high.
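By way of illustration only, the three routines named above can be sketched in a sequential form. The following is a minimal sketch, assuming a textbook right-looking formulation on a single small dense matrix. It is not the GPU method of the related art or of the present invention, and the function name `cholesky_right_looking` and the sample matrix are hypothetical choices made for this example.

```python
import math


def cholesky_right_looking(a):
    """Return lower-triangular L with L * L^T == a.

    `a` is an SPD matrix given as a list of row lists. This is a
    sequential textbook sketch (assumed formulation), not a GPU kernel.
    """
    n = len(a)
    # Copy the lower triangle so the caller's matrix is left untouched.
    l = [[a[i][j] if j <= i else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n):
        # Routine 1: square root of the diagonal (pivot) element.
        if l[k][k] <= 0.0:
            raise ValueError("matrix is not positive definite")
        l[k][k] = math.sqrt(l[k][k])
        # Routine 2: normalize the column below the pivot.
        for i in range(k + 1, n):
            l[i][k] /= l[k][k]
        # Routine 3: update the trailing submatrix (lower triangle only).
        for j in range(k + 1, n):
            for i in range(j, n):
                l[i][j] -= l[i][k] * l[j][k]
    return l


# Hypothetical 3x3 SPD input chosen for illustration.
a = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
L = cholesky_right_looking(a)
```

In this sketch each pass of the outer loop runs the three routines in order, and each routine depends on the result of the one before it. That dependency is one plausible reading of why a parallel version needs synchronization between the routines.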
Conventional techniques have attempted to improve Cholesky decomposition by, for example, forward and backward substitution, which can be used for various purposes such as equalization, filtering data, and reconstructing data. Such techniques consider a way to speed up Cholesky decomposition by proposing a Single-Instruction-Multiple-Data (SIMD)-like special functionality, which requires a new type of hardware or a modification to existing hardware, and do not consider a batched problem.
FIG. 2 exemplarily shows a Cholesky decomposition of the related art. As shown, updating a global memory ‘B’ is a problem because global memory is not as efficient for processing as a shared memory ‘A’ (i.e., on-chip, etc.) and because there are fewer valid elements as the steps go from “step 0” to “step i”. Also, there is thread divergence because whether an update (or no update) is performed is based on the step. Further, there is an issue of load balancing, in that the thread nearest “X” would do nothing after “step i” is complete and waits until “step 0” completes. This leads to synchronization strain (e.g., for every computation, three synchronizations are needed).
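The shrinking number of valid elements described above can be made concrete with a short count. The following is a rough sketch under stated assumptions: a right-looking decomposition of one n-by-n matrix, with one thread assigned to each lower-triangular element. This mapping is an assumption made for illustration and is not taken from FIG. 2, and the function name `active_updates_per_step` is hypothetical.

```python
def active_updates_per_step(n):
    """Count lower-triangle elements updated in the trailing submatrix
    at each step k of a right-looking decomposition (assumed model)."""
    counts = []
    for k in range(n):
        m = n - k - 1  # order of the trailing submatrix at step k
        counts.append(m * (m + 1) // 2)
    return counts


n = 4
threads = n * (n + 1) // 2          # one thread per lower-triangle element
counts = active_updates_per_step(n)  # elements updated at steps 0..n-1
idle = [threads - c for c in counts]  # threads with no update to perform
```

Under these assumptions, the number of threads with an update to perform falls at every step while the number of idle threads rises, which is one way to picture both the thread divergence and the load imbalance noted above.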